Whisper is a state-of-the-art automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. The size and diversity of this dataset allow Whisper to approach human-level robustness and accuracy in English speech recognition.

Background and Development

Whisper was developed to improve robustness to accents, background noise, and technical language. Training on a large and diverse dataset serves that goal and also enables transcription in multiple languages, as well as translation from those languages into English.

Core Features and Capabilities

The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. A single model handles language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
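
One way to picture how a single encoder-decoder model can cover several tasks is through the special tokens placed at the start of the decoder's input. The sketch below is illustrative only: it assembles such a prefix as plain strings and does not load or run a model. The token spellings follow the open-source Whisper tokenizer, while the function name build_decoder_prompt and its parameters are hypothetical names introduced for this example.

```python
from typing import List

# Special-token strings as spelled in the open-source Whisper tokenizer.
START = "<|startoftranscript|>"
NO_TIMESTAMPS = "<|notimestamps|>"
TASKS = ("transcribe", "translate")


def build_decoder_prompt(language: str, task: str = "transcribe",
                         timestamps: bool = True) -> List[str]:
    """Return the special tokens that tell the decoder what to do.

    Hypothetical helper for illustration; not part of any Whisper library.
    language: a language code such as "en" or "fr" (not validated here).
    task: "transcribe" keeps the spoken language, "translate" targets English.
    timestamps: when False, append the token that disables timestamp output.
    """
    if task not in TASKS:
        raise ValueError(f"task must be one of {TASKS}, got {task!r}")
    prompt = [START, f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        prompt.append(NO_TIMESTAMPS)
    return prompt


# French speech, translated into English, with timestamps left enabled.
print(build_decoder_prompt("fr", task="translate"))
```

Under this scheme, switching from transcription to translation only changes one token in the prefix rather than the model itself, and language identification roughly amounts to letting the model predict the language token instead of supplying it.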

