Fahd Mirza - getting Whisper working on Google Colab
https://www.youtube.com/watch?v=0Rdf2XA9G9Y Real-time ASR (automatic speech recognition)
This video provides a comprehensive guide to near-real-time live transcription using OpenAI's whisper-large-v3-turbo model within a free Google Colab environment.
The speaker begins by introducing the whisper-large-v3-turbo model as a state-of-the-art solution for Automatic Speech Recognition (ASR) and speech translation. He clarifies that this model is a fine-tuned, pruned version of the original whisper-large-v3, optimized for speed by reducing the number of decoder layers from 32 to 4. This modification yields significantly faster inference with only a "very slight minor quality degradation" that he considers largely negligible. The model's robustness stems from its training on approximately 5 million hours of labeled data, enabling it to generalize effectively across diverse datasets and domains in a zero-shot setting.
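The layer counts are easy to check against the published checkpoints. A minimal sketch (my own addition, not shown in the video) that reads both model configs from the Hugging Face Hub:

```python
from transformers import AutoConfig

# Pull each checkpoint's config from the Hugging Face Hub and compare decoder depths.
turbo = AutoConfig.from_pretrained("openai/whisper-large-v3-turbo")
full = AutoConfig.from_pretrained("openai/whisper-large-v3")

print(turbo.decoder_layers)  # 4  (pruned turbo variant)
print(full.decoder_layers)   # 32 (original large-v3)
```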
The video then briefly highlights its sponsor, AgentQL, a structured query language designed for “painless data extraction and web automation.” AgentQL utilizes AI to analyze page structures, providing a resilient and reusable alternative to traditional web scraping methods like XPath and DOM selectors.
For the practical demonstration, the speaker guides viewers through setting up a Google Colab notebook:
- Environment Setup: He recommends selecting a free T4 GPU in the Colab runtime settings to accelerate processing, expressing appreciation for Google’s provision of this resource.
- Library Installation: The necessary
transformersandgradiolibraries are installed directly from their Hugging Face GitHub repositories. - Model Loading: The
whisper-large-v3-turbomodel is loaded using thepipelinefunction, specifying the ‘automatic-speech-recognition’ task and the model checkpoint. The model, which is roughly 1.62 GB, along with its associated configuration, tokenizer, and preprocessor files, are automatically downloaded and loaded onto the GPU. - Transcription Testing: Remote File Transcription: The model’s immediate capability is tested by providing a URL to a sample FLAC audio file. The transcription appears rapidly and accurately. Local File Transcription: The speaker demonstrates uploading a local WAV audio file (a segment of a JFK speech) and successfully transcribing it, confirming the model’s ability to handle locally stored audio.
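The video walks through these cells on screen rather than publishing a script; here is a minimal sketch of the same flow, assuming the transformers pipeline API described above. The audio URL and local filename are placeholders, not the files used in the video:

```python
# In a Colab cell, install the libraries first. The video installs both from
# their GitHub repositories; stable PyPI releases also work:
#   !pip install --upgrade transformers gradio

import torch
from transformers import pipeline

# Use the free T4 GPU if the Colab runtime provides one.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# Loading the pipeline downloads the ~1.62 GB checkpoint plus its config,
# tokenizer, and preprocessor files, and places the model on the device.
asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3-turbo",
    torch_dtype=torch.float16,
    device=device,
)

# Remote file: the pipeline accepts a URL and fetches the audio itself
# (placeholder URL; the video points at a sample FLAC file).
print(asr("https://example.com/sample.flac")["text"])

# Local file: upload through the Colab file browser, then pass the path
# (placeholder name; the video uses a WAV segment of a JFK speech).
print(asr("jfk.wav")["text"])
```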
Finally, the core of the video demonstrates real-time live transcription using a Gradio interface:
- A Python script is presented that integrates the loaded whisper-large-v3-turbo model with Gradio. This interface allows users to record audio directly from their microphone or upload a file (a sketch in the same spirit follows this list).
- Upon launching the Gradio demo, a public URL is generated. When accessed, the interface provides a "Record" button, a waveform display, and an "Output" text area for the transcription.
- During a live recording, the transcribed text appears almost immediately, segment by segment. The interface also displays the latency: about 5 seconds to the first transcribed segment (the "first token"), dropping to around 0.2-0.3 seconds for subsequent processing, which feels very close to real-time. The speaker highlights this as impressive performance for a free Colab environment, noting the model's accuracy and even its ability to correctly transcribe some non-English words.
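The script itself is not reproduced in this summary, but a minimal Gradio sketch in the same spirit, reusing the asr pipeline loaded above (the interface labels are illustrative, not necessarily the video's):

```python
import gradio as gr

def transcribe(audio_path):
    # Gradio passes the recorded or uploaded audio as a temporary file path
    # when the Audio component is configured with type="filepath".
    if audio_path is None:
        return ""
    return asr(audio_path)["text"]

demo = gr.Interface(
    fn=transcribe,
    inputs=gr.Audio(sources=["microphone", "upload"], type="filepath"),
    outputs=gr.Textbox(label="Output"),
    title="whisper-large-v3-turbo transcription",
)

# share=True is what produces the public URL mentioned above.
demo.launch(share=True)
```

This version transcribes each completed recording in one shot; the segment-by-segment behaviour shown in the video maps to Gradio's streaming audio support (streaming=True on the Audio input with a stateful callback), which takes a little more plumbing.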
The video concludes by encouraging viewers to subscribe to the channel and share the content if they found it useful, emphasizing the accessibility of such powerful AI models.