Today we’re going to be looking into Whisper and Python for video transcription. Whisper is a library that has been around for a while, a while being two years. Despite its 2022 debut, this “vintage” marvel still stands tall in the audio transcription arena, able to listen to audio and convert it into a full transcript. Of course, then 2023 happened and the spotlight passed to the new kid, hi there ChatGPT.  

So why are we coming back around to this now two-year-old model? I was recently exploring the idea of reviewing video transcripts, and while there are options available if your video is already on YouTube and transcribed, that wasn’t my situation. I was dealing with a lot of stored video files, mostly in MP4 format, and wanted a way to extract summaries of their content. My search led me to Whisper as an option for local video transcription that works with minimal effort. Let’s take a look.  

What is Whisper?

So let’s go a little further into exactly what Whisper is. It was designed by OpenAI for transcription in multiple languages and for translation from those languages into English. Much like its siblings at OpenAI, Whisper is a transformer; specifically, an encoder-decoder transformer model. The tasks on which the model was trained are:

  • Speech Recognition
  • Speech Translation
  • Spoken Language Detection
  • Voice Activity Detection

Each of these tasks is represented as a sequence of tokens to be predicted by the decoder. What we get is a single model which can efficiently replace multiple stages of a traditional speech processing pipeline. 

Image source: OpenAI GitHub
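To see what “multiple tasks, one model” looks like in practice, here is a minimal sketch using the lower-level API exposed by the openai-whisper package, running language detection and transcription on a single 30-second chunk. The file name audio.wav is just a placeholder, and this assumes a multilingual checkpoint such as base:

import whisper

# Load a multilingual model (the ".en" variants skip language detection)
model = whisper.load_model("base")

# Load the audio and pad/trim it to the 30-second window the model expects
# ("audio.wav" is a placeholder file name)
audio = whisper.load_audio("audio.wav")
audio = whisper.pad_or_trim(audio)

# Compute the log-Mel spectrogram that the encoder consumes
mel = whisper.log_mel_spectrogram(audio).to(model.device)

# Task 1: spoken language detection
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# Task 2: speech recognition on the same chunk, with the same model
options = whisper.DecodingOptions()
result = whisper.decode(model, mel, options)
print(result.text)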

Whisper was trained on a large and diverse training set of 680k hours of audio across multiple languages, with one third of the training data being non-English. This diverse training data gives Whisper a distinct advantage on a range of speech recognition and detection problems. In fact, the model was also trained to translate other languages into English and performs quite well at this task. 

Now that we have a better understanding of what the Whisper model is, let’s jump into an example and see it in action. 

Simple Example

For the purpose of this example we are going to work with an English-language video recorded on my laptop: a quick-start guide on using Poetry that I put together while playing around. Now that we know what we are solving for, let’s set up the environment. 

Environment

Whisper requires ffmpeg, a CLI tool, to be installed on the system where the application will be run. Below are a variety of ways that you can install this requirement in your environment:

  • Ubuntu or Debian: sudo apt update && sudo apt install ffmpeg
  • Arch Linux: sudo pacman -S ffmpeg
  • MacOS: using Homebrew (https://brew.sh/) – brew install ffmpeg
  • Windows:  using Chocolatey (https://chocolatey.org/) – choco install ffmpeg
  • Windows: using Scoop (https://scoop.sh/) – scoop install ffmpeg

With that system dependency taken care of, let’s create our new Python app. Create a new directory, navigate to it, and create a new virtual environment to manage the application’s dependencies. Activate the environment and install the dependencies; example commands follow the package list below. We will be using two packages for this tutorial:

  1. moviepy (https://pypi.org/project/moviepy/) – This package will be used to extract the audio from the video file. moviepy is a general-purpose video editing library for Python that wraps ffmpeg under the hood.
  2. openai-whisper (https://pypi.org/project/openai-whisper/) – The main package used to download and run the various Whisper models.
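If you want to follow along, the setup steps above might look something like this on macOS or Linux (a sketch; the directory name is arbitrary, and on Windows the activation command is .venv\Scripts\activate instead):

mkdir whisper-transcribe && cd whisper-transcribe
python -m venv .venv
source .venv/bin/activate
pip install moviepy openai-whisper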

Available Models

Whisper has a number of models, each trained with an increasing number of parameters. Before we move into the code to use Whisper, let’s take a moment to understand what is available to you. 

The table below shows these models. Note that when no model is specified the base model is chosen by default. 

Size     Parameters   English-only model   Multilingual model   Required VRAM   Relative speed
tiny     39 M         tiny.en              tiny                 ~1 GB           ~32x
base     74 M         base.en              base                 ~1 GB           ~16x
small    244 M        small.en             small                ~2 GB           ~6x
medium   769 M        medium.en            medium               ~5 GB           ~2x
large    1550 M       N/A                  large                ~10 GB          1x

For our purposes we are sticking with English and running this on a local machine, so we will choose the base.en model. In their documentation, OpenAI notes that the English-only .en models tend to perform better, especially for the tiny.en and base.en sizes, with the difference becoming less significant for small.en and medium.en. This makes sense: the model has a more focused outcome, and English audio makes up roughly two thirds of the training set. If your use case requires transcribing or translating non-English audio, keep in mind that performance is not uniform across all languages.

You can also get specific and tell Whisper what language the audio is in, rather than relying on automatic detection, by setting the --language flag accordingly. Example:

whisper japanese.wav --language Japanese
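The same options are exposed through the Python API. As a small sketch (assuming a file named japanese.wav), you can pass language and task to the transcribe method, where task="translate" asks the model to translate the speech into English:

import whisper

model = whisper.load_model("base")

# Transcribe in the original language
result = model.transcribe("japanese.wav", language="Japanese")
print(result["text"])

# Or translate the speech into English instead
result = model.transcribe("japanese.wav", language="Japanese", task="translate")
print(result["text"])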

Extracting Audio

Now that we have a better understanding of the model options let’s get into an example. For this task we are going to use a local MP4 file, extract the audio from it and then use Whisper to generate the transcription. 

We will use the moviepy package to extract the audio. Create a new file called transcribe.py and add the following code:

import moviepy.editor as mp
import os


# Set up some base values for the module
BASE_PATH = os.getcwd()
AUDIO_BASE = f"{BASE_PATH}/audio"
VIDEO_BASE = f"{BASE_PATH}/video"


def extract_audio_to_file(video_path, audio_path):
    """
    Uses the moviepy package to extract and write
    audio content to a new file
    """
    # Load the video from file
    video = mp.VideoFileClip(video_path)


    # Extract the audio file from the video.
    # The codec is chosen to be a compatible format for Whisper
    video.audio.write_audiofile(audio_path, codec='pcm_s16le')


if __name__ == "__main__":
    filename = "test"
    audio_path = f"{AUDIO_BASE}/{filename}.wav"
    video_path = f"{VIDEO_BASE}/{filename}.mp4"
    extract_audio_to_file(video_path, audio_path)

If you place an MP4 into the relevant directory and run the file, you should end up with an extracted audio file for that video. Most of the above is pretty straightforward. What is interesting is the choice of codec: you must specify a format which Whisper can understand, in this case pcm_s16le (16-bit PCM). You can read more about the different codecs on ffmpeg’s website here (https://trac.ffmpeg.org/wiki/audio%20types).

Transcription

We’re nearly there: we now have an audio file for the video. Next is to have Whisper perform the transcription of that audio file. This is actually quite simple. First we need to specify which model we want to use (refer to the table above). With your choice made, simply pass the audio file to the transcribe method and read the results like so (this assumes whisper is imported at the top of transcribe.py, as in the full listing further down). 

def video_to_transcript_with_whisper(video_path, audio_path):
    extract_audio_to_file(video_path, audio_path)
   
    # First grab the relevant model for the task at hand
    model = whisper.load_model("base.en")  
   
    # Transcribe the audio file using the selected model
    result = model.transcribe(audio_path)
   
    return result["text"]

Whisper operates locally on your machine or environment, which opens up more possibilities. The first time you run this it will download the selected model; this may take some time depending on the model you choose and your network connection. Once downloaded, Whisper will work through your audio file and produce a very good transcript of the audio. 
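As a side note, the downloaded weights are cached locally (by default under ~/.cache/whisper), so subsequent runs load from disk. If you want more control, load_model also accepts a device and a download location; a small sketch, assuming the current openai-whisper package:

import whisper

# Force CPU inference and keep the model weights in a project-local folder
model = whisper.load_model("base.en", device="cpu", download_root="./models")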

Bringing it all together

With all these pieces we now have a working example of local video transcription. When run end to end it will convert a video to an audio file and then transcribe the content of that file, allowing you to transform video to text locally. 

import moviepy.editor as mp
import whisper
import os

BASE_PATH = os.getcwd()
AUDIO_BASE = f"{BASE_PATH}/audio"
VIDEO_BASE = f"{BASE_PATH}/video"

def extract_audio_to_file(video_path, audio_path):
    """
    Uses the moviepy package to extract and write
    audio content to a new file
    """
    # Load the video from file
    video = mp.VideoFileClip(video_path)

    # Extract the audio file from the video.
    # The codec is chosen to be a compatible format for Whisper
    video.audio.write_audiofile(audio_path, codec='pcm_s16le')

def video_to_transcript_with_whisper(video_path, audio_path):
    extract_audio_to_file(video_path, audio_path)
   
    # First grab the relevant model for the task at hand
    model = whisper.load_model("base.en")  
   
    # Transcribe the audio file using the selected model
    result = model.transcribe(audio_path)
   
    return result["text"]

if __name__ == "__main__":
    filename = "test"
    audio_path = f"{AUDIO_BASE}/{filename}.wav"
    video_path = f"{VIDEO_BASE}/{filename}.mp4"
    transcript = video_to_transcript_with_whisper(video_path, audio_path)
    print(transcript)
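Since my original motivation was a folder full of stored videos, here is a small, hypothetical extension you could add to the bottom of transcribe.py that loops over every MP4 in the video directory and writes each transcript to a text file. The file layout is an assumption, not part of the example above:

from pathlib import Path

# Hypothetical batch run: reuse the functions above for every MP4 found
for video_file in Path(VIDEO_BASE).glob("*.mp4"):
    audio_file = f"{AUDIO_BASE}/{video_file.stem}.wav"
    text = video_to_transcript_with_whisper(str(video_file), audio_file)
    Path(f"{BASE_PATH}/{video_file.stem}.txt").write_text(text)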

Wrapping Up

That’s Whisper, from its grand entrance onto the tech stage in 2022, right down to the nitty-gritty of turning your dusty old videos into shiny transcripts. With its multilingual prowess and cost-free charm, it’s a powerful model serving a very practical purpose.

What are you waiting for? Grab that video file, let Whisper work its magic, and transform the spoken word into written treasure. After all, in the realm of digital alchemy, you’re only a few commands away from gold.

Happy transcribing!
