[Paper] [Models] [Colab] [Training Code]
Distil-Whisper is a distilled version of Whisper for English speech recognition that is 6 times faster, 49% smaller, and performs within 1% word error rate (WER) on out-of-distribution evaluation sets:
| Model | Params / M | Rel. Latency ↑ | Short-Form WER ↓ | Long-Form WER ↓ |
|---|---|---|---|---|
| large-v3 | 1550 | 1.0 | 8.4 | 11.0 |
| distil-large-v3 | 756 | 6.3 | 9.7 | 10.8 |
| distil-large-v2 | 756 | 5.8 | 10.1 | 11.6 |
| distil-medium.en | 394 | 6.8 | 11.1 | 12.4 |
| distil-small.en | 166 | 5.6 | 12.1 | 12.8 |
For most applications, we recommend the latest distil-large-v3 checkpoint, since it is the most performant distilled checkpoint and compatible across all Whisper libraries. The only exception is resource-constrained applications with very little memory, such as on-device or mobile applications, where the distil-small.en is a great choice, since it is only 166M parameters and performs within 4% WER of Whisper large-v3.
[!NOTE]
Distil-Whisper is only available for English speech recognition. For multilingual speech recognition, we recommend using the Whisper Turbo checkpoint, which was released by OpenAI and leverages the same principles as Distil-Whisper. For details, refer to the Whisper turbo release statement.
Distil-Whisper is supported in Hugging Face 🤗 Transformers from version 4.35 onwards. To run the model, first install the latest version of the Transformers library. For this example, we'll also install 🤗 Datasets to load a toy audio dataset from the Hugging Face Hub:
pip install --upgrade pip
pip install --upgrade transformers accelerate datasets[audio]
Short-form transcription is the process of transcribing audio samples that are less than 30-seconds long, which is the maximum receptive field of the Whisper models. This means the entire audio clip can be processed in one go without the need for chunking.
First, we load Distil-Whisper via the convenient AutoModelForSpeechSeq2Seq and AutoProcessor classes.
We load the model in float16 precision and make sure that loading time takes as little time as possible by passing low_cpu_mem_usage=True.
In addition, we want to make sure that the model is loaded in safetensors format by passing use_safetensors=True:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
The model and processor can then be passed to the pipeline.
Note that if you would like to have more control over the generation process, you can directly make use of model + processor API as shown below.
pipe = pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
max_new_tokens=128,
torch_dtype=torch_dtype,
device=device,
)
Next, we load an example short-form audio from the LibriSpeech corpus:
from datasets import load_dataset
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = dataset[0]["audio"]
Finally, we can call the pipeline to transcribe the audio:
result = pipe(sample)
print(result["text"])
To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
result = pipe("audio.mp3")
print(result["text"])
For more information on how to customize the automatic speech recognition pipeline, please refer to the ASR pipeline docs. We also provide an end-to-end Google Colab that benchmarks Whisper against Distil-Whisper.
For more control over the generation parameters, use the model + processor API directly:
Ad-hoc generation arguments can be passed to model.generate, including num_beams for beam-search, return_timestamps
for segment-level timestamps, and prompt_ids for prompting. See the docstrings
for more details.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]
input_features = processor(
sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features
input_features = input_features.to(device, dtype=torch_dtype)
gen_kwargs = {
"max_new_tokens": 128,
"num_beams": 1,
"return_timestamps": False,
}
pred_ids = model.generate(input_features, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=gen_kwargs["return_timestamps"])
print(pred_text)
The latest distil-large-v3 checkpoint is specifically designed to be compatible with OpenAI's sequential long-form transcription algorithm. This algorithm uses a sliding window for buffered inference of long audio files (> 30-seconds), and returns more accurate transcriptions compared to the chunked long-form algorithm.
The sequential long-form algorithm should be used in either of the following scenarios: 1. Transcription accuracy is the most important factor, and latency is less of a consideration 2. You are transcribing batches of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate
If you are transcribing single long audio files and latency is the most important factor, you should use the chunked algorithm described below. For a detailed explanation of the different algorithms, refer to Sections 5 of the Distil-Whisper paper.
We start by loading the model and processor as before:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
The model and processor can then be passed to the pipeline.
Note that if you would like to have more control over the generation process, you can directly make use of model.generate(...) API as shown below.
pipe = pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
max_new_tokens=128,
torch_dtype=torch_dtype,
device=device,
)
Next, we load a long-form audio sample. Here, we use an example of concatenated samples from the LibriSpeech corpus:
from datasets import load_dataset
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
Finally, we can call the pipeline to transcribe the audio:
result = pipe(sample)
print(result["text"])
To transcribe a local audio file, simply pass the path to your audio file when you call the pipeline:
result = pipe("audio.mp3")
print(result["text"])
For more control over the generation parameters, use the model + processor API directly:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from datasets import Audio, load_dataset
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
dataset = dataset.cast_column("audio", Audio(processor.feature_extractor.sampling_rate))
sample = dataset[0]["audio"]
inputs = processor(
sample["array"],
sampling_rate=sample["sampling_rate"],
return_tensors="pt",
truncation=False,
padding="longest",
return_attention_mask=True,
)
inputs = inputs.to(device, dtype=torch_dtype)
gen_kwargs = {
"max_new_tokens": 448,
"num_beams": 1,
"condition_on_prev_tokens": False,
"compression_ratio_threshold": 1.35, # zlib compression ratio threshold (in token space)
"temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
"logprob_threshold": -1.0,
"no_speech_threshold": 0.6,
"return_timestamps": True,
}
pred_ids = model.generate(**inputs, **gen_kwargs)
pred_text = processor.batch_decode(pred_ids, skip_special_tokens=True, decode_with_timestamps=False)
print(pred_text)
distil-large-v3 remains compatible with the Transformers chunked long-form algorithm. This algorithm should be used when a single large audio file is being transcribed and the fastest possible inference is required. In such circumstances, the chunked algorithm is up to 9x faster than OpenAI's sequential long-form implementation (see Table 7 of the Distil-Whisper paper).
We can load the model and processor as before:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
model_id = "distil-whisper/distil-large-v3"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)
To enable chunking, pass the chunk_length_s parameter to the pipeline. For distil-large-v3, a chunk length of 25-seconds
is optimal. To activate batching, pass the argument batch_size:
pipe = pipeline(
"automatic-speech-recognition",
model=model,
tokenizer=processor.tokenizer,
feature_extractor=processor.feature_extractor,
max_new_tokens=128,
chunk_length_s=25,
batch_size=16,
torch_dtype=torch_dtype,
device=device,
)
The argument max_new_tokens controls the maximum number of generated tokens per-chunk. In the typical speech setting,
we have no more than 3 words spoken per-second. Therefore, for a 30-second input, we have at most 90 words (approx 128 tokens).
We set the maximum number of generated tokens per-chunk to 128 to truncate any possible hallucinations that occur at the
end of the segment. These tokens get removed at the chunk borders using the long-form chunking transcription algor
$ claude mcp add distil-whisper \
-- python -m otcore.mcp_server <graph>