How to Fine-Tune Whisper for Domain-Specific Speech Recognition

OpenAI’s Whisper is one of the best open-source speech recognition models available. Out of the box, whisper-large-v3-turbo achieves single-digit word error rates (WER) on general English benchmarks such as LibriSpeech. But point it at radiology reports, esports commentary, legal depositions, or manufacturing SOPs and WER can spike to 30-50%. The model simply has not seen enough of those specialized terms during pre-training to transcribe them reliably.

You can fix this. Fine-tuning Whisper on a small dataset of domain-specific audio - as little as one to three hours - with LoRA adapters brings domain-term WER down by 30-60%. The entire training process fits on a single consumer GPU with 12-16 GB of VRAM, takes a couple of hours, and produces an adapter file under 100 MB. What follows is the full process from data preparation through deployment.

When to Fine-Tune Instead of Prompting or Post-Processing

Whisper supports an initial_prompt parameter that biases its decoder toward expected terminology. If you only need to get 5-10 key terms right, this is often enough. You pass the terms in a short prompt and Whisper adjusts its predictions accordingly.

For slightly messier cases, post-processing with a dictionary-based fuzzy matcher or a small language model can catch common misrecognitions after decoding. This adds latency and can introduce its own errors, but it works when the mistakes are predictable and the vocabulary is small.
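A dictionary-based corrector along those lines can be sketched with the stdlib difflib - the tokenization, vocabulary, and similarity cutoff here are illustrative, and real text would need punctuation and casing handled:

```python
from difflib import get_close_matches

def correct_terms(text, vocabulary, cutoff=0.8):
    """Replace words that closely match a known domain term (naive whitespace tokenization)."""
    corrected = []
    for word in text.split():
        # Pick the single best vocabulary match above the similarity cutoff, if any
        match = get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
        corrected.append(match[0] if match else word)
    return " ".join(corrected)
```

Raising the cutoff makes the matcher more conservative; lowering it catches more misrecognitions but risks rewriting ordinary words.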

Fine-tuning makes sense when:

  • You have more than 50 unique domain-specific terms that the base model consistently gets wrong
  • Accuracy on those terms is critical (medical transcription, legal records, compliance documentation)
  • You need consistent output formatting - medical abbreviations, chemical formulas, proprietary product names spelled exactly right
  • You are processing enough audio that even small per-file error rates compound into a real problem

The cost is modest. On an RTX 5080 with 16 GB VRAM, a LoRA fine-tuning run takes 2-4 hours of GPU time. The resulting adapter does not change inference speed at all. And you can merge it into the base model for deployment without any extra dependencies.

Which Model to Start From

Your main options in the Whisper family:

Model | Parameters | Speed | Fine-tuning suitability
openai/whisper-large-v3-turbo | 809M | Fast | Best accuracy-to-speed ratio, recommended starting point
openai/whisper-large-v3 | 1.5B | Moderate | Marginally better accuracy, 2x slower, needs more VRAM
distil-whisper/distil-large-v3 | ~756M | Fastest | Distilled model, less responsive to fine-tuning

[Figure: Whisper's multitask architecture — audio is chunked into 30-second segments, converted to log-Mel spectrograms, processed by a Transformer encoder, and decoded into text tokens. Image: OpenAI Whisper, MIT License]

Start with whisper-large-v3-turbo unless you have a specific reason to use the full 1.5B model. The turbo variant gives you the best tradeoff between transcription quality and resource consumption during both training and inference.

Preparing Your Domain-Specific Dataset

Data quality determines fine-tuning success more than any hyperparameter choice. A clean, well-segmented dataset of one hour will outperform a sloppy dataset of ten hours.

How Much Audio You Need

  • 1 hour of accurately transcribed domain audio gives noticeable improvement on key terms
  • 3+ hours produces strong, consistent results across the full domain vocabulary
  • Beyond 10 hours you hit diminishing returns for narrow domains, though broader domains covering multiple sub-specialties can benefit from more

Audio Format and Segmentation

Whisper expects 16kHz mono audio internally. Pre-convert everything to avoid inconsistencies during training:

ffmpeg -i input.mp3 -ar 16000 -ac 1 output.wav

Split long recordings into segments of 10-30 seconds, aligned to sentence boundaries. WhisperX (v3.3+) handles forced alignment well and gives you timestamps you can use to cut the audio. After automatic segmentation, you will need to manually correct the transcripts - this is the part where the work pays off.
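Grouping aligned words into training-sized segments can be sketched as follows - the (start, end, word) tuple format and the gap/length thresholds are assumptions to adapt to WhisperX's actual output:

```python
def group_words(words, max_len=30.0, max_gap=0.6):
    """Group (start, end, word) tuples into segments no longer than max_len seconds,
    starting a new segment whenever the pause between words exceeds max_gap."""
    segments, current = [], []
    for start, end, word in words:
        too_long = current and end - current[0][0] > max_len
        big_pause = current and start - current[-1][1] > max_gap
        if too_long or big_pause:
            segments.append(current)
            current = []
        current.append((start, end, word))
    if current:
        segments.append(current)
    # Collapse each group into (segment_start, segment_end, joined_text)
    return [(seg[0][0], seg[-1][1], " ".join(w for _, _, w in seg)) for seg in segments]
```

Cutting at pauses rather than at fixed intervals keeps segment boundaries aligned with natural sentence breaks.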

Bootstrapping Transcripts

If you are starting from raw audio with no transcripts at all, run the base Whisper model first to generate rough transcriptions. Then correct the domain-specific terms by hand. This bootstrapping approach is 5-10x faster than transcribing from scratch because most of the common words will already be right - you are only fixing the specialized vocabulary.

Dataset Format

Hugging Face Datasets expects a DatasetDict with train and test splits. Each example needs:

  • An audio column containing the path to the WAV file
  • A transcription column with the ground-truth text

from datasets import DatasetDict, Dataset, Audio

train_data = Dataset.from_dict({
    "audio": train_audio_paths,
    "transcription": train_transcripts
}).cast_column("audio", Audio(sampling_rate=16000))

test_data = Dataset.from_dict({
    "audio": test_audio_paths,
    "transcription": test_transcripts
}).cast_column("audio", Audio(sampling_rate=16000))

ds = DatasetDict({"train": train_data, "test": test_data})

Validation Checklist

Before training, verify:

  • No audio leaks between train and test splits (same speaker in the same session should not appear in both)
  • Unicode and punctuation are normalized consistently
  • All audio files are readable and non-silent
  • Blank or near-silent segments are removed
  • Transcriptions match the actual audio content
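The non-silent check can be automated with a simple energy test on decoded 16-bit PCM samples - a minimal sketch, with an illustrative RMS threshold you should tune for your recordings (use soundfile or the stdlib wave module to read the samples):

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of PCM samples."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_near_silent(samples, threshold=50.0):
    """Flag segments whose average energy falls below the threshold."""
    return rms(samples) < threshold
```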

Data Augmentation

If you are short on audio, the audiomentations library lets you stretch your dataset. Apply speed perturbation (0.9x to 1.1x) and add light background noise to create variations without recording additional audio. This improves model robustness, particularly for noisy real-world environments.

from audiomentations import Compose, TimeStretch, AddGaussianNoise

augment = Compose([
    TimeStretch(min_rate=0.9, max_rate=1.1, p=0.5),
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.5),
])

Fine-Tuning with Hugging Face Transformers and LoRA

Full fine-tuning of whisper-large-v3-turbo requires 40+ GB of VRAM. LoRA sidesteps this by freezing the base model and training small adapter matrices on the attention layers. You end up updating only a small fraction of the total parameters, which fits comfortably in 11-12 GB of VRAM.

Install Dependencies

pip install "transformers[torch]>=4.48.0" datasets accelerate peft evaluate jiwer

Pin transformers>=4.48.0 to get the latest Whisper architecture support and bug fixes.

Load the Model and Processor

from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_name = "openai/whisper-large-v3-turbo"
processor = WhisperProcessor.from_pretrained(model_name)
model = WhisperForConditionalGeneration.from_pretrained(model_name)

Configure LoRA

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="SEQ_2_SEQ_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# prints the trainable and total parameter counts

[Figure: LoRA decomposes weight updates into two small trainable matrices A and B while the original weight matrix W stays frozen — only a small fraction of parameters are trained. Image: Hugging Face PEFT]

Targeting q_proj and v_proj in the attention layers is the standard starting point. If results are not good enough, you can expand to k_proj and o_proj or increase the rank from 16 to 32. The same LoRA rank and target module choices apply across model types — the Stable Diffusion XL fine-tuning guide covers the same adapter pattern for image generation if you want to see LoRA applied to a different architecture.
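As a back-of-envelope check on adapter size: each adapted d_in x d_out matrix gains two small factors, A (r x d_in) and B (d_out x r), so r * (d_in + d_out) trainable parameters. Whisper-large's attention projections are 1280 x 1280:

```python
def lora_param_count(d_in, d_out, r):
    """Parameters added by one LoRA adapter pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Each adapted 1280x1280 projection at r=16 adds 16 * 2560 = 40,960 parameters
per_matrix = lora_param_count(1280, 1280, 16)
```

Doubling the rank doubles the adapter size, and adding k_proj and o_proj to target_modules doubles the number of adapted matrices - both are cheap relative to the 809M frozen base.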

Data Collator and Training Arguments

The data collator handles padding for variable-length audio inputs and sets up the decoder input IDs for English transcription:

from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer
from dataclasses import dataclass
from typing import Any, Dict, List, Union
import torch

@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(self, features: List[Dict[str, Union[List[int], torch.Tensor]]]) -> Dict[str, torch.Tensor]:
        input_features = [{"input_features": f["input_features"]} for f in features]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        label_features = [{"input_ids": f["labels"]} for f in features]
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )
        batch["labels"] = labels
        return batch

data_collator = DataCollatorSpeechSeq2SeqWithPadding(processor=processor)

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-finetuned",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    warmup_steps=100,
    max_steps=1000,
    fp16=True,
    eval_strategy="steps",
    eval_steps=100,
    save_steps=100,
    predict_with_generate=True,
    logging_steps=25,
    report_to="none",
)

Launch Training

import evaluate  # the "wer" metric uses jiwer as a backend

wer_metric = evaluate.load("wer")

def compute_wer(pred):
    pred_ids = pred.predictions
    label_ids = pred.label_ids
    label_ids[label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    label_str = processor.tokenizer.batch_decode(label_ids, skip_special_tokens=True)
    wer = wer_metric.compute(predictions=pred_str, references=label_str)
    return {"wer": wer}

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=ds["train"],
    eval_dataset=ds["test"],
    data_collator=data_collator,
    compute_metrics=compute_wer,
)

trainer.train()

Expect roughly 2 hours for 1000 steps on an RTX 5080. VRAM usage should sit around 11 GB with batch size 4 and fp16 enabled. If you hit out-of-memory errors, drop the batch size to 2 and increase gradient_accumulation_steps to 8 to maintain the same effective batch size.

Evaluating and Iterating on Your Fine-Tuned Model

A single WER number on your test set does not tell the whole story. You need to measure domain-term accuracy specifically and verify that general transcription quality has not degraded.

Domain-Term Accuracy

Extract your list of domain-specific terms and compute two metrics:

  • Recall: what percentage of domain terms in the reference transcripts does the model transcribe correctly?
  • Precision: how often does the model hallucinate similar-sounding non-domain terms where none exist?

def domain_term_recall(references, hypotheses, domain_terms):
    hits = 0
    total = 0
    for ref, hyp in zip(references, hypotheses):
        for term in domain_terms:
            if term.lower() in ref.lower():
                total += 1
                if term.lower() in hyp.lower():
                    hits += 1
    return hits / total if total > 0 else 0.0
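The precision side can be counted the same way - of the domain terms the model emitted, how many are actually supported by the reference:

```python
def domain_term_precision(references, hypotheses, domain_terms):
    """Fraction of emitted domain terms that also appear in the reference transcript."""
    emitted = 0
    supported = 0
    for ref, hyp in zip(references, hypotheses):
        for term in domain_terms:
            if term.lower() in hyp.lower():
                emitted += 1
                if term.lower() in ref.lower():
                    supported += 1
    return supported / emitted if emitted > 0 else 0.0
```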

Baseline Comparison

Always run the unmodified base Whisper model on the same test set. Log both overall WER and domain-term WER in a comparison table:

Metric | Base Model | Fine-Tuned
Overall WER | 32.1% | 14.7%
Domain-term recall | 41% | 87%
Domain-term precision | 78% | 94%

These numbers are illustrative, but 30-60% relative WER reduction on domain terms is typical with 1-3 hours of training data.

Catastrophic Forgetting Check

Test on a slice of LibriSpeech or Common Voice to ensure general transcription quality has not degraded. If WER on general audio rises by more than about 2 percentage points over the base model, your training is overfitting to the domain data. Reduce the number of training steps or lower the learning rate.

Iteration Strategies

If domain-term accuracy is still insufficient after your first training run:

  1. Add more training examples that contain the problematic terms
  2. Increase LoRA rank from 16 to 32 (trains more parameters, may capture more complex patterns)
  3. Train for more steps with a lower learning rate (try 5e-5 instead of 1e-4)
  4. Expand target_modules to include k_proj and o_proj in addition to q_proj and v_proj

Merge and Export

Once you are satisfied with the results, merge the LoRA adapter into the base model so you can deploy without the PEFT dependency:

model = model.merge_and_unload()
model.save_pretrained("./whisper-domain-merged")
processor.save_pretrained("./whisper-domain-merged")

Deploying Your Fine-Tuned Whisper Model

You have a trained model. Now you need to get it running somewhere other than your training machine.

Quick Local Inference

Load the merged model with the Hugging Face pipeline API:

from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="./whisper-domain-merged",
    device="cuda",
)

result = pipe("recording.wav")
print(result["text"])

This gives you roughly 1.5x real-time speed on an RTX 5080 - fast enough for interactive use but not optimal for batch processing.

Faster Inference with CTranslate2

CTranslate2 and the faster-whisper library can deliver 3x speed improvement and 50% less VRAM usage compared to the standard Hugging Face pipeline:

ct2-transformers-converter --model ./whisper-domain-merged \
    --output_dir ./whisper-ct2 \
    --quantization float16

from faster_whisper import WhisperModel

model = WhisperModel("./whisper-ct2", device="cuda", compute_type="float16")
segments, info = model.transcribe("recording.wav")

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

Docker API Endpoint

For production use, wrap the model in a FastAPI service:

from fastapi import FastAPI, UploadFile
from faster_whisper import WhisperModel
import tempfile

app = FastAPI()
model = WhisperModel("./whisper-ct2", device="cuda", compute_type="float16")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        tmp.write(await file.read())
        tmp.flush()
        segments, info = model.transcribe(tmp.name)
        return {
            "text": " ".join(s.text for s in segments),
            "language": info.language,
        }

Use nvidia/cuda:12.4.1-runtime-ubuntu24.04 as the Docker base image and install faster-whisper and fastapi with uvicorn in your Dockerfile.
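A minimal Dockerfile along those lines might look like this - a sketch, assuming the FastAPI service above is saved as server.py and the converted model directory is copied in alongside it (Ubuntu 24.04's system Python is externally managed, hence the --break-system-packages flag; a virtualenv works too):

```dockerfile
FROM nvidia/cuda:12.4.1-runtime-ubuntu24.04

RUN apt-get update && apt-get install -y python3 python3-pip \
    && rm -rf /var/lib/apt/lists/*
RUN pip3 install --break-system-packages faster-whisper fastapi uvicorn

WORKDIR /app
COPY whisper-ct2 ./whisper-ct2
COPY server.py .

EXPOSE 8000
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
```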

Batch Processing

For bulk transcription of large audio collections, parallelize with the Datasets library:

from datasets import Dataset, Audio

audio_ds = Dataset.from_dict({"audio": audio_file_paths})
audio_ds = audio_ds.cast_column("audio", Audio(sampling_rate=16000))

def transcribe_batch(batch):
    results = pipe(batch["audio"], batch_size=8)
    batch["transcription"] = [r["text"] for r in results]
    return batch

transcribed = audio_ds.map(transcribe_batch, batched=True, batch_size=8)

Model Versioning

Tag each fine-tuned model with the training dataset version and evaluation scores. Push to a private Hugging Face Hub repository or store in a local model registry. Include the training configuration, WER results on both domain and general test sets, and the date of the training run. This way you can roll back or compare models when you retrain with updated data.
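One lightweight way to do this is a JSON sidecar stored next to the checkpoint. A sketch - the field names, dataset tag, and metric values here are illustrative:

```python
import json
from datetime import date

def build_model_card(base_model, dataset_version, training_config, metrics):
    """Assemble a metadata record to store alongside the checkpoint."""
    return {
        "base_model": base_model,
        "dataset_version": dataset_version,
        "training_config": training_config,
        "metrics": metrics,  # e.g. domain and general WER
        "trained_on": date.today().isoformat(),
    }

card = build_model_card(
    base_model="openai/whisper-large-v3-turbo",
    dataset_version="domain-audio-v2",
    training_config={"lora_r": 16, "learning_rate": 1e-4, "max_steps": 1000},
    metrics={"domain_wer": 0.147, "general_wer": 0.081},
)
card_json = json.dumps(card, indent=2)  # write this next to the merged model directory
```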

Real-World Integration Examples

Some concrete examples of where this works well:

  • Medical dictation: transcribe clinical notes with correct drug names, procedure codes, and anatomical terms, then feed into an EHR system
  • Gaming commentary analysis: capture esports terminology, character names, and ability names accurately for automated highlight generation
  • Local AI assistant: use as the speech frontend for a voice-controlled assistant where domain accuracy matters more than general conversation
  • Manufacturing documentation: transcribe floor workers’ verbal reports with correct part numbers and process names

The model itself is just a checkpoint on disk. How you integrate it depends on your application, but the inference code stays the same regardless of the downstream use case.