TL;DR — Quick Summary

Run OpenAI's Whisper speech-to-text model locally for free, private audio transcription. Covers CLI, Docker, GPU acceleration, Whisper.cpp for CPU, faster-whisper, and web UI options.

What Is Whisper?

Whisper is OpenAI’s open-source automatic speech recognition (ASR) model. It can transcribe audio in 99 languages, translate speech to English, and generate subtitles with accurate timestamps — all running locally on your own hardware, completely free and private.

Key features:

  • 99 languages — transcripts in the original language or translated to English
  • Multiple model sizes — from tiny (75MB) to large-v3 (3GB)
  • Subtitle generation — SRT, VTT, and TSV formats with timestamps
  • No API needed — runs entirely offline
  • GPU acceleration — CUDA for NVIDIA GPUs
  • Multiple implementations — Python, C++ (whisper.cpp), faster-whisper

Model Comparison

| Model | Size | VRAM | English WER | Speed (GPU) | Speed (CPU) |
|---|---|---|---|---|---|
| tiny | 75 MB | ~1 GB | 8.0% | ~32x realtime | ~10x realtime |
| base | 142 MB | ~1 GB | 5.7% | ~16x realtime | ~7x realtime |
| small | 466 MB | ~2 GB | 4.2% | ~6x realtime | ~2x realtime |
| medium | 1.5 GB | ~5 GB | 3.5% | ~2x realtime | ~0.5x realtime |
| large-v3 | 3 GB | ~10 GB | 2.9% | ~1x realtime | ~0.1x realtime |

Tip: For most use cases, small or medium offers the best accuracy vs speed tradeoff. Use large-v3 only when accuracy is critical.
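As a rough heuristic, the table above can be encoded as a model picker keyed on available VRAM. The thresholds and names come from the table; `pick_model` itself is a hypothetical helper, not part of Whisper:

```python
# Approximate VRAM requirements from the model comparison table above.
MODEL_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large-v3", 10),
]

def pick_model(vram_gb: float) -> str:
    """Return the largest model whose approximate VRAM need still fits."""
    best = "tiny"  # smallest model; per the table it needs ~1 GB
    for name, need in MODEL_VRAM_GB:
        if need <= vram_gb:
            best = name
    return best

print(pick_model(4))   # a 4 GB GPU comfortably runs "small"
print(pick_model(12))  # 12 GB is enough for "large-v3"
```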


Installation

Python (Original)

pip install openai-whisper

# Transcribe
whisper audio.mp3 --model base

# Transcribe with language detection and SRT output
whisper interview.wav --model small --output_format srt

# Translate to English
whisper audio_spanish.mp3 --model medium --task translate
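The CLI writes SRT for you, but when calling Whisper from Python, `model.transcribe()` returns a dict whose `segments` list carries `start`, `end`, and `text` per segment. A minimal sketch of rendering such segments as SRT yourself — the segment shape matches openai-whisper's output, while `to_srt` and `srt_timestamp` are hypothetical helpers:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Render whisper-style segments (dicts with start/end/text) as SRT."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Hand-made segment in the same shape model.transcribe() returns:
print(to_srt([{"start": 0.0, "end": 2.5, "text": " Hello there."}]))
```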

faster-whisper (4x Faster)

pip install faster-whisper

python -c "
from faster_whisper import WhisperModel
model = WhisperModel('base', device='cuda', compute_type='float16')
segments, info = model.transcribe('audio.mp3', beam_size=5)
for segment in segments:
    print(f'[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}')
"

whisper.cpp (Best for CPU)

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build && cmake --build build -j   # recent releases build with CMake; older ones used `make`

# Download a model
bash models/download-ggml-model.sh base

# whisper.cpp expects 16 kHz mono WAV; convert with ffmpeg first
ffmpeg -i audio.mp3 -ar 16000 -ac 1 -c:a pcm_s16le audio.wav

# Transcribe (the binary is named ./main in older builds)
./build/bin/whisper-cli -m models/ggml-base.bin -f audio.wav -otxt -osrt

Docker

The ASR webservice image runs as an HTTP server rather than a one-shot CLI:

docker run -d -p 9000:9000 -e ASR_MODEL=base \
  onerahmet/openai-whisper-asr-webservice:latest

# GPU variant
docker run -d --gpus all -p 9000:9000 -e ASR_MODEL=base \
  onerahmet/openai-whisper-asr-webservice:latest-gpu

# Then POST audio to the /asr endpoint
curl -F "audio_file=@audio.mp3" "http://localhost:9000/asr?output=srt"

Web UI Options

For a browser-based interface:

| Project | Description | Docker |
|---|---|---|
| Whishper | Simple upload-and-transcribe UI | docker run -p 9000:9000 pluja/whishper |
| Whisper ASR Webservice | REST API with Swagger UI | onerahmet/openai-whisper-asr-webservice |
| Subtitle Edit | Full editor with Whisper integration | Desktop app |

Common Use Cases

| Use Case | Model | Command |
|---|---|---|
| Meeting transcription | small | whisper meeting.mp3 --model small |
| Video subtitles | medium | whisper video.mp4 --model medium --output_format srt |
| Podcast transcription | base | whisper podcast.mp3 --model base --output_format txt |
| Translate foreign audio | medium | whisper foreign.mp3 --model medium --task translate |
| Batch process folder | base | for f in *.mp3; do whisper "$f" --model base; done |
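All of these one-liners follow the same pattern, so batch jobs can also be driven from Python. A sketch that builds the equivalent CLI invocation per file — `transcribe_cmd` is a hypothetical helper, and the actual subprocess call is left commented out so nothing runs by accident:

```python
from pathlib import Path
import subprocess

def transcribe_cmd(path: Path, model: str = "base", fmt: str = "txt") -> list[str]:
    """Build the whisper CLI argv for one file, mirroring the table above."""
    return ["whisper", str(path), "--model", model, "--output_format", fmt]

for audio in sorted(Path(".").glob("*.mp3")):
    cmd = transcribe_cmd(audio, model="base")
    print(" ".join(cmd))                  # show what would run
    # subprocess.run(cmd, check=True)     # uncomment to actually transcribe
```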

Whisper vs. Cloud Speech-to-Text

| Aspect | Whisper (Local) | Google STT | AWS Transcribe |
|---|---|---|---|
| Cost | Free | $0.006-0.048/min | $0.024/min |
| Privacy | ✅ Data stays local | ❌ Cloud | ❌ Cloud |
| Offline | ✅ | ❌ | ❌ |
| Languages | 99 | 125+ | 100+ |
| Accuracy | Excellent | Excellent | Good |
| Custom vocabulary | ❌ | ✅ | ✅ |
| Real-time streaming | Limited | ✅ | ✅ |
| Speaker diarization | Via plugin | ✅ | ✅ |

Summary

Whisper gives you state-of-the-art speech-to-text transcription running locally, privately, and for free. Use the Python package for quick transcriptions, faster-whisper for GPU-accelerated performance, or whisper.cpp for efficient CPU-only operation.