TL;DR — Quick Summary

Run OpenAI's Whisper speech-to-text model locally for free, private audio transcription. Covers CLI, Docker, GPU acceleration, Whisper.cpp for CPU, faster-whisper, and web UI options.

What Is Whisper?

Whisper is OpenAI’s open-source automatic speech recognition (ASR) model. It can transcribe audio in 99 languages, translate speech to English, and generate subtitles with accurate timestamps — all running locally on your own hardware, completely free and private.

Key features:

  • 99 languages — transcripts in the original language or translated to English
  • Multiple model sizes — from tiny (75MB) to large-v3 (3GB)
  • Subtitle generation — SRT, VTT, and TSV formats with timestamps
  • No API needed — runs entirely offline
  • GPU acceleration — CUDA for NVIDIA GPUs
  • Multiple implementations — Python, C++ (whisper.cpp), faster-whisper

Model Comparison

| Model | Size | VRAM | English WER | Speed (GPU) | Speed (CPU) |
|---|---|---|---|---|---|
| tiny | 75 MB | ~1 GB | 8.0% | ~32x realtime | ~10x realtime |
| base | 142 MB | ~1 GB | 5.7% | ~16x realtime | ~7x realtime |
| small | 466 MB | ~2 GB | 4.2% | ~6x realtime | ~2x realtime |
| medium | 1.5 GB | ~5 GB | 3.5% | ~2x realtime | ~0.5x realtime |
| large-v3 | 3 GB | ~10 GB | 2.9% | ~1x realtime | ~0.1x realtime |

Tip: For most use cases, small or medium offers the best accuracy vs speed tradeoff. Use large-v3 only when accuracy is critical.
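As a rough heuristic, the table above can be encoded as a model picker keyed on available VRAM. The thresholds and names come from the table; `pick_model` itself is a hypothetical helper, not part of Whisper:

```python
# Approximate VRAM requirements from the model comparison table above.
MODEL_VRAM_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large-v3", 10),
]

def pick_model(vram_gb: float) -> str:
    """Return the largest model whose approximate VRAM need still fits."""
    best = "tiny"  # smallest model; per the table it needs ~1 GB
    for name, need in MODEL_VRAM_GB:
        if need <= vram_gb:
            best = name
    return best

print(pick_model(4))   # a 4 GB GPU comfortably runs "small"
print(pick_model(12))  # 12 GB is enough for "large-v3"
```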


Installation

Python (Original)

pip install openai-whisper

# Transcribe
whisper audio.mp3 --model base

# Transcribe with language detection and SRT output
whisper interview.wav --model small --output_format srt

# Translate to English
whisper audio_spanish.mp3 --model medium --task translate
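The CLI writes SRT for you, but when calling Whisper from Python, `model.transcribe()` returns a dict whose `segments` list carries `start`, `end`, and `text` per segment. A minimal sketch of rendering such segments as SRT yourself — the segment shape matches openai-whisper's output, while `to_srt` and `srt_timestamp` are hypothetical helpers:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments) -> str:
    """Render whisper-style segments (dicts with start/end/text) as SRT."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

# Hand-made segment in the same shape model.transcribe() returns:
print(to_srt([{"start": 0.0, "end": 2.5, "text": " Hello there."}]))
```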

faster-whisper (4x Faster)

pip install faster-whisper

python -c "
from faster_whisper import WhisperModel
model = WhisperModel('base', device='cuda', compute_type='float16')
segments, info = model.transcribe('audio.mp3', beam_size=5)
for segment in segments:
    print(f'[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}')
"

whisper.cpp (Best for CPU)

git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp
cmake -B build && cmake --build build -j   # recent releases build with CMake; older ones used `make`

# Download a model
bash models/download-ggml-model.sh base

# whisper.cpp expects 16 kHz mono WAV; convert with ffmpeg first
ffmpeg -i audio.mp3 -ar 16000 -ac 1 -c:a pcm_s16le audio.wav

# Transcribe (the binary is named ./main in older builds)
./build/bin/whisper-cli -m models/ggml-base.bin -f audio.wav -otxt -osrt

Docker

The ASR webservice image runs as an HTTP server rather than a one-shot CLI:

docker run -d -p 9000:9000 -e ASR_MODEL=base \
  onerahmet/openai-whisper-asr-webservice:latest

# GPU variant
docker run -d --gpus all -p 9000:9000 -e ASR_MODEL=base \
  onerahmet/openai-whisper-asr-webservice:latest-gpu

# Then POST audio to the /asr endpoint
curl -F "audio_file=@audio.mp3" "http://localhost:9000/asr?output=srt"

Web UI Options

For a browser-based interface:

| Project | Description | Docker |
|---|---|---|
| Whishper | Simple upload-and-transcribe UI | docker run -p 9000:9000 pluja/whishper |
| Whisper ASR Webservice | REST API with Swagger UI | onerahmet/openai-whisper-asr-webservice |
| Subtitle Edit | Full editor with Whisper integration | Desktop app |

Common Use Cases

| Use Case | Model | Command |
|---|---|---|
| Meeting transcription | small | whisper meeting.mp3 --model small |
| Video subtitles | medium | whisper video.mp4 --model medium --output_format srt |
| Podcast transcription | base | whisper podcast.mp3 --model base --output_format txt |
| Translate foreign audio | medium | whisper foreign.mp3 --model medium --task translate |
| Batch process folder | base | for f in *.mp3; do whisper "$f" --model base; done |
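All of these one-liners follow the same pattern, so batch jobs can also be driven from Python. A sketch that builds the equivalent CLI invocation per file — `transcribe_cmd` is a hypothetical helper, and the actual subprocess call is left commented out so nothing runs by accident:

```python
from pathlib import Path
import subprocess

def transcribe_cmd(path: Path, model: str = "base", fmt: str = "txt") -> list[str]:
    """Build the whisper CLI argv for one file, mirroring the table above."""
    return ["whisper", str(path), "--model", model, "--output_format", fmt]

for audio in sorted(Path(".").glob("*.mp3")):
    cmd = transcribe_cmd(audio, model="base")
    print(" ".join(cmd))                  # show what would run
    # subprocess.run(cmd, check=True)     # uncomment to actually transcribe
```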

Whisper vs. Cloud Speech-to-Text

| Aspect | Whisper (Local) | Google STT | AWS Transcribe |
|---|---|---|---|
| Cost | Free | $0.006-0.048/min | $0.024/min |
| Privacy | ✅ Data stays local | ❌ Cloud | ❌ Cloud |
| Offline | ✅ | ❌ | ❌ |
| Languages | 99 | 125+ | 100+ |
| Accuracy | Excellent | Excellent | Good |
| Custom vocabulary | ❌ | ✅ | ✅ |
| Real-time streaming | Limited | ✅ | ✅ |
| Speaker diarization | Via plugin | ✅ | ✅ |

Summary

Whisper gives you state-of-the-art speech-to-text transcription running locally, privately, and for free. Use the Python package for quick transcriptions, faster-whisper for GPU-accelerated performance, or whisper.cpp for efficient CPU-only operation.