TL;DR — Quick Summary
Run LLMs like Llama 3, Mistral, Gemma, and Phi locally with Ollama. Guide covers installation, GPU acceleration, Docker deployment, REST API, and Open WebUI integration.
What Is Ollama?
Ollama is an open-source tool that makes it trivially easy to download, run, and manage large language models (LLMs) on your local machine. Think of it as Docker for AI models — you pull a model, run it, and interact via CLI or REST API.
With Ollama you get:
- One-command model downloads — `ollama pull llama3.2` and you are running Meta’s latest model
- GPU acceleration — automatic NVIDIA CUDA and Apple Silicon Metal support
- OpenAI-compatible API — drop-in replacement for many applications
- Complete privacy — your prompts and data never leave your network
- No API costs — run unlimited queries after hardware investment
- Model customization — create custom models with Modelfiles (system prompts, parameters)
- Docker support — deploy Ollama as a container for server environments
Prerequisites
- Linux, macOS (Apple Silicon recommended), or Windows 10/11.
- 8 GB RAM minimum (16 GB recommended for 13B models).
- 20+ GB free disk space for model storage.
- Optional: NVIDIA GPU with 6+ GB VRAM and CUDA drivers for GPU acceleration.
- Optional: Docker, if deploying Ollama as a container.
Installation
Linux (One-Line Install)
curl -fsSL https://ollama.com/install.sh | sh
This installs Ollama and configures it as a systemd service that starts automatically on boot. Verify with:
ollama --version
systemctl status ollama
Docker Deployment
For server environments or when you want isolation:
# CPU only
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# With NVIDIA GPU
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
macOS and Windows
Download the installer from ollama.com and run it. On macOS, Ollama runs as a menu bar application. On Windows, it runs as a system tray application.
Downloading and Running Models
Popular Models
| Model | Size | Parameters | Best For |
|---|---|---|---|
| `llama3.2` | 2.0 GB | 3B | General purpose, conversations |
| `llama3.1:70b` | 40 GB | 70B | High-quality reasoning (needs 64 GB RAM) |
| `mistral` | 4.1 GB | 7B | Fast, efficient general purpose |
| `gemma2` | 5.4 GB | 9B | Google’s quality model |
| `phi3` | 2.2 GB | 3.8B | Small, fast, surprisingly capable |
| `codellama` | 3.8 GB | 7B | Code generation and review |
| `llama3.2-vision` | 7.9 GB | 11B | Image understanding + text |
| `deepseek-coder-v2` | 8.9 GB | 16B | Advanced code generation |
| `qwen2.5` | 4.7 GB | 7B | Multilingual, strong at Chinese |
| `nomic-embed-text` | 274 MB | 137M | Text embeddings for RAG |
Pull and Run
# Download a model
ollama pull llama3.2
# Start interactive chat
ollama run llama3.2
# Run with a specific prompt (non-interactive)
ollama run llama3.2 "Explain Kubernetes in 3 sentences"
# List downloaded models
ollama list
# Remove a model
ollama rm codellama
REST API
Ollama exposes a REST API on port 11434, including an OpenAI-compatible endpoint. This is the primary way to integrate Ollama with applications.
Generate (Completion)
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "What is a reverse proxy?",
"stream": false
}'
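With the default `"stream": true`, the same endpoint emits one JSON object per line, each carrying a partial `response` string and a `done` flag on the final chunk. A minimal Python sketch for reassembling a streamed completion (the helper name is our own):

```python
import json

def join_stream(ndjson_lines):
    """Reassemble the full completion from Ollama's streaming output:
    one JSON object per line, each with a partial 'response' fragment
    and 'done': true on the final chunk."""
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```

When reading from `urllib.request.urlopen`, you can iterate over the response object line by line and feed it straight into a helper like this.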
Chat (Conversation)
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [
{"role": "system", "content": "You are a helpful sysadmin assistant."},
{"role": "user", "content": "How do I check disk usage on Linux?"}
],
"stream": false
}'
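Because `/api/chat` is stateless, the client must resend the full message history on every turn. A sketch of a minimal multi-turn helper in Python (function names are our own; the network call assumes a running local Ollama instance):

```python
import json
import urllib.request

OLLAMA_CHAT = "http://localhost:11434/api/chat"  # default local endpoint

def build_chat_payload(history, user_message, model="llama3.2"):
    """Append the new user turn and build the /api/chat request body."""
    history = history + [{"role": "user", "content": user_message}]
    return history, {"model": model, "messages": history, "stream": False}

def chat(history, user_message, model="llama3.2"):
    """Send one turn; returns (updated_history, assistant_reply).
    Requires Ollama to be running locally."""
    history, payload = build_chat_payload(history, user_message, model)
    req = urllib.request.Request(
        OLLAMA_CHAT,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["message"]  # {"role": "assistant", "content": ...}
    return history + [reply], reply["content"]
```

Keeping the returned history and passing it back into the next `chat()` call is what gives the model conversational memory.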
Embeddings (for RAG)
curl http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": "Ollama is an open-source LLM runner"
}'
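In a RAG pipeline you embed both your documents and the query, then rank documents by cosine similarity. A Python sketch (assumes Ollama is running with `nomic-embed-text` pulled; `/api/embed` returns one vector per input under the `embeddings` key):

```python
import json
import math
import urllib.request

OLLAMA_EMBED = "http://localhost:11434/api/embed"  # default local endpoint

def embed(texts, model="nomic-embed-text"):
    """Return one embedding vector per input string via /api/embed."""
    payload = {"model": model, "input": texts}
    req = urllib.request.Request(
        OLLAMA_EMBED,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embeddings"]

def cosine(a, b):
    """Cosine similarity between two vectors: the usual RAG ranking score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```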
OpenAI-Compatible Endpoint
For applications that support OpenAI’s API format, Ollama provides a compatible endpoint:
curl http://localhost:11434/v1/chat/completions -d '{
"model": "llama3.2",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
Simply change the API base URL from https://api.openai.com/v1 to http://localhost:11434/v1 in most applications.
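For example, the official `openai` Python package can talk to Ollama directly. The client insists on an API key, but Ollama ignores its value, so any placeholder works. A sketch (the import is guarded so the package stays an optional dependency):

```python
# Optional dependency: pip install openai
try:
    from openai import OpenAI
except ImportError:
    OpenAI = None

def local_client(base_url="http://localhost:11434/v1"):
    """OpenAI client pointed at the local Ollama endpoint.
    The api_key is a placeholder: Ollama does not check it."""
    if OpenAI is None:
        raise RuntimeError("install the openai package: pip install openai")
    return OpenAI(base_url=base_url, api_key="ollama")

if __name__ == "__main__":
    client = local_client()
    resp = client.chat.completions.create(
        model="llama3.2",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(resp.choices[0].message.content)
```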
GPU Acceleration
NVIDIA GPUs
Ollama automatically detects and uses NVIDIA GPUs if CUDA drivers are installed:
# Verify NVIDIA driver
nvidia-smi
# Run a model (GPU is used automatically)
ollama run llama3.2
# Check GPU memory usage during inference
watch -n 1 nvidia-smi
| GPU | VRAM | Models You Can Run |
|---|---|---|
| RTX 3060 | 12 GB | 7B–13B fully in VRAM |
| RTX 3090 / 4090 | 24 GB | Up to 30B fully in VRAM |
| 2× RTX 3090 | 48 GB | 70B with model splitting |
| A100 | 80 GB | 70B+ fully in VRAM |
Apple Silicon (M1/M2/M3/M4)
Ollama uses Metal on Apple Silicon for GPU acceleration automatically. The unified memory architecture means the model can use most of the system memory:
| Chip | Unified Memory | Practical Max Model |
|---|---|---|
| M1 / M2 (8 GB) | 8 GB | 7B models |
| M1 Pro / M2 Pro (16 GB) | 16 GB | 13B models |
| M1 Max / M2 Max (32 GB) | 32 GB | 30B models |
| M1 Ultra / M2 Ultra (64 GB) | 64 GB | 70B models |
Custom Models with Modelfiles
Create custom model configurations to set system prompts, temperature, and other parameters:
# Modelfile for a DevOps assistant
FROM llama3.2
SYSTEM """You are a senior DevOps engineer. Answer questions about Linux, Docker,
Kubernetes, CI/CD, and infrastructure. Provide practical commands and examples.
Keep answers concise and actionable."""
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
Build and run:
ollama create devops-assistant -f Modelfile
ollama run devops-assistant "How do I debug a CrashLoopBackOff in Kubernetes?"
Open WebUI — ChatGPT-Like Interface
Open WebUI provides a polished, ChatGPT-style interface for Ollama:
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Access at http://localhost:3000. Features include:
- Conversation history with search
- Model switching within conversations
- File uploads (PDF, images for vision models)
- Multi-user with role-based access control
- RAG — upload documents and chat with them
- Web search integration (optional)
- Custom model presets (system prompts, temperature)
Troubleshooting
Model Runs Slowly (CPU Only)
Cause: No GPU detected, or CUDA drivers not installed.
Fix:
- Check GPU detection: run `ollama run llama3.2` and watch `nvidia-smi` output in another terminal.
- Install NVIDIA CUDA drivers: `sudo apt install nvidia-driver-535 nvidia-cuda-toolkit`.
- Restart Ollama: `sudo systemctl restart ollama`.
- As a fallback, use a smaller model: `ollama run phi3` (3.8B parameters, fast on CPU).
Out of Memory Errors
Cause: Model too large for available RAM/VRAM.
Fix:
- Try a smaller or more heavily quantized variant — check the model’s tag list on ollama.com for `q4` builds, or pull a smaller size such as `ollama pull llama3.2:1b` (Q4 quantization uses ~50% less memory).
- Close other memory-heavy applications.
- Add swap space for larger models: `sudo fallocate -l 8G /swapfile && sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile`.
API Connection Refused
Cause: Ollama is not listening on the expected address.
Fix:
- Check the service: `systemctl status ollama`.
- By default, Ollama only listens on `127.0.0.1:11434`. To expose it on all interfaces (e.g., for Docker or remote access), set the environment variable `OLLAMA_HOST=0.0.0.0`.
- Edit the systemd service: run `sudo systemctl edit ollama`, add `Environment="OLLAMA_HOST=0.0.0.0"` under a `[Service]` section, then restart with `sudo systemctl restart ollama`.
Ollama vs. Other Local LLM Tools
| Feature | Ollama | LM Studio | Text Generation WebUI | llama.cpp |
|---|---|---|---|---|
| CLI Interface | ✅ | ❌ | ❌ | ✅ |
| GUI Interface | Via Open WebUI | ✅ Built-in | ✅ Built-in | ❌ |
| REST API | ✅ OpenAI-compatible | ✅ | ✅ | Limited |
| Docker Support | ✅ | ❌ | ✅ | ❌ |
| Model Library | 100+ models | 100+ models | Any GGUF | Any GGUF |
| GPU Support | CUDA + Metal | CUDA + Metal | CUDA + Metal | CUDA + Metal |
| Custom Models | ✅ Modelfile | ❌ | ✅ | ❌ |
| Server Mode | ✅ systemd | ❌ (desktop) | ✅ | Limited |
| Best For | Servers, APIs, Docker | Desktop users | Power users | Developers |
Summary
Ollama is the fastest way to go from zero to running AI models locally. A single curl | sh install followed by ollama pull llama3.2 gives you a fully functional local AI that is private, free, and API-compatible with OpenAI. Pair it with Open WebUI for a ChatGPT-like experience on your own hardware.