TL;DR — Quick Summary

Run LLMs like Llama 3, Mistral, Gemma, and Phi locally with Ollama. Guide covers installation, GPU acceleration, Docker deployment, REST API, and Open WebUI integration.

What Is Ollama?

Ollama is an open-source tool that makes it trivially easy to download, run, and manage large language models (LLMs) on your local machine. Think of it as Docker for AI models — you pull a model, run it, and interact via CLI or REST API.

With Ollama you get:

  • One-command model downloads — ollama pull llama3.2 and you are running Meta’s latest model
  • GPU acceleration — automatic NVIDIA CUDA and Apple Silicon Metal support
  • OpenAI-compatible API — drop-in replacement for many applications
  • Complete privacy — your prompts and data never leave your network
  • No API costs — run unlimited queries after hardware investment
  • Model customization — create custom models with Modelfiles (system prompts, parameters)
  • Docker support — deploy Ollama as a container for server environments

Prerequisites

  • Linux, macOS (Apple Silicon recommended), or Windows 10/11.
  • 8 GB RAM minimum (16 GB recommended for 13B models).
  • 20+ GB free disk space for model storage.
  • Optional: NVIDIA GPU with 6+ GB VRAM and CUDA drivers for GPU acceleration.
  • Optional: Docker, if deploying Ollama as a container.

Installation

Linux (One-Line Install)

curl -fsSL https://ollama.com/install.sh | sh

This installs Ollama and configures it as a systemd service that starts automatically on boot. Verify with:

ollama --version
systemctl status ollama

Docker Deployment

For server environments or when you want isolation:

# CPU only
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# With NVIDIA GPU
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

macOS and Windows

Download the installer from ollama.com and run it. On macOS, Ollama runs as a menu bar application. On Windows, it runs as a system tray application.


Downloading and Running Models

| Model | Size | Parameters | Best For |
|---|---|---|---|
| llama3.2 | 4.7 GB | 8B | General purpose, conversations |
| llama3.2:70b | 40 GB | 70B | High-quality reasoning (needs 64 GB RAM) |
| mistral | 4.1 GB | 7B | Fast, efficient general purpose |
| gemma2 | 5.4 GB | 9B | Google’s quality model |
| phi3 | 2.2 GB | 3.8B | Small, fast, surprisingly capable |
| codellama | 3.8 GB | 7B | Code generation and review |
| llama3.2-vision | 7.9 GB | 11B | Image understanding + text |
| deepseek-coder-v2 | 8.9 GB | 16B | Advanced code generation |
| qwen2.5 | 4.7 GB | 7B | Multilingual, strong at Chinese |
| nomic-embed-text | 274 MB | 137M | Text embeddings for RAG |

Pull and Run

# Download a model
ollama pull llama3.2

# Start interactive chat
ollama run llama3.2

# Run with a specific prompt (non-interactive)
ollama run llama3.2 "Explain Kubernetes in 3 sentences"

# List downloaded models
ollama list

# Remove a model
ollama rm codellama

REST API

Ollama exposes a REST API on port 11434, including an OpenAI-compatible endpoint under /v1. This is the primary way to integrate Ollama with applications.

Generate (Completion)

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "What is a reverse proxy?",
  "stream": false
}'
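
The same request can be issued from Python with only the standard library; a minimal sketch, assuming Ollama is running on localhost:11434 (build_generate_payload is a helper invented for this example):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"

def build_generate_payload(model: str, prompt: str, stream: bool = False) -> dict:
    """Assemble the JSON body expected by /api/generate."""
    return {"model": model, "prompt": prompt, "stream": stream}

def generate(model: str, prompt: str) -> str:
    """POST the prompt and return the model's full response text."""
    body = json.dumps(build_generate_payload(model, prompt)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# generate("llama3.2", "What is a reverse proxy?")  # requires a running Ollama
```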

Chat (Conversation)

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "system", "content": "You are a helpful sysadmin assistant."},
    {"role": "user", "content": "How do I check disk usage on Linux?"}
  ],
  "stream": false
}'
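
Because the API is stateless, multi-turn conversations work by resending the full messages list on every call. A small sketch of that bookkeeping, assuming a local Ollama instance (chat_once and extend_history are illustrative helpers, not part of Ollama):

```python
import json
import urllib.request

CHAT_URL = "http://localhost:11434/api/chat"

def chat_once(messages: list, model: str = "llama3.2") -> dict:
    """Send the conversation so far; return the assistant's reply message dict."""
    body = json.dumps({"model": model, "messages": messages, "stream": False}).encode()
    req = urllib.request.Request(
        CHAT_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["message"]

def extend_history(messages: list, reply: dict, next_user_turn: str) -> list:
    """Append the assistant reply plus the next user message for the follow-up call."""
    return messages + [reply, {"role": "user", "content": next_user_turn}]
```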

Embeddings (for RAG)

curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "Ollama is an open-source LLM runner"
}'
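
Embeddings become useful when you compare them; cosine similarity is the usual metric for ranking document chunks in a RAG pipeline. A self-contained sketch (the vectors below stand in for real /api/embed output):

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# With real embeddings you would rank chunks by similarity to the query:
# scores = [cosine_similarity(query_vec, doc_vec) for doc_vec in doc_vecs]
```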

OpenAI-Compatible Endpoint

For applications that support OpenAI’s API format, Ollama provides a compatible endpoint:

curl http://localhost:11434/v1/chat/completions -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ]
}'

Simply change the API base URL from https://api.openai.com/v1 to http://localhost:11434/v1 in most applications.
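
One practical wrinkle: the native /api/chat response carries the reply in message.content, while the /v1 endpoint uses OpenAI's choices array. A small parser that accepts either shape (a convenience written for this example, not part of Ollama):

```python
def extract_reply(response: dict) -> str:
    """Pull the assistant's text from Ollama-native or OpenAI-format JSON."""
    if "choices" in response:  # OpenAI-compatible /v1 endpoint
        return response["choices"][0]["message"]["content"]
    return response["message"]["content"]  # native /api/chat endpoint
```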


GPU Acceleration

NVIDIA GPUs

Ollama automatically detects and uses NVIDIA GPUs if CUDA drivers are installed:

# Verify NVIDIA driver
nvidia-smi

# Run a model (GPU is used automatically)
ollama run llama3.2

# Check GPU memory usage during inference
watch -n 1 nvidia-smi

| GPU | VRAM | Models You Can Run |
|---|---|---|
| RTX 3060 | 12 GB | 7B–13B fully in VRAM |
| RTX 3090 / 4090 | 24 GB | Up to 30B fully in VRAM |
| 2× RTX 3090 | 48 GB | 70B with model splitting |
| A100 | 80 GB | 70B+ fully in VRAM |

Apple Silicon (M1/M2/M3/M4)

Ollama automatically uses Metal for GPU acceleration on Apple Silicon. Because of the unified memory architecture, the GPU can address most of the system’s RAM, so practical model size scales with total memory:

| Chip | Unified Memory | Practical Max Model |
|---|---|---|
| M1 / M2 (8 GB) | 8 GB | 7B models |
| M1 Pro / M2 Pro (16 GB) | 16 GB | 13B models |
| M1 Max / M2 Max (32 GB) | 32 GB | 30B models |
| M1 Ultra / M2 Ultra (64 GB) | 64 GB | 70B models |
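
The VRAM figures above follow a simple rule of thumb: a model needs roughly its parameter count times bytes per weight, plus headroom for context and activations. A back-of-the-envelope calculator, where the 20% overhead factor is an assumption rather than a measured constant:

```python
def estimated_memory_gb(params_billion: float, bits_per_weight: int = 4,
                        overhead: float = 1.2) -> float:
    """Rough memory footprint: quantized weights plus a fixed overhead factor."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * overhead / 1e9

# A 7B model at 4-bit quantization lands around 4.2 GB, consistent with the
# ~4-5 GB download sizes in the model table earlier in this guide.
```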

Custom Models with Modelfiles

Create custom model configurations to set system prompts, temperature, and other parameters:

# Modelfile for a DevOps assistant
FROM llama3.2

SYSTEM """You are a senior DevOps engineer. Answer questions about Linux, Docker, 
Kubernetes, CI/CD, and infrastructure. Provide practical commands and examples. 
Keep answers concise and actionable."""

PARAMETER temperature 0.3
PARAMETER num_ctx 4096

Build and run:

ollama create devops-assistant -f Modelfile
ollama run devops-assistant "How do I debug a CrashLoopBackOff in Kubernetes?"

Open WebUI — ChatGPT-Like Interface

Open WebUI provides a polished, ChatGPT-style interface for Ollama:

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Access at http://localhost:3000. Features include:

  • Conversation history with search
  • Model switching within conversations
  • File uploads (PDF, images for vision models)
  • Multi-user with role-based access control
  • RAG — upload documents and chat with them
  • Web search integration (optional)
  • Custom model presets (system prompts, temperature)
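
If you run both Ollama and Open WebUI under Docker, a Compose file keeps them together; a minimal sketch, assuming the same ports and volume names used above (OLLAMA_BASE_URL tells Open WebUI where to reach Ollama on the Compose network):

```yaml
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama:/root/.ollama
    restart: always

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui:/app/backend/data
    depends_on:
      - ollama
    restart: always

volumes:
  ollama:
  open-webui:
```

Start both with docker compose up -d, then browse to http://localhost:3000 as before.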

Troubleshooting

Model Runs Slowly (CPU Only)

Cause: No GPU detected, or CUDA drivers not installed.

Fix:

  1. Check GPU detection: run ollama run llama3.2 and watch nvidia-smi in a second terminal; ollama ps also reports whether a loaded model is running on GPU or CPU.
  2. Install NVIDIA CUDA drivers: sudo apt install nvidia-driver-535 nvidia-cuda-toolkit.
  3. Restart Ollama: sudo systemctl restart ollama.
  4. As a fallback, use a smaller model: ollama run phi3 (3.8B parameters, fast on CPU).

Out of Memory Errors

Cause: Model too large for available RAM/VRAM.

Fix:

  1. Try a quantized (smaller) version: ollama pull llama3.2:8b-q4_0 (4-bit quantization needs roughly a quarter of the memory of full-precision fp16 weights).
  2. Close other memory-heavy applications.
  3. Add swap space for larger models: sudo fallocate -l 8G /swapfile && sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile.

API Connection Refused

Cause: Ollama is not listening on the expected address.

Fix:

  1. Check the service: systemctl status ollama.
  2. By default, Ollama only listens on 127.0.0.1:11434. To expose it on all interfaces (e.g., for Docker or remote access), set the environment variable: OLLAMA_HOST=0.0.0.0.
  3. Edit the systemd service: sudo systemctl edit ollama, add Environment="OLLAMA_HOST=0.0.0.0" under a [Service] heading, then restart with sudo systemctl restart ollama.
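
The drop-in override created by systemctl edit might look like this (the path shown is the standard override location):

```ini
# /etc/systemd/system/ollama.service.d/override.conf
[Service]
Environment="OLLAMA_HOST=0.0.0.0"
```

systemctl edit reloads the daemon automatically when you save; a plain restart of the ollama service then picks up the new listen address.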

Ollama vs. Other Local LLM Tools

| Feature | Ollama | LM Studio | Text Generation WebUI | llama.cpp |
|---|---|---|---|---|
| CLI Interface | ✅ | ❌ | ❌ | ✅ |
| GUI Interface | Via Open WebUI | ✅ Built-in | ✅ Built-in | ❌ |
| REST API | ✅ OpenAI-compatible | ✅ | ✅ | Limited |
| Docker Support | ✅ | ❌ | ✅ | ✅ |
| Model Library | 100+ models | 100+ models | Any GGUF | Any GGUF |
| GPU Support | CUDA + Metal | CUDA + Metal | CUDA + Metal | CUDA + Metal |
| Custom Models | ✅ Modelfile | ❌ | ✅ | ❌ |
| Server Mode | ✅ systemd | ❌ (desktop) | Limited | ✅ |
| Best For | Servers, APIs, Docker | Desktop users | Power users | Developers |

Summary

Ollama is the fastest way to go from zero to running AI models locally. A single curl | sh install followed by ollama pull llama3.2 gives you a fully functional local AI that is private, free, and API-compatible with OpenAI. Pair it with Open WebUI for a ChatGPT-like experience on your own hardware.