TL;DR — Quick Summary
Run LLMs like Llama 3, Mistral, Gemma, and Phi locally with Ollama. Guide covers installation, GPU acceleration, Docker deployment, REST API, and Open WebUI integration.
What Is Ollama?
Ollama is an open-source tool that makes it trivially easy to download, run, and manage large language models (LLMs) on your local machine. Think of it as Docker for AI models — you pull a model, run it, and interact via CLI or REST API.
With Ollama you get:
- One-command model downloads — `ollama pull llama3.2` and you are running Meta’s latest model
- GPU acceleration — automatic NVIDIA CUDA and Apple Silicon Metal support
- OpenAI-compatible API — drop-in replacement for many applications
- Complete privacy — your prompts and data never leave your network
- No API costs — run unlimited queries after hardware investment
- Model customization — create custom models with Modelfiles (system prompts, parameters)
- Docker support — deploy Ollama as a container for server environments
Prerequisites
- Linux, macOS (Apple Silicon recommended), or Windows 10/11.
- 8 GB RAM minimum (16 GB recommended for 13B models).
- 20+ GB free disk space for model storage.
- Optional: NVIDIA GPU with 6+ GB VRAM and CUDA drivers for GPU acceleration.
- Optional: Docker, if deploying Ollama as a container.
Installation
Linux (One-Line Install)
curl -fsSL https://ollama.com/install.sh | sh
This installs Ollama and configures it as a systemd service that starts automatically on boot. Verify with:
ollama --version
systemctl status ollama
Docker Deployment
For server environments or when you want isolation:
# CPU only
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# With NVIDIA GPU
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
macOS and Windows
Download the installer from ollama.com and run it. On macOS, Ollama runs as a menu bar application. On Windows, it runs as a system tray application.
Downloading and Running Models
Popular Models
| Model | Size | Parameters | Best For |
|---|---|---|---|
| `llama3.2` | 2.0 GB | 3B | General purpose, conversations |
| `llama3.1:70b` | 40 GB | 70B | High-quality reasoning (needs 64 GB RAM) |
| `mistral` | 4.1 GB | 7B | Fast, efficient general purpose |
| `gemma2` | 5.4 GB | 9B | Google’s quality model |
| `phi3` | 2.2 GB | 3.8B | Small, fast, surprisingly capable |
| `codellama` | 3.8 GB | 7B | Code generation and review |
| `llama3.2-vision` | 7.9 GB | 11B | Image understanding + text |
| `deepseek-coder-v2` | 8.9 GB | 16B | Advanced code generation |
| `qwen2.5` | 4.7 GB | 7B | Multilingual, strong at Chinese |
| `nomic-embed-text` | 274 MB | 137M | Text embeddings for RAG |
Pull and Run
# Download a model
ollama pull llama3.2
# Start interactive chat
ollama run llama3.2
# Run with a specific prompt (non-interactive)
ollama run llama3.2 "Explain Kubernetes in 3 sentences"
# List downloaded models
ollama list
# Remove a model
ollama rm codellama
REST API
Ollama exposes a REST API on port 11434, including an OpenAI-compatible endpoint. This is the primary way to integrate Ollama with applications.
Generate (Completion)
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "What is a reverse proxy?",
"stream": false
}'
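With the default `"stream": true`, the same endpoint emits one JSON object per line, each carrying a partial `response` string and a `done` flag on the final chunk. A minimal Python sketch for reassembling a streamed completion (the helper name is our own):

```python
import json

def join_stream(ndjson_lines):
    """Reassemble the full completion from Ollama's streaming output:
    one JSON object per line, each with a partial 'response' fragment
    and 'done': true on the final chunk."""
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```

When reading from `urllib.request.urlopen`, you can iterate over the response object line by line and feed it straight into a helper like this.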
Chat (Conversation)
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [
{"role": "system", "content": "You are a helpful sysadmin assistant."},
{"role": "user", "content": "How do I check disk usage on Linux?"}
],
"stream": false
}'
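Because `/api/chat` is stateless, the client must resend the full message history on every turn. A sketch of a minimal multi-turn helper in Python (function names are our own; the network call assumes a running local Ollama instance):

```python
import json
import urllib.request

OLLAMA_CHAT = "http://localhost:11434/api/chat"  # default local endpoint

def build_chat_payload(history, user_message, model="llama3.2"):
    """Append the new user turn and build the /api/chat request body."""
    history = history + [{"role": "user", "content": user_message}]
    return history, {"model": model, "messages": history, "stream": False}

def chat(history, user_message, model="llama3.2"):
    """Send one turn; returns (updated_history, assistant_reply).
    Requires Ollama to be running locally."""
    history, payload = build_chat_payload(history, user_message, model)
    req = urllib.request.Request(
        OLLAMA_CHAT,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.load(resp)["message"]  # {"role": "assistant", "content": ...}
    return history + [reply], reply["content"]
```

Keeping the returned history and passing it back into the next `chat()` call is what gives the model conversational memory.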
Embeddings (for RAG)
curl http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": "Ollama is an open-source LLM runner"
}'
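In a RAG pipeline you embed both your documents and the query, then rank documents by cosine similarity. A Python sketch (assumes Ollama is running with `nomic-embed-text` pulled; `/api/embed` returns one vector per input under the `embeddings` key):

```python
import json
import math
import urllib.request

OLLAMA_EMBED = "http://localhost:11434/api/embed"  # default local endpoint

def embed(texts, model="nomic-embed-text"):
    """Return one embedding vector per input string via /api/embed."""
    payload = {"model": model, "input": texts}
    req = urllib.request.Request(
        OLLAMA_EMBED,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embeddings"]

def cosine(a, b):
    """Cosine similarity between two vectors: the usual RAG ranking score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```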
OpenAI-Compatible Endpoint
For applications that support OpenAI’s API format, Ollama provides a compatible endpoint:
curl http://localhost:11434/v1/chat/completions -d '{
"model": "llama3.2",
"messages": [
{"role": "user", "content": "Hello!"}
]
}'
Simply change the API base URL from https://api.openai.com/v1 to http://localhost:11434/v1 in most applications.
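For example, the official `openai` Python package can talk to Ollama directly. The client insists on an API key, but Ollama ignores its value, so any placeholder works. A sketch (the import is guarded so the package stays an optional dependency):

```python
# Optional dependency: pip install openai
try:
    from openai import OpenAI
except ImportError:
    OpenAI = None

def local_client(base_url="http://localhost:11434/v1"):
    """OpenAI client pointed at the local Ollama endpoint.
    The api_key is a placeholder: Ollama does not check it."""
    if OpenAI is None:
        raise RuntimeError("install the openai package: pip install openai")
    return OpenAI(base_url=base_url, api_key="ollama")

if __name__ == "__main__":
    client = local_client()
    resp = client.chat.completions.create(
        model="llama3.2",
        messages=[{"role": "user", "content": "Hello!"}],
    )
    print(resp.choices[0].message.content)
```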
GPU Acceleration
NVIDIA GPUs
Ollama automatically detects and uses NVIDIA GPUs if CUDA drivers are installed:
# Verify NVIDIA driver
nvidia-smi
# Run a model (GPU is used automatically)
ollama run llama3.2
# Check GPU memory usage during inference
watch -n 1 nvidia-smi
| GPU | VRAM | Models You Can Run |
|---|---|---|
| RTX 3060 | 12 GB | 7B–13B fully in VRAM |
| RTX 3090 / 4090 | 24 GB | Up to 30B fully in VRAM |
| 2× RTX 3090 | 48 GB | 70B with model splitting |
| A100 | 80 GB | 70B+ fully in VRAM |
Apple Silicon (M1/M2/M3/M4)
Ollama uses Metal on Apple Silicon for GPU acceleration automatically. The unified memory architecture means the model can use most of the system memory:
| Chip | Unified Memory | Practical Max Model |
|---|---|---|
| M1 / M2 (8 GB) | 8 GB | 7B models |
| M1 Pro / M2 Pro (16 GB) | 16 GB | 13B models |
| M1 Max / M2 Max (32 GB) | 32 GB | 30B models |
| M1 Ultra / M2 Ultra (64 GB) | 64 GB | 70B models |
Custom Models with Modelfiles
Create custom model configurations to set system prompts, temperature, and other parameters:
# Modelfile for a DevOps assistant
FROM llama3.2
SYSTEM """You are a senior DevOps engineer. Answer questions about Linux, Docker,
Kubernetes, CI/CD, and infrastructure. Provide practical commands and examples.
Keep answers concise and actionable."""
PARAMETER temperature 0.3
PARAMETER num_ctx 4096
Build and run:
ollama create devops-assistant -f Modelfile
ollama run devops-assistant "How do I debug a CrashLoopBackOff in Kubernetes?"
Open WebUI — ChatGPT-Like Interface
Open WebUI provides a polished, ChatGPT-style interface for Ollama:
docker run -d -p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Access at http://localhost:3000. Features include:
- Conversation history with search
- Model switching within conversations
- File uploads (PDF, images for vision models)
- Multi-user with role-based access control
- RAG — upload documents and chat with them
- Web search integration (optional)
- Custom model presets (system prompts, temperature)
Troubleshooting
Model Runs Slowly (CPU Only)
Cause: No GPU detected, or CUDA drivers not installed.
Fix:
- Check GPU detection: run `ollama run llama3.2` and watch `nvidia-smi` output in another terminal.
- Install NVIDIA CUDA drivers: `sudo apt install nvidia-driver-535 nvidia-cuda-toolkit`.
- Restart Ollama: `sudo systemctl restart ollama`.
- As a fallback, use a smaller model: `ollama run phi3` (3.8B parameters, fast on CPU).
Out of Memory Errors
Cause: Model too large for available RAM/VRAM.
Fix:
- Try a smaller or more heavily quantized variant — check the model’s tag list on ollama.com for `q4` builds, or pull a smaller size such as `ollama pull llama3.2:1b` (Q4 quantization uses ~50% less memory).
- Close other memory-heavy applications.
- Add swap space for larger models: `sudo fallocate -l 8G /swapfile && sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile`.
API Connection Refused
Cause: Ollama is not listening on the expected address.
Fix:
- Check the service: `systemctl status ollama`.
- By default, Ollama only listens on `127.0.0.1:11434`. To expose it on all interfaces (e.g., for Docker or remote access), set the environment variable `OLLAMA_HOST=0.0.0.0`.
- Edit the systemd service: run `sudo systemctl edit ollama`, add `Environment="OLLAMA_HOST=0.0.0.0"` under a `[Service]` section, then restart with `sudo systemctl restart ollama`.
Ollama vs. Other Local LLM Tools
| Feature | Ollama | LM Studio | Text Generation WebUI | llama.cpp |
|---|---|---|---|---|
| CLI Interface | ✅ | ❌ | ❌ | ✅ |
| GUI Interface | Via Open WebUI | ✅ Built-in | ✅ Built-in | ❌ |
| REST API | ✅ OpenAI-compatible | ✅ | ✅ | Limited |
| Docker Support | ✅ | ❌ | ✅ | ❌ |
| Model Library | 100+ models | 100+ models | Any GGUF | Any GGUF |
| GPU Support | CUDA + Metal | CUDA + Metal | CUDA + Metal | CUDA + Metal |
| Custom Models | ✅ Modelfile | ❌ | ✅ | ❌ |
| Server Mode | ✅ systemd | ❌ (desktop) | ✅ | Limited |
| Best For | Servers, APIs, Docker | Desktop users | Power users | Developers |
Summary
Ollama is the fastest way to go from zero to running AI models locally. A single curl | sh install followed by ollama pull llama3.2 gives you a fully functional local AI that is private, free, and API-compatible with OpenAI. Pair it with Open WebUI for a ChatGPT-like experience on your own hardware.