TL;DR — Quick Summary
Build a private RAG (Retrieval-Augmented Generation) pipeline with Ollama. Use local embeddings, vector databases, and Open WebUI to chat with PDFs, docs, and knowledge bases without cloud APIs.
What Is RAG?
RAG (Retrieval-Augmented Generation) is a technique that makes AI models answer questions using your specific data instead of just their training knowledge. The process:
- Chunk — Split your documents into small segments (300-500 tokens each)
- Embed — Convert each chunk into a numerical vector using an embedding model
- Store — Save vectors in a vector database (ChromaDB, Qdrant, Milvus)
- Query — When you ask a question, embed the question, find similar chunks
- Generate — Pass the relevant chunks + your question to the LLM for an answer
With Ollama, the entire pipeline runs locally — your documents never leave your machine.
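The retrieve step above boils down to nearest-neighbor search over vectors. A toy sketch of that idea (the three-dimensional vectors here are made up by hand purely for illustration; a real pipeline gets them from an embedding model and searches them with a vector database):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# "Stored" chunks with hand-made embeddings (a real store holds 384-1024 dims)
store = {
    "Ollama runs models locally.": [0.9, 0.1, 0.0],
    "ChromaDB stores vectors.":    [0.1, 0.9, 0.0],
    "Cats sleep a lot.":           [0.0, 0.1, 0.9],
}

# Pretend embedding of the question "How does Ollama run?"
query_vec = [0.85, 0.2, 0.05]

# Retrieval = pick the stored chunk most similar to the query
best = max(store, key=lambda chunk: cosine(store[chunk], query_vec))
print(best)  # the chunk about Ollama scores highest
```

The generate step then pastes the retrieved chunk(s) plus the question into the LLM's prompt.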
Why Local RAG?
| Aspect | Local (Ollama) | Cloud (OpenAI) |
|---|---|---|
| Privacy | ✅ Data stays local | ❌ Data sent to API |
| Cost | Free | $0.01-0.06 per 1K tokens |
| Usage limits | None | Rate limits and per-token billing |
| Speed | Depends on hardware | Fast (datacenter GPUs) |
| Offline | ✅ Works offline | ❌ Requires internet |
| Customization | Full control | Limited |
Prerequisites
- Ollama installed and running (see Ollama setup guide).
- Embedding model: `ollama pull nomic-embed-text`
- Chat model: `ollama pull llama3.2` (or any preferred LLM)
- RAM: 8 GB minimum (16 GB recommended for larger document sets)
Option A: No-Code RAG with Open WebUI
The easiest way to do RAG is through Open WebUI (see Open WebUI setup):
Configure the Embedding Model
- Go to Admin Panel → Settings → Documents.
- Set Embedding Model to `nomic-embed-text`.
- Set Chunk Size to 500 and Chunk Overlap to 50.
Upload and Chat
- In any conversation, click the 📎 paperclip icon.
- Upload your PDF, DOCX, TXT, CSV, or MD file.
- Open WebUI automatically processes: chunk → embed → store → ready.
- Ask questions about the document — responses cite relevant passages.
Document Collections
For persistent libraries that you can reference across conversations:
- Go to Workspace → Documents.
- Upload multiple files to create a collection.
- In any chat, type `#collection-name` to include that collection's context.
Option B: Custom RAG Pipeline with Python
For full control, build your own pipeline:
Install Dependencies
```shell
pip install langchain langchain-community chromadb ollama pypdf
```
Basic RAG Script
```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA

# 1. Load document
loader = PyPDFLoader("my_document.pdf")
documents = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
chunks = splitter.split_documents(documents)

# 3. Create embeddings and store in ChromaDB
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)

# 4. Create RAG chain
llm = Ollama(model="llama3.2")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True,
)

# 5. Query
result = qa_chain.invoke({"query": "What does the document say about X?"})
print(result["result"])
```
Choosing the Right Models
Embedding Models
| Model | Size | Dimensions | Best For |
|---|---|---|---|
| `nomic-embed-text` | 274 MB | 768 | General English documents |
| `mxbai-embed-large` | 670 MB | 1024 | Multilingual, higher quality |
| `all-minilm` | 46 MB | 384 | Lightweight, fastest |
| `snowflake-arctic-embed` | 669 MB | 1024 | Technical / code documents |
Chat Models for RAG
| Model | Size | Best For |
|---|---|---|
| `llama3.2` | 4.7 GB | General purpose, good at following instructions |
| `mistral` | 4.1 GB | Fast, concise answers |
| `gemma2:9b` | 5.4 GB | High quality, good reasoning |
| `phi3` | 2.2 GB | Small but capable, fast on CPU |
Pull your chosen models:

```shell
ollama pull nomic-embed-text
ollama pull llama3.2
```
Optimizing RAG Quality
Chunking Strategy
| Parameter | Recommended | Effect |
|---|---|---|
| Chunk size | 400-600 tokens | Smaller = more precise retrieval, larger = more context |
| Chunk overlap | 50-100 tokens | Prevents cutting mid-sentence |
| Split by | Paragraphs/sections | Preserves semantic coherence |
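The size/overlap trade-off can be seen in a minimal character-based splitter. This is purely illustrative (`chunk_text` is not a library function): real splitters such as `RecursiveCharacterTextSplitter` count tokens and prefer paragraph boundaries rather than cutting at fixed offsets.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character windows.
    Consecutive windows share `overlap` characters so sentences
    cut at a boundary still appear whole in one chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 1200, chunk_size=500, overlap=50)
# windows start at 0, 450, 900 -> lengths 500, 500, 300
```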
Retrieval Parameters
- Top-K: Retrieve the top 3-5 most similar chunks (more ≠ better — too many adds noise).
- Similarity threshold: Filter out chunks below a minimum relevance score.
- Re-ranking: Use a second model to re-rank retrieved chunks by relevance.
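Top-K and a similarity threshold combine into one small filtering step. A sketch with made-up scores (`top_k_above_threshold` is a hypothetical helper, not a vector-database API; real stores apply these filters internally):

```python
def top_k_above_threshold(scored_chunks, k=5, min_score=0.3):
    """Keep chunks scoring at least min_score, then take the K best.
    scored_chunks: list of (chunk_text, similarity) pairs."""
    kept = [pair for pair in scored_chunks if pair[1] >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]

hits = top_k_above_threshold(
    [("a", 0.91), ("b", 0.12), ("c", 0.55), ("d", 0.40)], k=3, min_score=0.3
)
# "b" falls below the threshold; the rest come back highest score first
```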
Troubleshooting
AI Ignores Document Content
Cause: Chunks are too large or the question doesn’t match embedded content well.
Fix: Reduce chunk size to 300-400 tokens. Rephrase your question to use terms from the document.
Hallucinated Answers
Cause: The LLM fills in gaps when the retrieved context is insufficient.
Fix: Add a system prompt: “Only answer based on the provided context. If the context doesn’t contain the answer, say ‘I don’t have enough information.’”
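In a custom pipeline, one way to apply this fix is to prepend the grounding instruction to every prompt before it reaches the LLM. `grounded_prompt` below is an illustrative helper, not part of LangChain or Ollama:

```python
def grounded_prompt(context, question):
    """Wrap retrieved context and the user question in a grounding
    instruction that discourages answering beyond the context."""
    return (
        "Only answer based on the provided context. If the context "
        "doesn't contain the answer, say 'I don't have enough information.'\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = grounded_prompt("The warranty lasts 2 years.", "How long is the warranty?")
```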
Embedding Generation Is Slow
Fix: Use a lightweight embedding model (`all-minilm` is the fastest of the options above; `nomic-embed-text` balances speed and quality). On a GPU, embedding runs 10-50x faster than on a CPU. For large document sets, batch the embedding process.
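Batching here just means grouping chunks before sending them to the embedding model, so you pay fewer round trips. `batched` below is an illustrative helper (feed each batch to your embedding call of choice):

```python
def batched(items, size):
    """Yield successive fixed-size batches from a list; the final
    batch may be shorter if the list doesn't divide evenly."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

batches = list(batched(list(range(10)), 4))
# -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```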
Summary
RAG with Ollama lets you build a private, free AI assistant that actually knows your documents. Use Open WebUI for instant no-code RAG, or build a custom LangChain pipeline for full control. All processing stays local — your data never leaves your machine.