TL;DR — Quick Summary
Build a private RAG (Retrieval-Augmented Generation) pipeline with Ollama. Use local embeddings, vector databases, and Open WebUI to chat with PDFs, docs, and knowledge bases without cloud APIs.
What Is RAG?
RAG (Retrieval-Augmented Generation) is a technique that makes AI models answer questions using your specific data instead of just their training knowledge. The process:
- Chunk — Split your documents into small segments (300-500 tokens each)
- Embed — Convert each chunk into a numerical vector using an embedding model
- Store — Save vectors in a vector database (ChromaDB, Qdrant, Milvus)
- Query — When you ask a question, embed the question, find similar chunks
- Generate — Pass the relevant chunks + your question to the LLM for an answer
With Ollama, the entire pipeline runs locally — your documents never leave your machine.
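The retrieve step above boils down to nearest-neighbor search over vectors. A toy sketch of that idea (the three-dimensional vectors here are made up by hand purely for illustration; a real pipeline gets them from an embedding model and searches them with a vector database):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# "Stored" chunks with hand-made embeddings (a real store holds 384-1024 dims)
store = {
    "Ollama runs models locally.": [0.9, 0.1, 0.0],
    "ChromaDB stores vectors.":    [0.1, 0.9, 0.0],
    "Cats sleep a lot.":           [0.0, 0.1, 0.9],
}

# Pretend embedding of the question "How does Ollama run?"
query_vec = [0.85, 0.2, 0.05]

# Retrieval = pick the stored chunk most similar to the query
best = max(store, key=lambda chunk: cosine(store[chunk], query_vec))
print(best)  # the chunk about Ollama scores highest
```

The generate step then pastes the retrieved chunk(s) plus the question into the LLM's prompt.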
Why Local RAG?
| Aspect | Local (Ollama) | Cloud (OpenAI) |
|---|---|---|
| Privacy | ✅ Data stays local | ❌ Data sent to API |
| Cost | Free | $0.01-0.06 per 1K tokens |
| Usage limits | None | Rate limits and per-token billing |
| Speed | Depends on hardware | Fast (datacenter GPUs) |
| Offline | ✅ Works offline | ❌ Requires internet |
| Customization | Full control | Limited |
Prerequisites
- Ollama installed and running (see Ollama setup guide).
- Embedding model: `ollama pull nomic-embed-text`
- Chat model: `ollama pull llama3.2` (or any preferred LLM)
- RAM: 8 GB minimum (16 GB recommended for larger document sets)
Option A: No-Code RAG with Open WebUI
The easiest way to do RAG is through Open WebUI (see Open WebUI setup):
Configure the Embedding Model
- Go to Admin Panel → Settings → Documents.
- Set Embedding Model to `nomic-embed-text`.
- Set Chunk Size to 500 and Chunk Overlap to 50.
Upload and Chat
- In any conversation, click the 📎 paperclip icon.
- Upload your PDF, DOCX, TXT, CSV, or MD file.
- Open WebUI automatically processes: chunk → embed → store → ready.
- Ask questions about the document — responses cite relevant passages.
Document Collections
For persistent libraries that you can reference across conversations:
- Go to Workspace → Documents.
- Upload multiple files to create a collection.
- In any chat, type `#collection-name` to include that collection's context.
Option B: Custom RAG Pipeline with Python
For full control, build your own pipeline:
Install Dependencies
```shell
pip install langchain langchain-community chromadb ollama pypdf
```
Basic RAG Script
```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA

# 1. Load document
loader = PyPDFLoader("my_document.pdf")
documents = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)
chunks = splitter.split_documents(documents)

# 3. Create embeddings and store in ChromaDB
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)

# 4. Create RAG chain
llm = Ollama(model="llama3.2")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True,
)

# 5. Query
result = qa_chain.invoke({"query": "What does the document say about X?"})
print(result["result"])
```
Choosing the Right Models
Embedding Models
| Model | Size | Dimensions | Best For |
|---|---|---|---|
| `nomic-embed-text` | 274 MB | 768 | General English documents |
| `mxbai-embed-large` | 670 MB | 1024 | Multilingual, higher quality |
| `all-minilm` | 46 MB | 384 | Lightweight, fastest |
| `snowflake-arctic-embed` | 669 MB | 1024 | Technical / code documents |
Chat Models for RAG
| Model | Size | Best For |
|---|---|---|
| `llama3.2` | 4.7 GB | General purpose, good at following instructions |
| `mistral` | 4.1 GB | Fast, concise answers |
| `gemma2:9b` | 5.4 GB | High quality, good reasoning |
| `phi3` | 2.2 GB | Small but capable, fast on CPU |
Pull your chosen models:

```shell
ollama pull nomic-embed-text
ollama pull llama3.2
```
Optimizing RAG Quality
Chunking Strategy
| Parameter | Recommended | Effect |
|---|---|---|
| Chunk size | 400-600 tokens | Smaller = more precise retrieval, larger = more context |
| Chunk overlap | 50-100 tokens | Prevents cutting mid-sentence |
| Split by | Paragraphs/sections | Preserves semantic coherence |
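The size/overlap trade-off can be seen in a minimal character-based splitter. This is purely illustrative (`chunk_text` is not a library function): real splitters such as `RecursiveCharacterTextSplitter` count tokens and prefer paragraph boundaries rather than cutting at fixed offsets.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character windows.
    Consecutive windows share `overlap` characters so sentences
    cut at a boundary still appear whole in one chunk."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("a" * 1200, chunk_size=500, overlap=50)
# windows start at 0, 450, 900 -> lengths 500, 500, 300
```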
Retrieval Parameters
- Top-K: Retrieve the top 3-5 most similar chunks (more ≠ better — too many adds noise).
- Similarity threshold: Filter out chunks below a minimum relevance score.
- Re-ranking: Use a second model to re-rank retrieved chunks by relevance.
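Top-K and a similarity threshold combine into one small filtering step. A sketch with made-up scores (`top_k_above_threshold` is a hypothetical helper, not a vector-database API; real stores apply these filters internally):

```python
def top_k_above_threshold(scored_chunks, k=5, min_score=0.3):
    """Keep chunks scoring at least min_score, then take the K best.
    scored_chunks: list of (chunk_text, similarity) pairs."""
    kept = [pair for pair in scored_chunks if pair[1] >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]

hits = top_k_above_threshold(
    [("a", 0.91), ("b", 0.12), ("c", 0.55), ("d", 0.40)], k=3, min_score=0.3
)
# "b" falls below the threshold; the rest come back highest score first
```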
Troubleshooting
AI Ignores Document Content
Cause: Chunks are too large or the question doesn’t match embedded content well.
Fix: Reduce chunk size to 300-400 tokens. Rephrase your question to use terms from the document.
Hallucinated Answers
Cause: The LLM fills in gaps when the retrieved context is insufficient.
Fix: Add a system prompt: “Only answer based on the provided context. If the context doesn’t contain the answer, say ‘I don’t have enough information.’”
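In a custom pipeline, one way to apply this fix is to prepend the grounding instruction to every prompt before it reaches the LLM. `grounded_prompt` below is an illustrative helper, not part of LangChain or Ollama:

```python
def grounded_prompt(context, question):
    """Wrap retrieved context and the user question in a grounding
    instruction that discourages answering beyond the context."""
    return (
        "Only answer based on the provided context. If the context "
        "doesn't contain the answer, say 'I don't have enough information.'\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = grounded_prompt("The warranty lasts 2 years.", "How long is the warranty?")
```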
Embedding Generation Is Slow
Fix: Use a lightweight embedding model (`all-minilm` is the fastest of the options above; `nomic-embed-text` balances speed and quality). On a GPU, embedding runs 10-50x faster than on a CPU. For large document sets, batch the embedding process.
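Batching here just means grouping chunks before sending them to the embedding model, so you pay fewer round trips. `batched` below is an illustrative helper (feed each batch to your embedding call of choice):

```python
def batched(items, size):
    """Yield successive fixed-size batches from a list; the final
    batch may be shorter if the list doesn't divide evenly."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

batches = list(batched(list(range(10)), 4))
# -> [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```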
Summary
RAG with Ollama lets you build a private, free AI assistant that actually knows your documents. Use Open WebUI for instant no-code RAG, or build a custom LangChain pipeline for full control. All processing stays local — your data never leaves your machine.