TL;DR — Quick Summary

Build a private RAG (Retrieval-Augmented Generation) pipeline with Ollama. Use local embeddings, vector databases, and Open WebUI to chat with PDFs, docs, and knowledge bases without cloud APIs.

What Is RAG?

RAG (Retrieval-Augmented Generation) is a technique that makes AI models answer questions using your specific data instead of just their training knowledge. The process:

  1. Chunk — Split your documents into small segments (300-500 tokens each)
  2. Embed — Convert each chunk into a numerical vector using an embedding model
  3. Store — Save vectors in a vector database (ChromaDB, Qdrant, Milvus)
  4. Query — When you ask a question, embed the question, find similar chunks
  5. Generate — Pass the relevant chunks + your question to the LLM for an answer

With Ollama, the entire pipeline runs locally — your documents never leave your machine.
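The five steps above can be sketched end to end without any framework. The snippet below is a minimal toy sketch: it substitutes a naive bag-of-words "embedding" and cosine similarity for a real embedding model (in practice you would call nomic-embed-text through Ollama), just to make the retrieve-then-generate flow concrete.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1-3. Chunk, embed, store
chunks = [
    "Ollama runs large language models locally.",
    "ChromaDB stores embedding vectors on disk.",
    "RAG retrieves relevant chunks before generating.",
]
store = [(chunk, embed(chunk)) for chunk in chunks]

# 4. Query: embed the question, rank stored chunks by similarity
question = "Where are embedding vectors stored?"
q_vec = embed(question)
ranked = sorted(store, key=lambda item: cosine(q_vec, item[1]), reverse=True)
context = ranked[0][0]

# 5. Generate: context + question become the prompt sent to the LLM
prompt = f"Context: {context}\n\nQuestion: {question}"
print(context)
```

With a real embedding model the vectors capture meaning rather than word overlap, but the retrieval logic stays the same.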

Why Local RAG?

| Aspect | Local (Ollama) | Cloud (OpenAI) |
|---|---|---|
| Privacy | ✅ Data stays local | ❌ Data sent to API |
| Cost | Free | $0.01-0.06 per 1K tokens |
| Usage limits | None (only the model's context window applies) | 4K-128K context window, rate limits |
| Speed | Depends on hardware | Fast (datacenter GPUs) |
| Offline | ✅ Works offline | ❌ Requires internet |
| Customization | Full control | Limited |

Prerequisites

  • Ollama installed and running (see Ollama setup guide).
  • Embedding model: ollama pull nomic-embed-text
  • Chat model: ollama pull llama3.2 (or any preferred LLM)
  • RAM: 8 GB minimum (16 GB recommended for larger document sets)

Option A: No-Code RAG with Open WebUI

The easiest way to do RAG is through Open WebUI (see Open WebUI setup):

Configure the Embedding Model

  1. Go to Admin Panel → Settings → Documents.
  2. Set Embedding Model to nomic-embed-text.
  3. Set Chunk Size to 500 and Chunk Overlap to 50.

Upload and Chat

  1. In any conversation, click the 📎 paperclip icon.
  2. Upload your PDF, DOCX, TXT, CSV, or MD file.
  3. Open WebUI automatically processes: chunk → embed → store → ready.
  4. Ask questions about the document — responses cite relevant passages.

Document Collections

For persistent libraries that you can reference across conversations:

  1. Go to Workspace → Documents.
  2. Upload multiple files to create a collection.
  3. In any chat, type #collection-name to include that collection’s context.

Option B: Custom RAG Pipeline with Python

For full control, build your own pipeline:

Install Dependencies

pip install langchain langchain-community chromadb ollama pypdf

Basic RAG Script

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.llms import Ollama
from langchain.chains import RetrievalQA

# 1. Load document
loader = PyPDFLoader("my_document.pdf")
documents = loader.load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)
chunks = splitter.split_documents(documents)

# 3. Create embeddings and store in ChromaDB
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# 4. Create RAG chain
llm = Ollama(model="llama3.2")
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True
)

# 5. Query
result = qa_chain.invoke({"query": "What does the document say about X?"})
print(result["result"])

Choosing the Right Models

Embedding Models

| Model | Size | Dimensions | Best For |
|---|---|---|---|
| nomic-embed-text | 274 MB | 768 | General English documents |
| mxbai-embed-large | 670 MB | 1024 | Multilingual, higher quality |
| all-minilm | 46 MB | 384 | Lightweight, fastest |
| snowflake-arctic-embed | 669 MB | 1024 | Technical / code documents |

Chat Models for RAG

| Model | Size | Best For |
|---|---|---|
| llama3.2 | 4.7 GB | General purpose, good at following instructions |
| mistral | 4.1 GB | Fast, concise answers |
| gemma2:9b | 5.4 GB | High quality, good reasoning |
| phi3 | 2.2 GB | Small but capable, fast on CPU |

Pull your chosen models:

ollama pull nomic-embed-text
ollama pull llama3.2

Optimizing RAG Quality

Chunking Strategy

| Parameter | Recommended | Effect |
|---|---|---|
| Chunk size | 400-600 tokens | Smaller = more precise retrieval; larger = more context |
| Chunk overlap | 50-100 tokens | Prevents cutting mid-sentence |
| Split by | Paragraphs/sections | Preserves semantic coherence |
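To see how chunk size and overlap interact, here is a minimal word-based chunker (words approximate tokens; real splitters like RecursiveCharacterTextSplitter also respect paragraph and sentence boundaries). The function name is illustrative, not from any library.

```python
def chunk_text(words_per_chunk, overlap, text):
    # Split text into fixed-size word windows where each window
    # shares `overlap` words with the previous one, so no sentence
    # fragment is stranded at a chunk boundary.
    words = text.split()
    step = words_per_chunk - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + words_per_chunk]))
        if start + words_per_chunk >= len(words):
            break
    return chunks

sample = " ".join(f"w{i}" for i in range(12))
chunks = chunk_text(5, 2, sample)
print(chunks)
```

With 12 words, a window of 5, and an overlap of 2, this yields four chunks whose edges repeat, which is exactly why overlap slightly increases storage but improves retrieval at boundaries.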

Retrieval Parameters

  • Top-K: Retrieve the top 3-5 most similar chunks (more ≠ better — too many adds noise).
  • Similarity threshold: Filter out chunks below a minimum relevance score.
  • Re-ranking: Use a second model to re-rank retrieved chunks by relevance.
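The first two bullets combine naturally: filter by a minimum score, then keep the top-k. A minimal sketch (the function and threshold values are illustrative, not from LangChain or ChromaDB):

```python
def retrieve(scored_chunks, k=5, min_score=0.2):
    # scored_chunks: list of (chunk_text, similarity_score) pairs.
    # Drop chunks below the relevance threshold, then take the top-k
    # by score, so low-relevance noise never reaches the LLM prompt.
    relevant = [(c, s) for c, s in scored_chunks if s >= min_score]
    relevant.sort(key=lambda pair: pair[1], reverse=True)
    return relevant[:k]

scored = [("a", 0.91), ("b", 0.15), ("c", 0.55), ("d", 0.40), ("e", 0.05)]
top = retrieve(scored, k=3, min_score=0.2)
print(top)
```

Re-ranking would then run a second, more expensive model over just these survivors to reorder them before generation.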

Troubleshooting

AI Ignores Document Content

Cause: Chunks are too large or the question doesn’t match embedded content well.

Fix: Reduce chunk size to 300-400 tokens. Rephrase your question to use terms from the document.

Hallucinated Answers

Cause: The LLM fills in gaps when the retrieved context is insufficient.

Fix: Add a system prompt: “Only answer based on the provided context. If the context doesn’t contain the answer, say ‘I don’t have enough information.’”
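One way to apply that fix is to bake the instruction into the prompt you assemble from retrieved chunks. The helper below is a hypothetical sketch of that assembly; with Ollama you would send the resulting string as the prompt (or place the instruction in the system role of a chat request).

```python
def build_grounded_prompt(context_chunks, question):
    # Grounding instruction: tell the model to refuse rather than
    # invent an answer when the retrieved context is insufficient.
    system = (
        "Only answer based on the provided context. If the context "
        "doesn't contain the answer, say 'I don't have enough information.'"
    )
    context = "\n\n".join(context_chunks)
    return f"{system}\n\nContext:\n{context}\n\nQuestion: {question}"

prompt = build_grounded_prompt(
    ["Ollama runs models locally."], "What is Ollama?"
)
print(prompt)
```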

Embedding Generation Is Slow

Fix: Use a smaller embedding model — all-minilm is the fastest, while nomic-embed-text balances speed and quality. On GPU, embedding is 10-50x faster. For large document sets, batch the embedding process.
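Batching is just grouping documents so the embedding model is called once per group instead of once per document. A minimal sketch (the `batched` helper is illustrative, not a library function):

```python
def batched(items, batch_size):
    # Yield fixed-size batches; the final batch may be smaller.
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

docs = [f"doc{i}" for i in range(7)]
batches = list(batched(docs, 3))
print([len(b) for b in batches])  # → [3, 3, 1]
```

Each batch would then go to the embedding model in a single request, cutting per-call overhead.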


Summary

RAG with Ollama lets you build a private, free AI assistant that actually knows your documents. Use Open WebUI for instant no-code RAG, or build a custom LangChain pipeline for full control. All processing stays local — your data never leaves your machine.