Most RAG implementations fail not because the concept is wrong, but because the details are wrong. Chunk too large and you dilute relevance. Chunk too small and you lose context. Use the wrong embedding model and your retrieval returns garbage. This guide covers the decisions that separate working RAG from broken RAG.
Foundation: Why RAG Exists
Language models have a knowledge cutoff and a context window limit. RAG solves both problems by retrieving relevant information from external sources and injecting it into the prompt. Instead of asking the model to recall something from training, you give it the answer and ask it to synthesize.
Chunking: The Most Underrated Decision
Why Chunk Size Matters
Your documents must be split into chunks for embedding and retrieval. Too large (2000+ tokens) and each chunk contains too many topics — retrieval returns semi-relevant blobs. Too small (100 tokens) and chunks lose coherent meaning — you retrieve sentence fragments without context.
The Sweet Spot
For most document types, 300-500 tokens with 50-token overlap works well. The overlap ensures that information split across chunk boundaries is not lost entirely. For structured documents (code, legal text, manuals), respect natural boundaries — chunk at function/section/clause level rather than arbitrary token counts.
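Fixed-size chunking with overlap is straightforward to sketch. The version below uses whitespace-separated words as a stand-in for real tokenizer tokens (a production pipeline would count model tokens with the embedding model's own tokenizer), with the 400-token size and 50-token overlap from the range above:

```python
def chunk_text(text, chunk_size=400, overlap=50):
    """Split text into overlapping chunks.

    Uses whitespace-separated words as a stand-in for tokenizer
    tokens; a real pipeline would count model tokens instead.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        # Stop once the final chunk has absorbed the tail of the document.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Because consecutive chunks share `overlap` words, a sentence that straddles a chunk boundary appears intact in at least one of the two neighbouring chunks.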
Hierarchical Chunking
Advanced systems use two-level chunking: small chunks (200 tokens) for precise retrieval, linked to parent chunks (1000 tokens) for context. Retrieve using the small chunk's embedding, but inject the parent chunk into the prompt. This gives you precision and context simultaneously.
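The small-to-parent linkage can be sketched with plain dictionaries: each child chunk records the index of its parent, retrieval scores the children, and the winning child's parent is what gets injected into the prompt. The `query_scorer` callable here is a placeholder for real embedding similarity against the query:

```python
def build_hierarchy(words, parent_size=1000, child_size=200):
    """Two-level chunking: small children for retrieval, large parents for context."""
    parents, children = [], []
    for p_start in range(0, len(words), parent_size):
        parent_id = len(parents)
        parent_words = words[p_start:p_start + parent_size]
        parents.append(" ".join(parent_words))
        # Each child chunk remembers which parent it came from.
        for c_start in range(0, len(parent_words), child_size):
            children.append({
                "text": " ".join(parent_words[c_start:c_start + child_size]),
                "parent_id": parent_id,
            })
    return parents, children

def retrieve_with_context(query_scorer, parents, children):
    """Score the small chunks (precision), but return the parent (context).

    query_scorer(text) stands in for embedding similarity between the
    query and a child chunk; the query is baked into the closure.
    """
    best = max(children, key=lambda c: query_scorer(c["text"]))
    return parents[best["parent_id"]]
```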
Embedding: Choosing the Right Model
The embedding model converts text into vectors. The quality of your embeddings determines whether retrieval finds the right chunks.
- For general purpose: Use models from the MTEB leaderboard top-10. These balance quality across retrieval, classification, and clustering tasks.
- For domain-specific: Fine-tune an embedding model on your domain data. A medical RAG system with a general embedding model will miss domain-specific synonyms and relationships.
- For self-hosted: Smaller embedding models (under 500M parameters) run fast on CPU and produce good results. You do not need a massive embedding model.
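Whichever model you choose, retrieval ultimately reduces to nearest-neighbour search over the vectors it produces. A minimal cosine-similarity retrieval sketch, with hand-written toy vectors standing in for real model embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=3):
    """Indices of the k chunks whose embeddings are closest to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

A vector database does exactly this, plus indexing tricks (HNSW, quantization) so the search stays fast at millions of chunks.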
Retrieval: Beyond Basic Similarity
Hybrid Search
Pure vector similarity misses exact keyword matches. Pure keyword search misses semantic relationships. Hybrid search combines both — typically a weighted blend of vector cosine similarity and BM25 keyword scoring. Most production RAG systems use hybrid search.
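The blend itself is simple once you have both score lists. One subtlety worth showing: cosine similarities and BM25 scores live on different scales, so they must be normalized before mixing. A sketch, assuming the per-document vector and BM25 scores have already been computed upstream:

```python
def normalize(scores):
    """Min-max normalize so vector and keyword scores share a 0-1 scale."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def hybrid_scores(vector_scores, bm25_scores, alpha=0.7):
    """Weighted blend: alpha weights the semantic side, 1-alpha the keyword side."""
    v = normalize(vector_scores)
    b = normalize(bm25_scores)
    return [alpha * x + (1 - alpha) * y for x, y in zip(v, b)]
```

The `alpha=0.7` default is an illustrative starting point, not a recommendation; the right weight depends on your corpus and should come from evaluation. Reciprocal rank fusion is a common alternative that sidesteps score normalization entirely by blending ranks instead of scores.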
Reranking
Initial retrieval returns the top 20-50 candidates quickly. A reranker (a cross-encoder model) then rescores these candidates with much higher accuracy, since it attends to the full query-document interaction rather than comparing two independently computed vectors. This two-stage pipeline is typically more accurate than trying to squeeze more quality out of the single-stage vector search alone.
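The two-stage shape is independent of which models you plug in. In the sketch below, `fast_score` stands in for cheap vector similarity and `rerank_score` for an expensive cross-encoder; both are hypothetical callables you would replace with real models:

```python
def two_stage_retrieve(query, docs, fast_score, rerank_score,
                       n_candidates=20, k=3):
    """Stage 1: cheap score over all docs, keep a candidate pool.
    Stage 2: expensive reranker rescores only that pool."""
    candidates = sorted(docs,
                        key=lambda d: fast_score(query, d),
                        reverse=True)[:n_candidates]
    return sorted(candidates,
                  key=lambda d: rerank_score(query, d),
                  reverse=True)[:k]
```

The economics are the point: the cross-encoder is far too slow to run over the whole corpus, but running it over 20-50 candidates adds only tens of milliseconds.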
Common Misconceptions
"More retrieved context is better"
Beyond 3-5 relevant chunks, additional context often degrades quality. The model struggles to find the needle in a larger haystack. Retrieve less, but retrieve better.
"The most expensive embedding model is the best"
Embedding quality plateaus quickly. A well-tuned 100M-parameter embedding model often outperforms a generic 1B-parameter model on your specific use case.
Going Deeper
Implement a RAG pipeline from scratch using Ollama for generation and a local vector database like ChromaDB. Experiment with chunk sizes and observe how retrieval quality changes. Then add hybrid search and reranking to see the improvement firsthand.
[INTERNAL: Implement RAG from Scratch with Ollama and ChromaDB]
[INTERNAL: Self-Hosted Vector Databases: Chroma vs Qdrant vs Milvus]
Frequently Asked Questions
What vector database should I use?
For getting started: ChromaDB (simple, embeds in your app). For production: Qdrant or Weaviate (better scaling, filtering, and persistence). For massive scale: Milvus or Pinecone.
How do I evaluate RAG quality?
Measure retrieval recall (does the right chunk appear in results?) and answer faithfulness (does the model stick to retrieved context?). Build a test set of 50-100 question-answer pairs from your documents and automate evaluation.
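Retrieval recall is the easier metric to automate. A minimal sketch, assuming your test set pairs each question with the ID of the chunk that answers it, and `retrieve` is any function returning ranked chunk IDs:

```python
def retrieval_recall_at_k(test_set, retrieve, k=5):
    """Fraction of questions whose gold chunk appears in the top-k results.

    test_set: list of (question, gold_chunk_id) pairs.
    retrieve: function mapping a question to a ranked list of chunk IDs.
    """
    hits = sum(1 for question, gold in test_set
               if gold in retrieve(question)[:k])
    return hits / len(test_set)
```

Run this every time you change chunk size, embedding model, or retrieval strategy; without it you are tuning blind. Faithfulness is harder to automate and usually needs an LLM-as-judge comparing the answer against the retrieved context.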
Can I use RAG with small models?
Yes, and it is often more effective. RAG compensates for smaller model knowledge by providing explicit context. A 7B model with good RAG often outperforms a 70B model without it on domain-specific tasks.