Building RAG Pipelines That Actually Work
The Promise vs. Reality
Every RAG tutorial makes it look effortless: chunk your documents, embed them, throw them into a vector database, and query with an LLM. Ship it. Done.
In practice, the gap between a demo and a production RAG system is enormous. After building several pipelines — from a university chatbot to a document generation system — here's what I've learned.
Chunking Strategy Is Everything
The single biggest lever you have is how you split your documents. Fixed-size chunks (e.g., 512 tokens or 1,000 characters) are the default everyone reaches for. They're also usually wrong.
What works better:
- Semantic chunking — split on paragraph boundaries, headers, or natural topic shifts
- Overlapping windows — 20% overlap between chunks catches context that straddles boundaries
- Metadata enrichment — attach the parent section title, document name, and position to every chunk
A chunk that says "The results showed a 40% improvement" is useless without knowing what was being measured. Metadata solves this.
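Here's roughly what that looks like in practice. This is a minimal sketch, assuming markdown-style documents; the Chunk fields, the 1,000-character window, and the header regex are illustrative, not a recipe.

```python
# Semantic-ish chunking sketch: split on headers, slide overlapping
# windows over each section, attach metadata to every chunk.
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    doc_name: str
    section: str
    position: int

def chunk_document(text: str, doc_name: str,
                   max_chars: int = 1000, overlap: float = 0.2) -> list[Chunk]:
    chunks: list[Chunk] = []
    position = 0
    # Split before markdown headers so each piece stays inside one section.
    sections = re.split(r"\n(?=#{1,6} )", text)
    step = int(max_chars * (1 - overlap))  # 20% overlap between windows
    for section in sections:
        lines = section.splitlines()
        if lines and lines[0].lstrip().startswith("#"):
            title = lines[0].lstrip("# ").strip()
            body = "\n".join(lines[1:]).strip()
        else:
            title = ""
            body = section.strip()
        # Slide an overlapping window over the section body.
        for start in range(0, max(len(body), 1), step):
            piece = body[start:start + max_chars].strip()
            if piece:
                chunks.append(Chunk(piece, doc_name, title, position))
                position += 1
    return chunks
```

Character windows keep the sketch short; in a real pipeline I'd split on sentence boundaries and embed the section title together with the chunk text.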
Embedding Model Selection
Not all embeddings are created equal. For English technical content, I've found:
- Tier 1: OpenAI text-embedding-3-large (best quality, costs money)
- Tier 2: BGE-large-en-v1.5 (open source, surprisingly close)
- Tier 3: all-MiniLM-L6-v2 (fast, good enough for prototypes)
The key insight: match your embedding model to your query style. If users ask formal questions, a model trained on Q&A pairs outperforms one trained on general text.
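To make that concrete, here's a hedged sketch using sentence-transformers with the Tier 2 model. The query prefix follows the BGE model card's recommendation for short-query-to-passage retrieval (check the card for whichever version you deploy); the sample strings are made up.

```python
# Encode passages and a query with BGE-large-en-v1.5.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

passages = ["Semantic chunking splits documents on natural topic boundaries."]
passage_vecs = model.encode(passages, normalize_embeddings=True)

query = "How should I split documents for retrieval?"
query_vec = model.encode(
    "Represent this sentence for searching relevant passages: " + query,
    normalize_embeddings=True,
)

# With normalized vectors, the dot product is cosine similarity.
scores = passage_vecs @ query_vec
```

Swapping tiers is a one-line model change, which is exactly why it's worth benchmarking against your own queries rather than trusting leaderboards.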
FAISS vs. ChromaDB
I've used both extensively:
FAISS is pure speed. It's a library, not a database — you manage persistence yourself. Perfect when you need sub-millisecond search on millions of vectors and you're comfortable with the operational overhead.
ChromaDB is developer experience. Built-in persistence, metadata filtering, and a clean Python API. Perfect for prototypes and applications under a million documents.
My rule: start with ChromaDB, migrate to FAISS when (if) you hit scale constraints.
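For reference, here's roughly what each looks like on the same corpus. This is a sketch, not a drop-in snippet: it assumes you already have `texts`, a float32 `vectors` array, and an `embed()` helper that returns a 2-D float32 array, and the collection names and paths are placeholders.

```python
import faiss
import chromadb

# --- FAISS: just an index. Persistence and metadata are your problem.
index = faiss.IndexFlatIP(vectors.shape[1])   # exact inner-product search
index.add(vectors)                            # normalize rows first for cosine
scores, ids = index.search(embed(["my query"]), 5)
faiss.write_index(index, "docs.faiss")        # manual save/load

# --- ChromaDB: persistence and metadata filtering come built in.
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("docs")
collection.add(
    ids=[str(i) for i in range(len(texts))],
    documents=texts,
    embeddings=vectors.tolist(),
    metadatas=[{"doc_name": "handbook.md"} for _ in texts],
)
hits = collection.query(
    query_embeddings=embed(["my query"]).tolist(),
    n_results=5,
    where={"doc_name": "handbook.md"},        # metadata filter at query time
)
```

The difference in bookkeeping is the whole argument: FAISS gives you an index, ChromaDB gives you a store.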
The Retrieval-Generation Gap
The hardest problem isn't retrieval — it's what happens between retrieval and generation. You retrieve 5 chunks, but:
- Chunk 3 contradicts chunk 1
- Chunk 5 is from an outdated version of the docs
- Chunks 2 and 4 say the same thing differently
Solutions that work:
- Re-ranking — use a cross-encoder to re-score retrieved chunks against the actual query
- Deduplication — cluster similar chunks and pick the best representative
- Freshness weighting — multiply relevance scores by a time-decay factor
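Here's a sketch of the first and third of those, with deduplication left out for brevity. The cross-encoder is a common public MS MARCO model; the 180-day half-life and the chunk fields (`text`, `age_days`) are assumptions you'd tune to your corpus.

```python
import math
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[dict],
           half_life_days: float = 180.0, top_n: int = 3) -> list[dict]:
    # The cross-encoder scores every (query, chunk) pair jointly:
    # slower than bi-encoder retrieval, but much more precise.
    raw = reranker.predict([(query, c["text"]) for c in chunks])
    for chunk, score in zip(chunks, raw):
        relevance = 1.0 / (1.0 + math.exp(-float(score)))  # squash logit to (0, 1)
        # Exponential decay: a chunk one half-life old counts half as much.
        decay = 0.5 ** (chunk.get("age_days", 0) / half_life_days)
        chunk["score"] = relevance * decay
    return sorted(chunks, key=lambda c: c["score"], reverse=True)[:top_n]
```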
Prompt Engineering for RAG
The system prompt matters more than you think. Here's a pattern that works well with Claude:
```
You are answering questions based on retrieved context.

Rules:
1. Only use information from the provided context
2. If the context doesn't contain the answer, say so
3. Cite which section your answer comes from
4. If sources conflict, acknowledge the contradiction

Context:
{retrieved_chunks}
```
The explicit instruction to acknowledge limitations prevents hallucination better than any amount of temperature tuning.
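Wiring that template into a pipeline is straightforward. Here's a minimal sketch with the Anthropic Python SDK; the model id is illustrative, and the chunk fields (`section`, `text`) assume the metadata enrichment from earlier.

```python
import anthropic

SYSTEM_TEMPLATE = """You are answering questions based on retrieved context.
Rules:
1. Only use information from the provided context
2. If the context doesn't contain the answer, say so
3. Cite which section your answer comes from
4. If sources conflict, acknowledge the contradiction

Context:
{retrieved_chunks}"""

def answer(question: str, chunks: list[dict]) -> str:
    # Label each chunk with its section title so rule 3 is actually satisfiable.
    context = "\n\n".join(f"[{c['section']}]\n{c['text']}" for c in chunks)
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system=SYSTEM_TEMPLATE.format(retrieved_chunks=context),
        messages=[{"role": "user", "content": question}],
    )
    return response.content[0].text
```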
What I'd Do Differently
If I started over today, I'd invest more time in evaluation before building. Create a test set of 50 question-answer pairs, measure retrieval recall and answer quality, then iterate on chunking and retrieval before touching the generation side.
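The harness doesn't need to be fancy. Here's a bare-bones recall@k sketch, assuming each test case records which chunk ids contain the answer and that `retrieve(question, k)` returns chunk ids in rank order; both are placeholders for your own pipeline.

```python
def recall_at_k(test_set: list[dict], retrieve, k: int = 5) -> float:
    hits = 0
    for case in test_set:
        retrieved = set(retrieve(case["question"], k))
        # A hit means at least one gold chunk made it into the top k.
        if retrieved & set(case["gold_chunk_ids"]):
            hits += 1
    return hits / len(test_set)

# e.g. recall_at_k(test_set, retrieve, k=5) before and after a chunking change
```

Run it before and after every chunking or retrieval change; answer-quality grading can come later.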
The generation model is the easy part. The pipeline around it is where the real engineering lives.