Building RAG Pipelines That Actually Work

The Promise vs. Reality

Every RAG tutorial makes it look effortless: chunk your documents, embed them, throw them into a vector database, and query with an LLM. Ship it. Done.

In practice, the gap between a demo and a production RAG system is enormous. After building several pipelines — from a university chatbot to a document generation system — here's what I've learned.

Chunking Strategy Is Everything

The single biggest lever you have is how you split your documents. Fixed-size chunks (512 tokens, 1000 characters) are the default everyone reaches for. They're also usually wrong.

What works better: split along the document's own structure (sections, paragraphs) rather than at arbitrary token counts, and attach metadata — source document, section title, date — to every chunk.

A chunk that says "The results showed a 40% improvement" is useless without knowing what was being measured. Metadata solves this.
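A minimal sketch of that idea — splitting on paragraph boundaries within each section and carrying the section context along. The function name and metadata fields are my own illustration, not from any particular library:

```python
def chunk_with_metadata(doc_title, sections, max_chars=1000):
    """Split each (section_title, text) pair into chunks, keeping
    the document and section context attached to every chunk."""
    chunks = []
    for section_title, text in sections:
        buf = ""
        # Split on paragraph boundaries first so chunks stay coherent.
        for para in text.split("\n\n"):
            if buf and len(buf) + len(para) > max_chars:
                chunks.append({"text": buf.strip(),
                               "doc": doc_title,
                               "section": section_title})
                buf = ""
            buf += para + "\n\n"
        if buf.strip():
            chunks.append({"text": buf.strip(),
                           "doc": doc_title,
                           "section": section_title})
    return chunks
```

Now the chunk about a 40% improvement arrives at the LLM labeled with the section it came from.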

Embedding Model Selection

Not all embeddings are created equal. For English technical content, I've found:

Tier 1: OpenAI text-embedding-3-large (best quality, costs money)
Tier 2: BGE-large-en-v1.5 (open source, surprisingly close)
Tier 3: all-MiniLM-L6-v2 (fast, good enough for prototypes)

The key insight: match your embedding model to your query style. If users ask formal questions, a model trained on Q&A pairs outperforms one trained on general text.
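Whichever tier you choose, the scoring underneath is the same: rank chunks by vector similarity, almost always cosine. A pure-Python sketch of that scoring (a real pipeline would use numpy or the vector store's own implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same direction score 1.0; orthogonal ones score 0.0.
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # → 1.0
```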

FAISS vs. ChromaDB

I've used both extensively:

FAISS is pure speed. It's a library, not a database — you manage persistence yourself. Perfect when you need sub-millisecond search on millions of vectors and you're comfortable with the operational overhead.

ChromaDB is developer experience. Built-in persistence, metadata filtering, and a clean Python API. Perfect for prototypes and applications under a million documents.

My rule: start with ChromaDB, migrate to FAISS when (if) you hit scale constraints.
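One way to keep that migration cheap is to hide the store behind a thin interface, so swapping backends touches one class instead of the whole pipeline. This is a sketch of the pattern, not either library's actual API; the in-memory store stands in for a ChromaDB- or FAISS-backed implementation with the same shape:

```python
from typing import Protocol

class VectorStore(Protocol):
    def add(self, doc_id: str, vector: list[float]) -> None: ...
    def search(self, query: list[float], k: int) -> list[str]: ...

class InMemoryStore:
    """Brute-force stand-in for a real backend; fine for tests and prototypes."""
    def __init__(self):
        self._items = []  # list of (doc_id, vector) pairs

    def add(self, doc_id, vector):
        self._items.append((doc_id, vector))

    def search(self, query, k=3):
        # Rank every stored vector by dot product with the query.
        def dot(a, b):
            return sum(x * y for x, y in zip(a, b))
        ranked = sorted(self._items, key=lambda it: dot(query, it[1]), reverse=True)
        return [doc_id for doc_id, _ in ranked[:k]]
```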

The Retrieval-Generation Gap

The hardest problem isn't retrieval — it's what happens between retrieval and generation. You retrieve 5 chunks, but some near-duplicate each other, some are only marginally relevant, and some are out of date.

Solutions that work:

  1. Re-ranking — use a cross-encoder to re-score retrieved chunks against the actual query
  2. Deduplication — cluster similar chunks and pick the best representative
  3. Freshness weighting — multiply relevance scores by a time-decay factor
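The third item is the simplest to implement. A sketch of exponential time decay; the 90-day half-life is an arbitrary tunable, not a recommendation:

```python
import math
import time

def freshness_weight(score, doc_timestamp, half_life_days=90.0, now=None):
    """Multiply a relevance score by an exponential time-decay factor.
    A document exactly one half-life old keeps 50% of its score."""
    now = now if now is not None else time.time()
    age_days = (now - doc_timestamp) / 86400.0
    decay = 0.5 ** (age_days / half_life_days)
    return score * decay
```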

Prompt Engineering for RAG

The system prompt matters more than you think. Here's a pattern that works well with Claude:

You are answering questions based on retrieved context.

Rules:
1. Only use information from the provided context
2. If the context doesn't contain the answer, say so
3. Cite which section your answer comes from
4. If sources conflict, acknowledge the contradiction

Context:
{retrieved_chunks}

The explicit instruction to acknowledge limitations prevents hallucination better than any amount of temperature tuning.
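Wiring retrieved chunks into that template is straightforward. Labeling each chunk with its section is my own convention (it's what rule 3 needs to work), not a Claude requirement:

```python
SYSTEM_PROMPT = """You are answering questions based on retrieved context.

Rules:
1. Only use information from the provided context
2. If the context doesn't contain the answer, say so
3. Cite which section your answer comes from
4. If sources conflict, acknowledge the contradiction

Context:
{retrieved_chunks}"""

def build_prompt(chunks):
    """chunks: list of dicts with 'section' and 'text' keys."""
    formatted = "\n\n".join(
        f"[{c['section']}]\n{c['text']}" for c in chunks
    )
    return SYSTEM_PROMPT.format(retrieved_chunks=formatted)
```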

What I'd Do Differently

If I started over today, I'd invest more time in evaluation before building. Create a test set of 50 question-answer pairs, measure retrieval recall and answer quality, then iterate on chunking and retrieval before touching the generation side.
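Retrieval recall over such a test set is a few lines. The `retrieve` callable here is a stand-in for your pipeline's search function, assumed to return a ranked list of chunk ids:

```python
def retrieval_recall(test_set, retrieve, k=5):
    """Fraction of questions for which at least one gold chunk id
    appears in the top-k retrieved results (recall@k)."""
    hits = 0
    for question, gold_ids in test_set:
        retrieved = retrieve(question)[:k]
        if any(g in retrieved for g in gold_ids):
            hits += 1
    return hits / len(test_set)
```

Run this after every chunking or retrieval change; if recall@5 drops, no amount of prompt engineering downstream will save the answer.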

The generation model is the easy part. The pipeline around it is where the real engineering lives.

#rag #vector-db #claude #faiss