
RAG Architecture Deep Dive: Building Context-Aware AI Applications

Chunking strategies, hybrid search, reranking, HyDE query transformation, and the RAGAS evaluation framework — the complete RAG engineering guide.

Mini Bhati · 14 min read

TLDR: RAG fails in retrieval, not generation. Chunking strategy, hybrid search with reciprocal rank fusion, reranking, and query transformation are where most pipelines go wrong. The RAGAS evaluation triad tells you whether retrieval is working before your users tell you it isn't.

I spent three weeks building a documentation chatbot that confidently gave users wrong answers. Not hallucination in the spooky sci-fi sense — just wrong facts, wrong version numbers, a feature that had been deprecated six months ago but the model had no idea. The model itself was fine. My retrieval was broken.

That's the thing about RAG — Retrieval-Augmented Generation — that nobody tells you upfront: the generation step is almost never your problem. It's the retrieval. And retrieval is a frontend engineering problem as much as a backend one, because the decisions you make upstream (how you chunk, what metadata you attach, how you query) determine what your users experience on the other end.

RAG is the most practical pattern in applied AI right now. It solves the core LLM limitation — the model doesn't know your data and has a knowledge cutoff — without the cost and complexity of fine-tuning. Here's how to build it properly.


Fine-tuning vs RAG: The honest answer

Every team asks this question. Here's a simple frame:

| | RAG | Fine-tuning |
| --- | --- | --- |
| Data freshness | Real-time | Stale at training time |
| Cost | Low (inference only) | High (GPU compute) |
| Iteration speed | Fast (update docs) | Slow (re-train) |
| Explainability | Can cite sources | Black box |
| Best for | Dynamic data, large corpora | Tone, style, domain vocabulary |

Fine-tune for how the model talks. RAG for what the model knows.

If you're building a chatbot over your product's documentation and that documentation changes every sprint — RAG, no question.


The pipeline, end to end

INDEXING (happens offline)
  Documents → chunk → embed → store in vector DB

RETRIEVAL (happens per request)
  User query → embed → vector search → top-k chunks

GENERATION (happens per request)
  [system prompt] + [retrieved chunks] + [user query] → LLM → answer

Simple on paper. Let's talk about where each step breaks.


Step 1: Indexing

Parsing — the boring part that kills you

Your documents aren't clean. They're PDFs with scanned tables, HTML with nav menus embedded in every page, Word docs with tracked changes, Confluence exports with garbled formatting. Parsing quality is the foundation everything else is built on.

I've spent entire mornings just on this step for a single document type. Watch out for: scanned PDFs (you need OCR, not just text extraction), tables (most parsers mangle them into unreadable row soup), and headers/footers that repeat on every page and pollute your embeddings with noise.
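
For the repeated header/footer problem, here is a minimal sketch of one approach. It assumes your parser hands you each page as a list of text lines; the function name and the 60% threshold are my own illustrative choices, not part of any library.

```python
from collections import Counter


def strip_repeated_lines(pages: list[list[str]], min_fraction: float = 0.6) -> list[list[str]]:
    """Drop lines that appear on at least `min_fraction` of pages (likely headers/footers)."""
    # With only one or two pages there is no reliable notion of "repeated".
    if len(pages) < 3:
        return pages

    # Count each distinct line once per page.
    page_counts = Counter()
    for page in pages:
        page_counts.update({line.strip() for line in page if line.strip()})

    threshold = min_fraction * len(pages)
    boilerplate = {line for line, count in page_counts.items() if count >= threshold}
    return [[line for line in page if line.strip() not in boilerplate] for page in pages]
```

Exact matching won't catch footers that change per page, such as "Page 1" and "Page 2". Those need a pattern-based pass on top.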

Chunking — where most pipelines fail

This is where I see teams go wrong most often.

Chunks too small: you lose the context that makes a passage make sense. A sentence about "the retry mechanism" without the surrounding paragraph explaining what is being retried is useless.

Chunks too large: they dilute relevance. You retrieve a 3,000-token block when the answer lives in 50 tokens of it, and the model drowns in irrelevant context trying to find it.

Fixed-size chunking (the baseline):

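from langchain.text_splitter import RecursiveCharacterTextSplitter

# chunk_size and chunk_overlap are measured in characters by default, not tokens.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    # Tried in order: paragraph breaks first, then lines, sentences, words.
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_documents(docs)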

The 64-character overlap matters more than you'd expect (RecursiveCharacterTextSplitter counts characters by default, not tokens). Without it, a chunk boundary that splits a sentence in half destroys the meaning on both sides.

Hierarchical / parent-child chunking (what I'd use in production):

Store parent chunks (full sections) and child chunks (individual paragraphs) separately. Retrieve by child — they're precise. Return the parent — it has full context.

Parent chunk: Full section on "Authentication"
  Child 1: "OAuth 2.0 setup"
  Child 2: "JWT validation"
  Child 3: "Session management"

Query matches Child 2 precisely. You inject the Parent into the prompt so the model has full context. This pattern alone cut my chatbot's hallucination rate significantly.
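
The "retrieve by child, return the parent" step can be sketched like this. It's a minimal illustration, assuming child hits arrive best-first from your vector search and that parents live in a plain dict keyed by id; `Child` and `expand_to_parents` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Child:
    id: str
    parent_id: str
    text: str


def expand_to_parents(matched_children: list[Child], parents: dict[str, str]) -> list[str]:
    """Map child hits (best first) to their parent sections, deduplicated, order preserved."""
    seen: set[str] = set()
    sections = []
    for child in matched_children:
        # Several children often share a parent; inject that parent only once.
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            sections.append(parents[child.parent_id])
    return sections
```

The deduplication matters: without it, two matching paragraphs from the same section would put that section into the prompt twice.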

Metadata — don't skip this

Every chunk needs metadata. Not for fun — for filtering.

{
    "text": "...",
    "metadata": {
        "source": "auth-docs.pdf",
        "page": 12,
        "section": "Authentication",
        "date": "2025-01-15",
        "doc_type": "technical_spec",
        "version": "v2.3"
    }
}

If a user asks about v2 of your API, you want to filter out v1 docs. If they're asking about a feature that launched after a certain date, you can filter by date. Metadata is what makes retrieval precise instead of just directionally correct.
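
Most vector databases apply filters like this server-side, each with its own syntax. To show just the logic, here is a sketch in plain Python over the metadata shape above; `matches_filter` and its parameters are illustrative names, not a real client API.

```python
from datetime import date
from typing import Optional


def matches_filter(
    metadata: dict,
    version_prefix: Optional[str] = None,
    published_on_or_after: Optional[date] = None,
) -> bool:
    """Return True if a chunk's metadata passes the requested version and date constraints."""
    if version_prefix is not None and not metadata.get("version", "").startswith(version_prefix):
        return False
    if published_on_or_after is not None:
        # Dates are assumed to be stored as ISO strings such as "2025-01-15".
        if date.fromisoformat(metadata["date"]) < published_on_or_after:
            return False
    return True
```

Pass the prefix with its trailing dot ("v2." rather than "v2"), otherwise a future "v20" would match too.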

Embedding model choice

Not all embedding models are equal, and the difference shows in production.

| Model | Dims | Quality | Cost |
| --- | --- | --- | --- |
| text-embedding-3-small | 1536 | Good | Low |
| text-embedding-3-large | 3072 | Better | Medium |
| Cohere embed-v3 | 1024 | Better (retrieval-tuned) | Medium |
| BGE-M3 (local) | 1024 | Excellent | Free |

Test on your actual data. I've seen text-embedding-3-small beat 3-large on domain-specific retrieval because the smaller model's representations happened to cluster better for that particular vocabulary. Never assume.
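
"Test on your actual data" can be as small as a recall@k check over a handful of labeled queries. This is a rough sketch: `embed` is whatever function wraps the model you're evaluating, and the labeled (query, relevant chunk id) pairs are something you'd have to write by hand for your own corpus.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(
    embed: Callable[[str], Sequence[float]],
    chunks: dict[str, str],
    labeled_queries: list[tuple[str, str]],
    k: int = 5,
) -> float:
    """Fraction of queries whose known-relevant chunk id lands in the top k by cosine similarity."""
    chunk_vectors = {chunk_id: embed(text) for chunk_id, text in chunks.items()}
    hits = 0
    for query, relevant_id in labeled_queries:
        query_vector = embed(query)
        ranked = sorted(
            chunk_vectors,
            key=lambda chunk_id: cosine(query_vector, chunk_vectors[chunk_id]),
            reverse=True,
        )
        hits += relevant_id in ranked[:k]
    return hits / len(labeled_queries)
```

Run it once per candidate model with the same chunks and queries, then compare the numbers.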


Step 2: Retrieval

Vector search is not enough

Pure vector search works great for semantic queries. It completely falls apart for exact matches — product codes, function names, technical error strings. If a user asks about ERR_INVALID_TOKEN_0x43F, a semantic search will retrieve vaguely related content about authentication. You need exact term matching too.
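
To make "exact term matching" concrete, here is a small, unoptimized Okapi BM25 scorer in plain Python. It's a sketch for intuition only: in production you'd use your search engine's built-in BM25, and the tokenizer and the `k1`/`b` defaults here are my own assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Keep identifiers like ERR_INVALID_TOKEN_0x43F intact as single tokens.
    return re.findall(r"[A-Za-z0-9_]+", text.lower())


def bm25_rank(query: str, docs: dict[str, str], k1: float = 1.5, b: float = 0.75) -> list[tuple[str, float]]:
    """Score every document against the query with Okapi BM25; return (doc_id, score), best first."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = (sum(len(toks) for toks in tokenized.values()) / n_docs) or 1.0
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized.values() for term in set(toks))
    query_terms = set(tokenize(query))

    results = []
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        results.append((doc_id, score))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

A rare token such as an error code appears in very few documents, so its IDF is high and the one chunk that contains it goes straight to the top.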

Hybrid search — what I actually ship:

def hybrid_search(query: str, k: int = 10) -> list[Chunk]:
    # Over-fetch from each retriever so the fused list has enough candidates.
    vector_results = vector_db.search(embed(query), k=k * 2)
    bm25_results = bm25_index.search(query, k=k * 2)

    # Reciprocal Rank Fusion: combine both ranked lists, then keep the top k.
    return rrf_merge(vector_results, bm25_results)[:k]

Reciprocal Rank Fusion is the merge strategy that works. A document that appears in both ranked lists gets boosted:

def rrf_merge(list1: list[Chunk], list2: list[Chunk], rrf_k: int = 60) -> list[Chunk]:
    # rrf_k is the RRF smoothing constant (60 is the usual default),
    # not the number of results to return.
    scores: dict[str, float] = {}
    docs: dict[str, Chunk] = {}
    for results in (list1, list2):
        for rank, doc in enumerate(results):
            scores[doc.id] = scores.get(doc.id, 0.0) + 1 / (rrf_k + rank + 1)
            docs[doc.id] = doc
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked_ids]
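
A small worked example shows why a document found by both retrievers wins. It's self-contained, with made-up document ids, and written to accept any number of ranked lists.

```python
def rrf_scores(ranked_lists: list[list[str]], rrf_k: int = 60) -> dict[str, float]:
    """Sum 1 / (rrf_k + rank) over every list a document id appears in (ranks start at 1)."""
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (rrf_k + rank)
    return scores


# Hypothetical results: vector search ranks "error-codes" second, BM25 ranks it first.
vector_hits = ["auth-overview", "error-codes", "sessions"]
bm25_hits = ["error-codes", "changelog"]

scores = rrf_scores([vector_hits, bm25_hits])
best = max(scores, key=scores.get)
# "error-codes" scores 1/62 + 1/61, beating "auth-overview" at 1/61 alone.
```

Because only ranks are used, you never have to reconcile cosine similarities with BM25 scores, which live on unrelated scales.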

Reranking — the step teams skip

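After retrieval, you have 20-50 candidate chunks ranked by vector similarity or fused rank. Similarity is a decent proxy for relevance, but it isn't the same thing: the query and each chunk were embedded independently, so the score never sees them together. A cross-encoder reranker reads each (query, chunk) pair jointly and scores it directly: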

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[Chunk], top_k: int = 5) -> list[Chunk]:
    pairs = [(query, chunk.text) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

The two-stage pattern: vector search for recall (get 20-50 candidates fast), reranker for precision (pick the best 5-10 from those). This is materially better than single-stage retrieval.

Query transformation — your query is usually bad

Users don't write retrieval-optimized queries. "How do I set this up?" is a terrible vector search query. "What does error code 403 mean?" will retrieve everything about 403s, not the specific context the user needs.

HyDE (Hypothetical Document Embeddings): Instead of embedding the question, use the LLM to generate a hypothetical perfect answer, then embed that for retrieval. You're searching in the answer space, not the question space.
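
A minimal sketch of HyDE, with the LLM call, the embedder, and the vector search passed in as plain callables so nothing here depends on a specific vendor. The function name, the parameter names, and the prompt wording are all assumptions of mine.

```python
from typing import Callable, Sequence

Vector = Sequence[float]


def hyde_search(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Vector],
    search: Callable[[Vector, int], list],
    k: int = 10,
) -> list:
    """Retrieve with the embedding of an LLM-drafted answer instead of the raw question."""
    prompt = f"Write a short documentation passage that answers this question:\n{query}"
    hypothetical_answer = generate(prompt)
    # The draft may contain wrong facts. It is only a search probe and is never shown to the user.
    return search(embed(hypothetical_answer), k)
```

The trade-off is one extra LLM call per query, so it's worth measuring whether the retrieval gain justifies the added latency.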

Query expansion: Generate 3 different phrasings of the user's question and retrieve for all of them, then merge. Catches vocabulary mismatches between user language and documentation language.
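
Query expansion follows the same shape. In this sketch, `rephrase` stands in for an LLM call that returns alternative phrasings and `retrieve` for your existing search; both are hypothetical callables you'd supply.

```python
from typing import Callable


def expanded_search(
    query: str,
    rephrase: Callable[[str, int], list[str]],
    retrieve: Callable[[str], list[str]],
    n_variants: int = 3,
) -> list[str]:
    """Retrieve for the original query plus its rephrasings; merge chunk ids, first occurrence wins."""
    variants = [query] + rephrase(query, n_variants)
    merged: list[str] = []
    seen: set[str] = set()
    for variant in variants:
        for chunk_id in retrieve(variant):
            if chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged
```

First-occurrence merging is the simplest option. Feeding the per-variant lists into Reciprocal Rank Fusion instead would reward chunks that several phrasings agree on.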


Step 3: Generation

Structure your prompt so the model knows exactly what to do with the context:

System: You are a technical support assistant. Answer questions using ONLY
        the provided documentation. If the answer isn't in the docs, say
        "I don't have that information in the current documentation."
        Always cite the document source for any claim you make.

Retrieved context:
<doc id="1" source="api-reference.md" relevance="0.92">
{{ chunk_1 }}
</doc>

<doc id="2" source="quickstart.md" relevance="0.87">
{{ chunk_2 }}
</doc>

User question: {{ query }}

Answer (cite [doc id] for each claim):

The explicit citation instruction does two things: it makes hallucination auditable (if there's no citation, be suspicious), and it forces the model to ground its answer in the retrieved docs.
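
Rendering that template in code is mostly string assembly. A sketch, assuming each retrieved chunk carries a source name and a relevance score; the `RetrievedChunk` type and `build_prompt` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    source: str
    text: str
    relevance: float


SYSTEM_PROMPT = (
    "You are a technical support assistant. Answer questions using ONLY the provided "
    "documentation. If the answer isn't in the docs, say \"I don't have that information "
    "in the current documentation.\" Always cite the document source for any claim you make."
)


def build_prompt(query: str, chunks: list[RetrievedChunk]) -> str:
    """Render retrieved chunks into numbered <doc> blocks followed by the user question."""
    docs = []
    for doc_id, chunk in enumerate(chunks, start=1):
        docs.append(
            f'<doc id="{doc_id}" source="{chunk.source}" relevance="{chunk.relevance:.2f}">\n'
            f"{chunk.text}\n"
            "</doc>"
        )
    context = "\n\n".join(docs)
    return (
        f"System: {SYSTEM_PROMPT}\n\n"
        f"Retrieved context:\n{context}\n\n"
        f"User question: {query}\n\n"
        "Answer (cite [doc id] for each claim):"
    )
```

Numbering the docs at render time keeps the ids short and stable within one request, which makes the citations in the answer easy to map back to sources.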

The "lost in the middle" problem: Models attend more strongly to content at the beginning and end of long contexts. The most relevant chunk buried in the middle of 50k tokens of context will underperform. Put your highest-relevance chunks at positions 0 and -1 in the context block.
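
One way to get the best chunks to the edges, sketched here under the assumption that your chunks arrive sorted best-first, is to alternate them between the front and the back of the context:

```python
def reorder_for_long_context(chunks_by_relevance: list) -> list:
    """Place the strongest chunks at the start and end, leaving the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Even positions fill the front; odd positions fill the back (reversed below).
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

With five chunks ranked a through e, this yields a, c, e, d, b: the best chunk is first, the second best is last, and the weakest sits in the middle.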


What breaks in production

Stale index. If your documentation updates and your index doesn't, users get confident wrong answers. Build incremental indexing — only re-embed documents that have actually changed:

def should_reindex(doc: Document, existing: IndexedDoc) -> bool:
    # Compare content hashes, not timestamps: a re-save that bumps updated_at
    # without changing the text shouldn't trigger a re-embed.
    return doc.hash != existing.hash
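
The hash itself can come from the standard library. This sketch computes a content fingerprint and plans an incremental run, assuming you persist a doc id to hash mapping alongside the index; the function names are hypothetical, and collapsing whitespace before hashing is a choice you may not want for whitespace-sensitive content like code.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's text; whitespace-only edits don't change it."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def plan_reindex(current_docs: dict[str, str], indexed_hashes: dict[str, str]) -> dict[str, list[str]]:
    """Compare current documents against stored hashes and decide what to embed or delete."""
    to_embed = [
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return {"embed": to_embed, "delete": to_delete}
```

Don't forget the delete list. Chunks from removed pages are exactly how a deprecated feature keeps showing up in answers.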

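Embedding version lock. Vectors produced by different embedding models live in different vector spaces, so when you upgrade your embedding model, every vector already in your index is incompatible with the new one. You won't necessarily get a warning. If the dimensions differ, queries fail outright; if they happen to match, search keeps running and quietly returns near-random neighbors. If you upgrade the embedding model, you re-embed the entire corpus. Plan for it.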
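
A cheap guard is to record which model built the index and refuse to query it with anything else. A sketch, with hypothetical names throughout:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpec:
    model: str
    dimensions: int


class EmbeddingMismatchError(RuntimeError):
    pass


def check_compatible(index_spec: EmbeddingSpec, query_spec: EmbeddingSpec) -> None:
    """Raise before searching if the query embedder differs from the one that built the index."""
    if index_spec != query_spec:
        raise EmbeddingMismatchError(
            f"index was built with {index_spec}, but queries use {query_spec}; "
            "re-embed the corpus into a new index before switching models"
        )
```

Store the spec as index-level metadata and run the check at service startup, so a mismatched deploy fails loudly before it serves a single query.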

Multi-tenancy. If you're serving multiple customers from one system, always filter by tenant at the retrieval step. A metadata filter is the minimum viable isolation:

results = vector_db.search(
    query_vector=embed(query),
    filter={"tenant_id": current_user.tenant_id},
    k=10
)
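
Relying on every call site to remember the filter is fragile. One option, sketched here with a hypothetical wrapper around whatever search function your vector DB client exposes, is to make the tenant filter impossible to omit or override:

```python
from typing import Any, Callable, Optional


class TenantScopedSearch:
    """Wraps a search function so every query carries a tenant filter the caller cannot override."""

    def __init__(self, search_fn: Callable[..., list], tenant_id: str):
        if not tenant_id:
            raise ValueError("tenant_id is required")
        self._search_fn = search_fn
        self._tenant_id = tenant_id

    def search(
        self,
        query_vector: list[float],
        k: int = 10,
        filter: Optional[dict[str, Any]] = None,
    ) -> list:
        # Caller-supplied filters are applied first so the bound tenant_id always wins.
        scoped = {**(filter or {}), "tenant_id": self._tenant_id}
        return self._search_fn(query_vector=query_vector, filter=scoped, k=k)
```

Construct one per request from the authenticated user, and hand application code only the wrapper, never the raw client.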

Evaluation — don't skip this

The RAG triad is what I use to measure system quality:

| Dimension | What it checks |
| --- | --- |
| Faithfulness | Is the answer grounded in retrieved docs? |
| Answer relevance | Does the answer address the question? |
| Context relevance | Were the right docs retrieved? |

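RAGAS automates this with LLM-as-judge scoring: its faithfulness and answer relevancy metrics cover the first two rows, and its context precision and context recall metrics cover the third. Run it on a test set of 50-100 queries before shipping and add it to CI. Teams that skip evals don't find out their RAG is quietly failing until users complain.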
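
Adding evals to CI mostly means turning scores into a pass/fail signal. This sketch is independent of how the scores are produced: it assumes your eval run yields a dict of metric name to score, and the metric keys and thresholds are placeholders you'd pick yourself.

```python
def failing_metrics(
    scores: dict[str, float],
    thresholds: dict[str, float],
) -> dict[str, tuple[float, float]]:
    """Return {metric: (score, threshold)} for every metric that is missing or below its threshold."""
    failures = {}
    for metric, minimum in thresholds.items():
        # A metric that wasn't computed counts as a failure, not a silent pass.
        score = scores.get(metric, 0.0)
        if score < minimum:
            failures[metric] = (score, minimum)
    return failures


# Placeholder thresholds for illustration only; tune them against your own baseline.
THRESHOLDS = {"faithfulness": 0.90, "answer_relevance": 0.80, "context_relevance": 0.75}
```

In the CI job, exit non-zero whenever `failing_metrics(...)` returns anything, and print the failures so the regression is obvious from the build log.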


RAG systems fail silently. A chatbot that returns wrong information with high confidence is worse than one that says "I don't know" — users trust the confident wrong answer. The difference between a trustworthy RAG system and a liability is almost always in the retrieval layer, not the model. Invest in chunking strategy, ship hybrid search, and measure with evals before you let users near it.
