RAG Architecture Deep Dive: Building Context-Aware AI Applications

TLDR: RAG fails in retrieval, not generation. Chunking strategy, hybrid search with reciprocal rank fusion, reranking, and query transformation are where most pipelines go wrong. The RAGAS evaluation triad tells you whether retrieval is working before your users tell you it isn't.

I spent three weeks building a documentation chatbot that confidently gave users wrong answers. Not hallucination in the spooky sci-fi sense, just wrong facts, wrong version numbers, and a feature deprecated six months earlier that the model knew nothing about. The model itself was fine. My retrieval was broken.

That's the thing about RAG — Retrieval-Augmented Generation — that nobody tells you upfront: the generation step is almost never your problem. It's the retrieval. And retrieval is a frontend engineering problem as much as a backend one, because the decisions you make upstream (how you chunk, what metadata you attach, how you query) determine what your users experience on the other end.

RAG is the most practical pattern in applied AI right now. It solves the core LLM limitation — the model doesn't know your data and has a knowledge cutoff — without the cost and complexity of fine-tuning. Here's how to build it properly.

Fine-tuning vs RAG: The honest answer

Every team asks this question. Here's a simple frame:

	RAG	Fine-tuning
Data freshness	Real-time	Stale at training time
Cost	Low — inference only	High — GPU compute
Iteration speed	Fast — update docs	Slow — re-train
Explainability	Can cite sources	Black box
Best for	Dynamic data, large corpora	Tone, style, domain vocabulary

Fine-tune for how the model talks. RAG for what the model knows.

If you're building a chatbot over your product's documentation and that documentation changes every sprint — RAG, no question.

The pipeline, end to end

INDEXING (happens offline)
  Documents → chunk → embed → store in vector DB

RETRIEVAL (happens per request)
  User query → embed → vector search → top-k chunks

GENERATION (happens per request)
  [system prompt] + [retrieved chunks] + [user query] → LLM → answer

Simple on paper. Let's talk about where each step breaks.

Step 1: Indexing

Parsing — the boring part that kills you

Your documents aren't clean. They're PDFs with scanned tables, HTML with nav menus embedded in every page, Word docs with tracked changes, Confluence exports with garbled formatting. Parsing quality is the foundation everything else is built on.

I've spent entire mornings just on this step for a single document type. Watch out for: scanned PDFs (you need OCR, not just text extraction), tables (most parsers mangle them into unreadable row soup), and headers/footers that repeat on every page and pollute your embeddings with noise.
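
One way to deal with the repeating header/footer problem is to drop any line that shows up on most pages. This is a minimal sketch, assuming you already have the text of each page as a string; `strip_repeated_lines` and the `min_fraction` threshold are hypothetical names and values, not part of any parser library.

```python
from collections import Counter


def strip_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Drop lines that recur on at least `min_fraction` of pages (likely headers/footers)."""
    page_lines = [[line.strip() for line in page.splitlines()] for page in pages]
    # Count each distinct non-empty line once per page.
    counts = Counter(line for lines in page_lines for line in set(lines) if line)
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {line for line, n in counts.items() if n >= threshold}
    return ["\n".join(l for l in lines if l not in repeated) for lines in page_lines]


pages = [
    "ACME API Guide\nOAuth 2.0 setup steps.\nConfidential",
    "ACME API Guide\nJWT validation rules.\nConfidential",
    "ACME API Guide\nSession management notes.\nConfidential",
]
cleaned = strip_repeated_lines(pages)
```

Exact-match counting won't catch footers that vary per page, such as "Page 3 of 40"; those would need a pattern-based pass on top.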

Chunking — where most pipelines fail

This is it. This is where I see teams get the most wrong.

Chunks too small: you lose the context that makes a passage make sense. A sentence about "the retry mechanism" without the surrounding paragraph explaining what is being retried is useless.

Chunks too large: they dilute relevance. You retrieve a 3,000-token block when the answer lives in 50 tokens of it, and the model drowns in irrelevant context trying to find it.

Fixed-size chunking (the baseline):

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ". ", " "],
)
chunks = splitter.split_documents(docs)

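The overlap matters more than you'd expect. Without it, a chunk boundary that splits a sentence in half destroys the meaning on both sides. One caveat: RecursiveCharacterTextSplitter measures length in characters by default, so chunk_size=512 and chunk_overlap=64 above are characters, not tokens. Pass a token-based length function if you want token budgets.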

Hierarchical / parent-child chunking (what I'd use in production):

Store parent chunks (full sections) and child chunks (individual paragraphs) separately. Retrieve by child — they're precise. Return the parent — it has full context.

Parent chunk: Full section on "Authentication"
  Child 1: "OAuth 2.0 setup"
  Child 2: "JWT validation"
  Child 3: "Session management"

Query matches Child 2 precisely. You inject the Parent into the prompt so the model has full context. This pattern alone cut my chatbot's hallucination rate significantly.
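
The retrieve-by-child, return-the-parent step can be sketched as a simple lookup. This assumes each child stores its parent's ID; the `Child` class, `expand_to_parents`, and the sample data are hypothetical, and the child matching itself (the vector search) is left out.

```python
from dataclasses import dataclass


@dataclass
class Child:
    id: str
    parent_id: str
    text: str


# Parents are stored separately and keyed by ID.
parents = {
    "auth": "Full section on Authentication: OAuth 2.0 setup, JWT validation, session management.",
}
children = [
    Child("c1", "auth", "OAuth 2.0 setup"),
    Child("c2", "auth", "JWT validation"),
    Child("c3", "auth", "Session management"),
]


def expand_to_parents(matched: list[Child], parents: dict[str, str]) -> list[str]:
    """Map matched children to parent texts, deduplicated, keeping the children's rank order."""
    seen: set[str] = set()
    context = []
    for child in matched:
        if child.parent_id not in seen:
            seen.add(child.parent_id)
            context.append(parents[child.parent_id])
    return context


# Two children of the same section match; the parent is injected only once.
context = expand_to_parents([children[1], children[0]], parents)
```

Deduplicating matters here: several children of one section often match the same query, and injecting the parent more than once just wastes context.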

Metadata — don't skip this

Every chunk needs metadata. Not for fun — for filtering.

{
    "text": "...",
    "metadata": {
        "source": "auth-docs.pdf",
        "page": 12,
        "section": "Authentication",
        "date": "2025-01-15",
        "doc_type": "technical_spec",
        "version": "v2.3"
    }
}

If a user asks about v2 of your API, you want to filter out v1 docs. If they're asking about a feature that launched after a certain date, you can filter by date. Metadata is what makes retrieval precise instead of just directionally correct.
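
Here is a rough sketch of that filtering logic, using the metadata shape from the example above. `major_version` and `keep_chunk` are hypothetical helpers, and the code assumes versions look like "v2.3" and dates are ISO strings.

```python
from datetime import date
from typing import Optional


def major_version(version: str) -> int:
    """Turn 'v2.3' into 2. Assumes the 'v<major>.<minor>' format."""
    return int(version.lstrip("v").split(".")[0])


def keep_chunk(metadata: dict, major: Optional[int] = None, since: Optional[date] = None) -> bool:
    """Return True if the chunk passes every filter that was supplied."""
    if major is not None and major_version(metadata["version"]) != major:
        return False
    if since is not None and date.fromisoformat(metadata["date"]) < since:
        return False
    return True


chunks = [
    {"text": "v1 auth flow", "metadata": {"version": "v1.9", "date": "2023-06-01"}},
    {"text": "v2 auth flow", "metadata": {"version": "v2.3", "date": "2025-01-15"}},
]
v2_only = [c for c in chunks if keep_chunk(c["metadata"], major=2)]
```

In a real system you would usually push these conditions into the vector database's own filter so they apply before the top-k cut, rather than filtering afterwards in Python.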

Embedding model choice

Not all embedding models are equal, and the difference shows in production.

Model	Dims	Quality	Cost
text-embedding-3-small	1536	Good	Low
text-embedding-3-large	3072	Better	Medium
Cohere embed-v3	1024	Better (retrieval-tuned)	Medium
BGE-M3 (local)	1024	Excellent	Free

Test on your actual data. I've seen text-embedding-3-small beat 3-large on domain-specific retrieval because the smaller model's representations happened to cluster better for that particular vocabulary. Never assume.
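
"Test on your actual data" can be as simple as measuring recall@k over a handful of labeled queries. The harness below is a sketch under assumptions: `recall_at_k` is a hypothetical helper, and `toy_embed` is a bag-of-words stand-in you would swap for a call to each embedding model you're comparing.

```python
import math
from typing import Callable

Vector = list[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(
    embed: Callable[[str], Vector],
    corpus: dict[str, str],
    labeled: list[tuple[str, str]],  # (query, ID of the doc that should be retrieved)
    k: int = 3,
) -> float:
    """Fraction of queries whose expected document appears in the top k."""
    doc_vecs = {doc_id: embed(text) for doc_id, text in corpus.items()}
    hits = 0
    for query, expected_id in labeled:
        q = embed(query)
        ranked = sorted(doc_vecs, key=lambda d: cosine(q, doc_vecs[d]), reverse=True)
        hits += expected_id in ranked[:k]
    return hits / len(labeled)


# Stand-in embedder: counts of a tiny fixed vocabulary. Replace with a real model.
VOCAB = ["oauth", "jwt", "session"]


def toy_embed(text: str) -> Vector:
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]


corpus = {"a": "oauth setup", "b": "jwt validation", "c": "session management"}
labeled = [("how to validate jwt", "b"), ("oauth", "a")]
score = recall_at_k(toy_embed, corpus, labeled, k=1)
```

Run the same labeled set through each candidate model and compare the numbers; the labeled queries should come from real user questions, not ones you wrote while looking at the docs.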

Step 2: Retrieval

Vector search is not enough

Pure vector search works great for semantic queries. It completely falls apart for exact matches — product codes, function names, technical error strings. If a user asks about ERR_INVALID_TOKEN_0x43F, a semantic search will retrieve vaguely related content about authentication. You need exact term matching too.

Hybrid search — what I actually ship:

def hybrid_search(query: str, k: int = 10) -> list[Chunk]:
    vector_results = vector_db.search(embed(query), k=k*2)
    bm25_results = bm25_index.search(query, k=k*2)

    # Reciprocal Rank Fusion: combine both ranked lists, then keep the top k.
    # rrf_merge's own k is the RRF smoothing constant, not a result count.
    return rrf_merge(vector_results, bm25_results)[:k]

Reciprocal Rank Fusion is the merge strategy that works. A document that appears in both ranked lists gets boosted:

def rrf_merge(list1, list2, k=60) -> list:
    # k is the RRF smoothing constant (60 is the common default)
    scores = {}
    docs = {}
    for results in (list1, list2):
        for rank, doc in enumerate(results):
            docs[doc.id] = doc
            scores[doc.id] = scores.get(doc.id, 0) + 1 / (k + rank + 1)
    # Return the documents themselves, best fused score first
    ranked_ids = sorted(scores, key=scores.get, reverse=True)
    return [docs[doc_id] for doc_id in ranked_ids]
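
To see the boost in action, here is a tiny worked example on plain document IDs. `rrf_ids` is a hypothetical, self-contained variant of the same formula, written only for illustration.

```python
def rrf_ids(*ranked_lists: list[str], k: int = 60) -> list[str]:
    """Fuse ranked lists of IDs with Reciprocal Rank Fusion; best fused score first."""
    scores: dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


vector_hits = ["a", "b", "c"]  # "a" is the top vector match
bm25_hits = ["b", "d"]         # "b" is the top keyword match

# "b" scores 1/62 + 1/61, which beats "a" at 1/61 alone.
fused = rrf_ids(vector_hits, bm25_hits)
```

"b" is only second in the vector list, but because it also tops the keyword list it ends up first overall. RRF only looks at ranks, so you never have to normalize BM25 scores against cosine similarities.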

Reranking — the step teams skip

After retrieval, you have 20-50 candidate chunks ranked by similarity or fused rank. That ordering is a decent proxy for relevance, but it isn't precise. A cross-encoder reranker scores each (query, chunk) pair directly:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[Chunk], top_k: int = 5) -> list[Chunk]:
    pairs = [(query, chunk.text) for chunk in chunks]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]

The two-stage pattern: vector search for recall (get 20-50 candidates fast), reranker for precision (pick the best 5-10 from those). This is materially better than single-stage retrieval.

Query transformation — your query is usually bad

Users don't write retrieval-optimized queries. "How do I set this up?" is a terrible vector search query. "What does error code 403 mean?" will retrieve everything about 403s, not the specific context the user needs.

HyDE (Hypothetical Document Embeddings): Instead of embedding the question, use the LLM to generate a hypothetical perfect answer, then embed that for retrieval. You're searching in the answer space, not the question space.

Query expansion: Generate 3 different phrasings of the user's question and retrieve for all of them, then merge. Catches vocabulary mismatches between user language and documentation language.
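
Both transformations are thin wrappers around pieces you already have. The sketch below takes the LLM and retriever as plain callables, so `generate`, `embed`, `rephrase`, and `retrieve` are all hypothetical stand-ins for your own functions, as is the prompt wording.

```python
from typing import Callable


def hyde_vector(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], list[float]],
) -> list[float]:
    """HyDE: embed a hypothetical answer instead of the raw question."""
    hypothetical = generate(f"Write a short documentation passage that answers: {query}")
    return embed(hypothetical)


def expanded_retrieve(
    query: str,
    rephrase: Callable[[str], list[str]],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> list[str]:
    """Retrieve for the query and its rephrasings, then merge by each doc's best rank."""
    best_rank: dict[str, int] = {}
    for variant in [query] + rephrase(query):
        for rank, doc_id in enumerate(retrieve(variant)):
            if doc_id not in best_rank or rank < best_rank[doc_id]:
                best_rank[doc_id] = rank
    # Ties keep first-seen order because sorted() is stable.
    return sorted(best_rank, key=best_rank.get)[:k]


# Stubs standing in for an LLM and a retriever.
fake_index = {
    "how do I log in": ["d1", "d2"],
    "authentication setup": ["d3", "d1"],
}
merged = expanded_retrieve(
    "how do I log in",
    rephrase=lambda q: ["authentication setup"],
    retrieve=lambda q: fake_index.get(q, []),
)
```

Merging by best rank is the simplest option; Reciprocal Rank Fusion across the variants would work here too. Either way, every extra variant costs another retrieval call, and HyDE costs an LLM call before retrieval even starts.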

Step 3: Generation

Structure your prompt so the model knows exactly what to do with the context:

System: You are a technical support assistant. Answer questions using ONLY
        the provided documentation. If the answer isn't in the docs, say
        "I don't have that information in the current documentation."
        Always cite the document source for any claim you make.

Retrieved context:
<doc id="1" source="api-reference.md" relevance="0.92">
{{ chunk_1 }}
</doc>

<doc id="2" source="quickstart.md" relevance="0.87">
{{ chunk_2 }}
</doc>

User question: {{ query }}

Answer (cite [doc id] for each claim):

The explicit citation instruction does two things: it makes hallucination auditable (if there's no citation, be suspicious), and it forces the model to ground its answer in the retrieved docs.
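
The auditing part is easy to automate. This is a minimal sketch that assumes the model cites sources as `[1]` or `[doc 1]`; that format, the `audit_citations` helper, and the naive sentence splitting are all assumptions you'd adapt to your own prompt.

```python
import re

CITATION = r"\[(?:doc\s*)?(\d+)\]"


def audit_citations(answer: str, retrieved_ids: set[str]) -> dict:
    """Flag citations to docs that were never retrieved and sentences with no citation."""
    cited = set(re.findall(CITATION, answer))
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", answer.strip()) if s]
    uncited = [s for s in sentences if not re.search(CITATION, s)]
    return {
        "unknown_ids": sorted(cited - retrieved_ids),
        "uncited_sentences": uncited,
    }


answer = (
    "Tokens expire after 1 hour [1]. "
    "Refresh tokens last 30 days [3]. "
    "Rotate keys monthly."
)
report = audit_citations(answer, retrieved_ids={"1", "2"})
```

A citation to a doc ID that was never in the context is a strong red flag. An uncited sentence is a weaker one, since it may just be connective text, so treat that list as something to sample and review rather than to block on.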

The "lost in the middle" problem: Models attend more strongly to content at the beginning and end of long contexts. The most relevant chunk buried in the middle of 50k tokens of context will underperform. Put your highest-relevance chunks at positions 0 and -1 in the context block.
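
One simple way to get that ordering is to deal the chunks out alternately to the front and the back. This sketch assumes the input list is already sorted from most to least relevant; `order_for_context` is a hypothetical helper name.

```python
def order_for_context(chunks_by_relevance: list[str]) -> list[str]:
    """Place the most relevant chunks at the edges and the least relevant in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Even ranks fill from the start, odd ranks fill from the end.
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


ranked = ["r1", "r2", "r3", "r4", "r5"]  # r1 is the most relevant
ordered = order_for_context(ranked)
```

With five chunks, the best one lands at position 0, the second best at position -1, and the weakest sits in the middle where it does the least harm.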

What breaks in production

Stale index. If your documentation updates and your index doesn't, users get confident wrong answers. Build incremental indexing — only re-embed documents that have actually changed:

def should_reindex(doc: Document, existing: IndexedDoc) -> bool:
    # A newer timestamp alone doesn't mean the content changed, so compare content hashes
    return doc.hash != existing.hash
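
For the hash itself, a content digest from the standard library is enough. The sketch below assumes documents are available as plain text keyed by ID; `content_hash`, `docs_to_reindex`, and `docs_to_delete` are hypothetical helpers, and collapsing whitespace before hashing is a design choice rather than a requirement.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 of the text with whitespace collapsed, so reformatting alone doesn't trigger a re-embed."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def docs_to_reindex(current: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """IDs of documents that are new or whose content hash changed."""
    return [
        doc_id
        for doc_id, text in current.items()
        if indexed_hashes.get(doc_id) != content_hash(text)
    ]


def docs_to_delete(current: dict[str, str], indexed_hashes: dict[str, str]) -> list[str]:
    """IDs still in the index whose source document no longer exists."""
    return [doc_id for doc_id in indexed_hashes if doc_id not in current]


current = {"a": "Install with pip.", "b": "Use  OAuth  2.0."}
indexed = {
    "a": content_hash("Install with pip."),
    "b": content_hash("Use OAuth 1.0."),
    "c": content_hash("Removed page."),
}
changed = docs_to_reindex(current, indexed)
removed = docs_to_delete(current, indexed)
```

Deletions are easy to forget: a page removed from the docs keeps answering questions until its chunks are removed from the index too.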

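Embedding version lock. When you upgrade your embedding model, every vector in your index is incompatible with the new model. Vectors from different models live in different spaces, so similarity scores between a new query vector and old document vectors are meaningless. If the dimensions happen to match you won't even get an error, just quietly irrelevant results. If you upgrade the embedding model, you re-embed the entire corpus. Plan for it.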

Multi-tenancy. If you're serving multiple customers from one system, always filter by tenant at the retrieval step. A metadata filter is the minimum viable isolation:

results = vector_db.search(
    query_vector=embed(query),
    filter={"tenant_id": current_user.tenant_id},
    k=10
)
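
Since a forgotten filter is a data leak, it can be worth adding a second check on the way out. This is a sketch of that idea, assuming each result is a dict carrying `metadata.tenant_id`; `enforce_tenant` is a hypothetical helper, not a feature of any vector database.

```python
def enforce_tenant(results: list[dict], tenant_id: str) -> list[dict]:
    """Raise if any result belongs to another tenant; otherwise return the results unchanged."""
    leaked = [r for r in results if r.get("metadata", {}).get("tenant_id") != tenant_id]
    if leaked:
        raise PermissionError(f"{len(leaked)} result(s) outside tenant {tenant_id!r}")
    return results


results = [
    {"text": "Acme billing FAQ", "metadata": {"tenant_id": "acme"}},
    {"text": "Acme SSO guide", "metadata": {"tenant_id": "acme"}},
]
safe = enforce_tenant(results, "acme")
```

Failing loudly is deliberate. Silently dropping the stray results would hide the fact that the retrieval filter is broken.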

Evaluation — don't skip this

The RAG triad is what I use to measure system quality:

Dimension	What it checks
Faithfulness	Is the answer grounded in retrieved docs?
Answer relevance	Does the answer address the question?
Context relevance	Were the right docs retrieved?

RAGAS automates this with LLM-as-judge scoring. Run it on a test set of 50-100 queries before shipping and add it to CI. The teams who skip evals don't know their RAG is quietly failing until users complain.
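
For the CI part, the gate can be a few lines once you have per-query scores. The sketch below assumes your eval run produces one dict per test query with a 0-1 score for each dimension; the threshold values and the `failing_dimensions` helper are hypothetical, and how the scores are produced is left to your eval tool.

```python
from statistics import mean

# Hypothetical minimums; tune them against your own baseline.
THRESHOLDS = {"faithfulness": 0.8, "answer_relevance": 0.8, "context_relevance": 0.7}


def failing_dimensions(
    per_query_scores: list[dict[str, float]],
    thresholds: dict[str, float] = THRESHOLDS,
) -> dict[str, float]:
    """Return {dimension: mean score} for every dimension whose mean is below its threshold."""
    failures = {}
    for dimension, minimum in thresholds.items():
        average = mean(scores[dimension] for scores in per_query_scores)
        if average < minimum:
            failures[dimension] = average
    return failures


scores = [
    {"faithfulness": 0.90, "answer_relevance": 0.85, "context_relevance": 0.6},
    {"faithfulness": 0.95, "answer_relevance": 0.90, "context_relevance": 0.7},
]
failed = failing_dimensions(scores)
```

In CI you would exit non-zero whenever `failed` is non-empty. Here only context relevance falls short, which points at retrieval rather than generation.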

RAG systems fail silently. A chatbot that returns wrong information with high confidence is worse than one that says "I don't know" — users trust the confident wrong answer. The difference between a trustworthy RAG system and a liability is almost always in the retrieval layer, not the model. Invest in chunking strategy, ship hybrid search, and measure with evals before you let users near it.
