Retrieval-Augmented Generation (RAG) has gone from academic curiosity to the default architecture for any LLM system that needs to answer questions about private data. But the gap between a demo that impresses investors and a system that handles real production traffic is enormous, and most of the failure modes live in details that no tutorial covers.
This post is a distillation of what we've learned building RAG pipelines for FinTech knowledge bases, legal document systems, and internal engineering wikis. We'll go deep on the decisions that actually affect quality and reliability.
Why RAG Fails in Production
Before we talk about architecture, it's worth naming the failure modes we're designing around:
- Retrieval misses: The right information exists in your corpus but never surfaces.
- Context overflow: You retrieve too much, the LLM loses the signal in the noise, and hallucinations increase.
- Semantic drift: Your embedding model's notion of "similar" doesn't match your users' intent.
- Stale index: Your vector index doesn't reflect current data, so users get confidently wrong answers about things that changed last week.
- No evals: You can't tell if a change made things better or worse, so you're flying blind.
Good architecture addresses all five.
Chunking Strategy: The Most Underestimated Decision
How you split documents before embedding them has an outsized impact on retrieval quality. The naive approach of fixed-size token windows with 20% overlap works well enough for homogeneous prose but fails badly on structured content like API docs, contracts, or financial reports.
Chunk by Semantic Unit, Not Token Count
For most document types, we use a hierarchical chunking approach:
- Split by document section (heading level 2 or 3).
- If a section exceeds your token budget (typically 512 tokens), split further by paragraph.
- Store a "parent chunk" ID so you can expand context during retrieval.
This lets you retrieve at the granular level (paragraph) but serve the LLM a richer context window (section) — a technique often called parent-document retrieval.
from langchain.text_splitter import RecursiveCharacterTextSplitter

parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=200)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=40)

parent_docs = parent_splitter.split_documents(raw_docs)
child_docs = []
for i, parent in enumerate(parent_docs):
    children = child_splitter.split_documents([parent])
    for child in children:
        child.metadata["parent_id"] = i
    child_docs.extend(children)

# Index child_docs into the vector store.
# At query time, retrieve children, fetch their parent_id,
# then send the full parent text to the LLM.
For tables, code blocks, and financial data, embed the raw text but also store a pre-formatted string representation in metadata so the LLM receives something legible even if the markdown doesn't survive round-tripping through the vector store.
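A minimal sketch of that metadata pattern, assuming the table was parsed into a pandas DataFrame and the chunk is a LangChain Document (display_text is just a field name we chose for this example):

import pandas as pd
from langchain_core.documents import Document

# df: a table parsed from the source document; chunk: the Document to embed.
df = pd.DataFrame({"quarter": ["Q1", "Q2"], "revenue_usd_m": [12.4, 15.1]})
chunk = Document(page_content=df.to_csv(index=False))

# Store a human-readable rendering alongside the embedded text
# (DataFrame.to_markdown requires the `tabulate` package).
chunk.metadata["display_text"] = df.to_markdown(index=False)

# At answer time, substitute metadata["display_text"] into the prompt
# instead of chunk.page_content.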
Embedding Model Selection
The embedding model is your retrieval ceiling — no amount of downstream engineering recovers signal the embeddings never captured.
Our current default is text-embedding-3-large (OpenAI) at 3072 dimensions, truncated to 1536 via Matryoshka representation learning. This gives roughly the same quality as the full 3072-dim model at half the storage cost and ~2× the query throughput.
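The truncation is a request-time parameter on the OpenAI embeddings API; a minimal sketch (index-time and query-time embeddings must use the same dimensions value):

from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-large",
    input=["chunk text to embed"],
    dimensions=1536,  # Matryoshka truncation from the native 3072
)
vector = resp.data[0].embedding  # len(vector) == 1536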
For on-premise or cost-sensitive deployments, BGE-M3 (BAAI) is the strongest open-weight option as of early 2026, with excellent multilingual coverage and a hybrid dense/sparse output that works natively with Elasticsearch and Vespa.
Key selection criteria:
- Run MTEB-style retrieval benchmarks on a sample of your own data, not just the public leaderboards.
- Measure latency at your expected QPS — a 10ms difference per query matters at scale.
- Check the model's max sequence length. Truncation silently hurts quality for long chunks.
Vector Database Comparison
The right vector database depends on your existing infrastructure and query patterns more than raw benchmark performance.
pgvector
If you're already on Postgres, pgvector is the lowest-friction path to production. With HNSW indexing (added in v0.5.0) it handles tens of millions of vectors with sub-10ms p99 latency. You keep ACID guarantees, existing backup tooling, and row-level security for free. The downside: horizontal scaling requires partitioning strategies you have to manage yourself.
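A minimal setup sketch using psycopg (v3), assuming the pgvector extension is installed and a hypothetical local connection string:

import psycopg

# Hypothetical DSN; adjust for your environment.
with psycopg.connect("postgresql://localhost/ragdb", autocommit=True) as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            id bigserial PRIMARY KEY,
            parent_id bigint,
            content text,
            embedding vector(1536)
        )
    """)
    # HNSW index (pgvector >= 0.5.0), cosine distance.
    conn.execute("""
        CREATE INDEX IF NOT EXISTS chunks_embedding_idx
        ON chunks USING hnsw (embedding vector_cosine_ops)
    """)

Queries then use the cosine-distance operator, e.g. ORDER BY embedding <=> %s LIMIT 10.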
Weaviate
Weaviate is our choice when the use case requires hybrid search (dense + BM25 keyword) out of the box, or when you need multi-tenancy with strict data isolation. Its GraphQL API is expressive, and the managed cloud tier handles sharding transparently. We've run it at 50M+ vectors without issues.
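For illustration, a hybrid query in the v4 Python client, assuming a collection named Docs already exists (alpha balances dense vs. BM25 scoring):

import weaviate

client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud(...)
try:
    docs = client.collections.get("Docs")
    result = docs.query.hybrid(
        query="What is the refund window?",
        alpha=0.5,  # 0 = pure BM25, 1 = pure vector
        limit=10,
    )
    for obj in result.objects:
        print(obj.properties)
finally:
    client.close()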
Pinecone
Pinecone is the fastest path to a fully managed production deployment with zero operational overhead. Serverless Pinecone (current default) bills per query rather than per pod, which makes cost predictable for bursty workloads. The trade-off: less flexibility on filtering, no BM25, and data leaves your cloud account.
Rule of thumb: start with pgvector if you're already Postgres-native; move to Weaviate if you need hybrid search or complex tenancy; use Pinecone if the team has zero appetite for infrastructure work.
Retrieval Fusion and Re-ranking
Single-vector retrieval leaves quality on the table. We run a two-stage retrieval pipeline in production:
- Stage 1 — Multi-vector recall: Run the query through both dense embedding search and BM25 keyword search. Merge results with Reciprocal Rank Fusion (RRF), which is surprisingly robust and requires no training.
- Stage 2 — Re-ranking: Pass the top-30 candidates to a cross-encoder re-ranker (we use Cohere Rerank v3 or the open-weight ms-marco-MiniLM-L-12-v2) to re-score and surface the true top-5.
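RRF itself is only a few lines. A minimal standalone sketch (k=60 is the conventional smoothing constant; dense_ids and bm25_ids are ranked ID lists from each retriever, best first):

def rrf_merge(rankings, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_merge([dense_ids, bm25_ids])[:30]

In practice we let LangChain's EnsembleRetriever do the fusion, since it implements weighted RRF out of the box: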
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# dense_retriever is a pre-configured vector-store retriever; build the
# keyword side from the same child chunks indexed earlier.
bm25_retriever = BM25Retriever.from_documents(child_docs)
bm25_retriever.k = 30

ensemble = EnsembleRetriever(
    retrievers=[dense_retriever, bm25_retriever],
    weights=[0.6, 0.4],
)

# cohere_reranker is a pre-configured CohereRerank compressor.
candidates = ensemble.invoke(query)  # top ~30 fused candidates
reranked = cohere_reranker.compress_documents(candidates, query)  # top 5
In our benchmarks, adding re-ranking improved answer relevancy by ~18 percentage points on our internal eval sets, with a latency overhead of 80–120ms — a trade-off worth making for anything customer-facing.
Evaluation With RAGAS
You cannot improve what you don't measure. RAGAS is the de facto evaluation framework for RAG pipelines and measures four things (usage sketch after the list):
- Faithfulness: Does the answer contain only information present in the retrieved context? (Hallucination proxy.)
- Answer Relevancy: Does the answer address the question?
- Context Precision: Are the retrieved chunks actually relevant?
- Context Recall: Did retrieval surface all the information needed to answer?
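A minimal evaluation sketch using the classic ragas evaluate API (column names and metric imports vary across ragas versions, and the row below is a made-up example; in practice, load the full golden set):

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_relevancy,
    context_precision,
    context_recall,
    faithfulness,
)

# One illustrative golden-set row.
eval_data = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Policy 4.2: refunds are accepted within 30 days of purchase."]],
    "ground_truth": ["Refunds are accepted within 30 days."],
})

scores = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(scores)  # per-metric averages; gate CI on the delta vs. the last run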
We integrate RAGAS into our CI pipeline: every PR that touches chunking, embedding config, or retrieval parameters must show a net improvement or neutral delta on a golden eval dataset of 200 question–answer pairs sampled from real user queries. This has caught regressions that would otherwise have shipped silently.
Index Freshness and Incremental Updates
For systems where source documents change frequently (knowledge bases, policy documents, API changelogs), a nightly full re-index is operationally simpler but expensive. We prefer an incremental update pattern (sketched in code after the list):
- Hash every source document on ingest. Store doc_hash → [chunk_ids] in a side table.
- On re-index, recompute hashes. Only re-embed and upsert chunks for documents whose hash changed.
- Delete orphaned chunk IDs for documents that were removed.
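A minimal sketch of that loop; hash_table stands in for the side table, and index.delete / index.embed_and_upsert are hypothetical wrappers around your vector store:

import hashlib

def doc_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def incremental_update(docs, hash_table, index):
    """docs: {doc_id: text}; hash_table: {doc_id: (hash, [chunk_ids])}."""
    seen = set()
    for doc_id, text in docs.items():
        seen.add(doc_id)
        h = doc_hash(text)
        prev = hash_table.get(doc_id)
        if prev and prev[0] == h:
            continue  # unchanged document: skip re-embedding entirely
        if prev:
            index.delete(prev[1])  # drop the stale chunks first
        chunk_ids = index.embed_and_upsert(doc_id, text)  # hypothetical wrapper
        hash_table[doc_id] = (h, chunk_ids)
    # Documents removed at the source: delete their orphaned chunks.
    for doc_id in list(hash_table):
        if doc_id not in seen:
            _, chunk_ids = hash_table.pop(doc_id)
            index.delete(chunk_ids)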
This cuts re-index cost by 80–95% for most document corpora with a daily change rate under 10%.
Key Takeaways
- Hierarchical parent-document chunking outperforms fixed-size windows for structured content.
- Benchmark embedding models on your own data; public MTEB leaderboards don't tell the full story.
- pgvector for Postgres shops, Weaviate for hybrid search + tenancy, Pinecone for zero-ops.
- RRF fusion + cross-encoder re-ranking is a reliable ~18-point relevancy improvement.
- Integrate RAGAS into CI — treat eval regressions the same as test failures.
- Incremental hashed updates keep index freshness cheap.