The Problem With One-Size-Fits-All RAG
Most organisations that deploy retrieval-augmented generation start with the simplest possible setup: chunk your documents, embed them, store them in a vector database, retrieve the top-k results, and pass them to a language model. It works well enough in a demo. Then production hits, and the cracks appear.
Queries that require reasoning across multiple documents return incomplete answers. Highly technical documents get mangled because the chunking strategy ignored document structure. Users ask questions that need real-time data, but the pipeline only knows what was indexed last Tuesday. The retrieval step confidently surfaces irrelevant content, and the model hallucinates to fill the gaps.
The root cause is almost always the same: the architecture was chosen for convenience, not for the actual characteristics of the data and the queries it needs to handle. Understanding the available RAG architecture patterns - and knowing which one fits your situation - is what separates a system that works in demos from one that works in production.
This article walks through the main patterns, when each one applies, and what trade-offs you are actually making when you choose one over another.
Naive RAG: When Simple Is Enough
The basic retrieve-then-generate pipeline is not a bad starting point. For many use cases, it is the right answer. The pattern works well when:
- Your document corpus is relatively homogeneous (e.g., a collection of policy documents all written in the same format)
- Queries are self-contained and do not require synthesising information from multiple sources
- Freshness requirements are low and a weekly or daily index refresh is acceptable
- You are building an internal tool where occasional retrieval failures are tolerable
A practical example: a mid-sized accounting firm wants to let staff query their internal procedure manuals. The documents are structured similarly, queries are specific ("what is the approval threshold for client entertainment expenses?"), and the cost of a wrong answer is low because staff can verify the source. A naive RAG pipeline with good chunking and a well-tuned similarity threshold will handle this cleanly.
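The whole pattern fits in a few lines. The sketch below is a minimal, self-contained stand-in: `embed` is a bag-of-words counter rather than a real embedding model, and the corpus, query, and threshold values are illustrative, but the retrieve-then-filter shape is the same one a production pipeline follows.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    # Toy stand-in for a real embedding model: a bag-of-words count vector.
    return Counter(re.findall(r"[a-z0-9$]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2, threshold: float = 0.1) -> list[str]:
    # Top-k by similarity, then drop anything under the tuned threshold.
    q = embed(query)
    scored = sorted(((cosine(q, embed(c)), c) for c in chunks), reverse=True)
    return [c for score, c in scored[:k] if score >= threshold]

chunks = [
    "Client entertainment expenses over $200 require partner approval.",
    "Travel bookings must be made through the internal portal.",
    "Annual leave requests need two weeks notice.",
]
print(retrieve("approval threshold for client entertainment expenses", chunks, k=1))
```

The similarity threshold is doing real work here: without it, the pipeline happily returns the least-bad chunk even when nothing in the corpus is relevant, which is exactly the failure mode that feeds hallucination.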
Where naive RAG breaks down is when queries are ambiguous, when answers require aggregating information from several documents, or when the corpus contains documents with very different structures and densities. In those cases, you need to move up the complexity ladder.
Advanced RAG: Fixing the Retrieval Step
Advanced RAG keeps the same overall structure but addresses the weaknesses in how retrieval is performed. The key insight is that the embedding similarity between a query and a document chunk is a noisy signal. There are several techniques for improving it.
Query rewriting uses a language model to expand or rephrase the original query before retrieval. A user asking "what did we decide about the Brisbane office?" might get better results if the query is rewritten to include synonyms and related terms before hitting the vector store.
Hybrid search combines dense vector retrieval with sparse keyword search (typically BM25). Dense retrieval handles semantic similarity; sparse retrieval handles exact term matching. For technical documentation where specific model numbers, product codes, or regulatory references matter, hybrid search consistently outperforms either approach alone.
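One common way to merge the two result lists is reciprocal rank fusion, which scores each document by its rank in every list rather than by the raw (incomparable) dense and BM25 scores. A minimal version, with invented document IDs:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each appearance contributes 1/(k + rank); k=60 is a conventional default.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_a", "doc_b", "doc_c"]   # ranked by vector similarity
sparse = ["doc_c", "doc_a", "doc_d"]   # ranked by BM25
print(reciprocal_rank_fusion([dense, sparse]))
```

Documents that rank well in both lists float to the top, which is the behaviour you want when one retriever catches the product code and the other catches the paraphrase.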
Re-ranking adds a second retrieval stage. The first pass retrieves a larger candidate set (say, top-20 results), and a cross-encoder model re-scores and reorders them before the top results are passed to the language model. Cross-encoders are slower than bi-encoders but significantly more accurate because they can consider the query and document together rather than comparing separate embeddings.
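The two-stage shape looks like this. Both scoring functions below are word-overlap stand-ins - a real first pass compares embeddings and a real re-ranker is a cross-encoder model call - but the structure (wide cheap pass, narrow expensive pass) is the pattern itself.

```python
def first_pass(query: str, corpus: list[str], k: int = 20) -> list[str]:
    # Cheap bi-encoder stand-in: rank by count of shared words.
    q = set(query.lower().split())
    return sorted(corpus, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:k]

def toy_cross_score(query: str, doc: str) -> float:
    # Placeholder for a cross-encoder, which scores query and document together.
    q, d = query.lower().split(), doc.lower().split()
    return sum(1.0 for w in q if w in d) / (len(d) + 1)

def rerank(query: str, candidates: list[str], score_fn, top_n: int = 3) -> list[str]:
    return sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)[:top_n]

corpus = ["expense policy for travel", "expense approval policy", "leave policy"]
candidates = first_pass("expense approval policy", corpus, k=2)
print(rerank("expense approval policy", candidates, toy_cross_score, top_n=1))
```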
Contextual chunking moves away from fixed-size text windows and instead preserves document structure - splitting on sections, tables, and headings rather than arbitrary token counts. For legal or technical documents, this alone can produce a meaningful improvement in retrieval quality.
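For documents that use markdown-style headings, the structure-aware split is a one-liner: break before each heading so a section and its body stay together. The sample document is invented, but the approach generalises to any format with recognisable section markers.

```python
import re

def chunk_by_heading(doc: str) -> list[str]:
    # Split before each markdown heading so every chunk keeps its full section.
    parts = re.split(r"(?m)^(?=#{1,3} )", doc)
    return [p.strip() for p in parts if p.strip()]

doc = """# Definitions
A 'client' means any engaged party.

## Approval thresholds
Entertainment over $200 needs partner sign-off.
"""
print(chunk_by_heading(doc))
```

Contrast this with a fixed 50-token window, which could easily put "Approval thresholds" in one chunk and the dollar figure in the next.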
A government agency managing a large body of legislative guidance documents is a good candidate for advanced RAG. The documents have formal structure, queries often reference specific clause numbers or defined terms, and the cost of surfacing the wrong guidance is high. Hybrid search plus re-ranking plus structure-aware chunking addresses each of those constraints directly.
Modular and Agentic RAG: When Retrieval Needs to Be Dynamic
Some queries cannot be answered with a single retrieval pass. The question "how has our customer churn rate changed since we updated the onboarding flow?" requires pulling data from multiple sources, potentially running a calculation, and synthesising a coherent answer. No static retrieval pipeline handles that cleanly.
Modular RAG treats retrieval as one tool among several. The system can route a query to different retrievers depending on its type - a vector store for unstructured documents, a SQL database for structured data, an API for real-time information. The routing logic can be rules-based or model-driven.
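A rules-based router can be as simple as keyword dispatch. The retriever names and trigger words below are illustrative; a model-driven router replaces the `if` chain with a classification call but keeps the same interface.

```python
def route(query: str) -> str:
    # Hypothetical retriever names; rules are illustrative, not exhaustive.
    q = query.lower()
    if any(w in q for w in ("current", "today", "live", "right now")):
        return "realtime_api"      # freshness signals -> live data source
    if any(w in q for w in ("how many", "average", "total", "rate")):
        return "sql_database"      # aggregation signals -> structured data
    return "vector_store"          # default: unstructured document search

print(route("What is the current share price?"))
print(route("What is our average churn rate?"))
print(route("Summarise the onboarding policy"))
```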
Agentic RAG goes further. A language model acts as an orchestrator, deciding at each step whether to retrieve more information, call a tool, or produce a final answer. The model can issue multiple retrieval queries in sequence, reason about whether the results are sufficient, and continue gathering information until it has what it needs. This is sometimes called iterative or multi-hop retrieval.
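The orchestration loop itself is small; the intelligence lives in the decision function. In this sketch `decide_next` is a scripted stand-in for the model (so the loop runs without one), and the tool names and strings are invented - the point is the shape: decide, act, accumulate context, repeat until the model chooses to answer.

```python
def agentic_answer(question, decide_next, tools, max_steps=5):
    # decide_next plays the role of the orchestrating LLM: it returns either
    # ("answer", text) or (tool_name, argument) to gather more evidence.
    context = []
    for _ in range(max_steps):
        action, arg = decide_next(question, context)
        if action == "answer":
            return arg, context
        context.append(tools[action](arg))
    return "gave up after max_steps", context   # hard stop is a key guardrail

# Scripted stand-ins so the loop is runnable without a model:
def decide_next(question, context):
    if not context:
        return ("search", "churn rate after onboarding change")
    return ("answer", f"answer based on {len(context)} retrieval step(s)")

tools = {"search": lambda q: f"results for '{q}'"}
answer, trace = agentic_answer("How has churn changed?", decide_next, tools)
print(answer)
```

Note the `max_steps` cap and the returned `trace`: bounded loops and full logging of every intermediate step are the minimum guardrails before putting a pattern like this in front of users.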
The trade-off is latency and complexity. An agentic pipeline might take several seconds to complete because it is running multiple retrieval and reasoning steps. It is also harder to debug - when something goes wrong, the failure could be in the routing logic, the retrieval step, the tool call, or the final generation.
A financial services firm building an internal research assistant is a realistic candidate for this pattern. Analysts ask compound questions that span structured portfolio data, unstructured research notes, and live market feeds. A single-pass retrieval system cannot serve that use case. An agentic architecture with careful guardrails and logging can.
GraphRAG: When Relationships Matter
Standard vector retrieval treats documents as independent units. It does not understand that Document A references Document B, or that Entity X appears in five different contexts across the corpus. GraphRAG addresses this by building a knowledge graph over the document corpus and using graph traversal as part of the retrieval process.
Microsoft's GraphRAG research demonstrated that for queries requiring global summarisation - "what are the main themes across all of these documents?" - graph-based retrieval significantly outperforms naive vector search. The graph structure lets the system identify communities of related concepts and retrieve information at the right level of abstraction.
This pattern is most useful when:
- Your corpus has dense cross-referencing between documents (regulatory frameworks, academic literature, legal case law)
- Users frequently ask comparative or thematic questions rather than factual lookups
- You need to surface relationships between entities that are not explicitly stated in any single document
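The core mechanic behind the last point can be sketched with a tiny co-occurrence graph: entities that appear in the same document get an edge, and a multi-hop traversal surfaces relationships no single document states. Real GraphRAG uses model-driven entity and relationship extraction; the entity names here are placeholders.

```python
from collections import defaultdict
from itertools import combinations

def build_graph(docs: list[list[str]]) -> dict[str, set[str]]:
    # Edge between every pair of entities that co-occur in one document.
    graph = defaultdict(set)
    for entities in docs:
        for a, b in combinations(entities, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def neighbours_within(graph, start, hops=2):
    # Breadth-first traversal: entities reachable within `hops` steps.
    seen, frontier = {start}, {start}
    for _ in range(hops):
        frontier = {n for e in frontier for n in graph[e]} - seen
        seen |= frontier
    return seen - {start}

docs = [["Act 1999", "Clause 4"], ["Clause 4", "Entity X"], ["Entity X", "Case A"]]
g = build_graph(docs)
print(sorted(neighbours_within(g, "Act 1999", hops=2)))
```

No document links "Act 1999" to "Entity X" directly, yet the two-hop traversal finds the connection - the kind of result pure vector similarity cannot produce.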
The cost is significant. Building and maintaining a knowledge graph adds infrastructure complexity, and graph construction requires entity extraction and relationship mapping that is computationally expensive. For most organisations, GraphRAG is worth evaluating only after simpler patterns have been exhausted.
Selecting the Right Pattern for Your Context
Choosing between RAG architecture patterns is not primarily a technical decision - it is a product and operational decision. Start by answering these questions honestly:
What are the characteristics of your queries?
- Single-fact lookups favour naive or advanced RAG
- Multi-hop reasoning favours agentic RAG
- Thematic or relational queries favour GraphRAG
What are your latency requirements?
- Sub-second responses push you toward simpler architectures
- Analytical use cases where users expect to wait a few seconds open up agentic approaches
What is the cost of a wrong answer?
- Low-stakes internal tools can tolerate retrieval noise
- Customer-facing or compliance-critical applications need higher precision, which usually means re-ranking and stricter retrieval thresholds
How often does your data change?
- Frequently updated data requires efficient incremental indexing
- Real-time requirements may push you toward hybrid architectures that combine indexed retrieval with live API calls
What can you actually maintain?
- A sophisticated agentic pipeline that your team cannot debug or monitor is worse than a simpler system that works reliably
- Operational complexity is a real cost that gets paid every time something breaks in production
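The checklist above can be collapsed into a rough first-cut decision rule. This encodes only three of the axes and deliberately ignores the rest; it is a conversation starter, not a substitute for measuring real failure modes.

```python
def suggest_pattern(multi_hop: bool, relational: bool, sub_second: bool) -> str:
    # Rough heuristic over three of the axes discussed above.
    if relational:
        return "graphrag"            # thematic/relational queries
    if multi_hop and not sub_second:
        return "agentic"             # compound questions, latency tolerated
    return "naive"                   # start simple; upgrade components on evidence

print(suggest_pattern(multi_hop=True, relational=False, sub_second=False))
```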
One useful heuristic: start with the simplest pattern that could plausibly work, measure its failure modes carefully, and upgrade only the specific components that are causing problems. Jumping straight to agentic RAG because it sounds more capable is a reliable way to build something expensive and fragile.
Evaluation Is Not Optional
Whichever RAG architecture pattern you implement, you need a way to measure whether it is working. This is where many projects fall short. Teams build a pipeline, run a few manual tests, decide it looks reasonable, and ship it. Six months later, they have no idea whether the system is actually answering questions correctly.
A minimal evaluation framework for RAG should cover:
- Retrieval quality - are the right chunks being retrieved? Measure recall and precision against a labelled test set
- Answer faithfulness - does the generated answer accurately reflect the retrieved context, or is the model adding information that is not there?
- Answer relevance - does the answer actually address what the user asked?
- End-to-end accuracy - for questions with known correct answers, how often does the system get it right?
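The first of those - retrieval precision and recall against a labelled set - requires no tooling at all. The document IDs below are placeholders for a real labelled test set:

```python
def retrieval_metrics(retrieved: list[str], relevant: set[str]) -> dict[str, float]:
    # Precision: what fraction of retrieved chunks were relevant?
    # Recall: what fraction of relevant chunks were retrieved?
    hits = sum(1 for doc in retrieved if doc in relevant)
    return {
        "precision": hits / len(retrieved) if retrieved else 0.0,
        "recall": hits / len(relevant) if relevant else 0.0,
    }

print(retrieval_metrics(["d1", "d2", "d3", "d4"], {"d1", "d3", "d9"}))
```

Run this over every query in the labelled set after each pipeline change, and architecture decisions stop being arguments about intuition.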
Tools like RAGAS provide automated metrics for some of these dimensions. They are not perfect, but they are far better than relying on subjective human review alone. Building evaluation into your development process from the start lets you make architecture decisions based on data rather than intuition.
What to Do Next
If you are currently running a naive RAG pipeline and hitting quality problems, the most productive first step is to analyse your failure cases systematically. Collect 50-100 examples where the system gave a poor answer and categorise them: were the right documents retrieved but poorly chunked? Was the retrieval correct but the generation unfaithful? Did the query require information from multiple documents? The pattern of failures tells you which architectural component to address first.
If you are starting a new RAG project, spend time upfront characterising your query types and document corpus before committing to an architecture. A one-hour workshop with the people who will actually use the system will surface requirements that no amount of technical experimentation will reveal.
If you are not sure where to start, or if you have a complex use case that does not fit neatly into any of the patterns described here, get in touch with the Exponential Tech team. We work with Australian organisations to design and build RAG systems that are appropriate for their actual operational context - not the one that looked good in a conference talk.