RAG Architecture Layers — 24 Patterns in a Cognitive Stack

An LLM is just neon lights in the code, not intelligence. The illusion of intelligence comes from the layers—memory, attention, control, feedback. 24 patterns, 24 error modes.

VZ editorial frame

Read this piece through one operating lens: AI does not automate first, it amplifies first. If the underlying decision architecture is clear, AI scales clarity. If it is noisy, AI scales noise and cost.

VZ Lens

From the VZ perspective, this topic matters only when translated into execution architecture. Its business impact starts when it becomes a weekly operating discipline.

TL;DR — RAG is not a pipeline. It is a layered cognitive architecture.

RAG (Retrieval-Augmented Generation) is not about choosing a “good embedding” and a “good index.” Quality is determined by how the system organizes memory, retrieves information, allocates attention, and handles uncertainty. This isn’t pipeline design—it’s cognitive architecture design. There are 24 recurring patterns, each highlighting a specific failure mode. The LLM is just neon lights in the code—the illusion of intelligence comes from the layers.

“The system doesn’t think where the model runs—but where the layers meet.”


Why aren’t good embeddings and a good index enough?

Over the past few months, I’ve been iterating on my own personal and corporate RAG projects—with sharp questions, real documents, and real errors. The most important lesson I learned is that RAG is not a pipeline, but a layered cognitive architecture. A “good embedding” and a “good index” are not enough, because quality is determined by how the system organizes memory, retrieves information, allocates attention, and handles uncertainty.

This insight is not new. The classic questions of cognitive architecture design—which the ACT-R (Adaptive Control of Thought — Rational) and SOAR systems have been exploring since the 1980s — can be very clearly translated into RAG terminology. It is not the technology that is new, but the context of its application.

Four fundamental questions:

  1. What does long-term memory consist of? How do you represent knowledge, and how do you search within it? This concerns the corpus, indexes, metadata, hierarchy, and graph.
  2. What is working memory? How much does it hold, and how do you protect it from noise? This concerns the context window, context building, reranking, and compression.
  3. How does the system choose goals and sequences of actions? When is one step enough, and when are more needed? This concerns routing, query rewriting, fusion, and multihop.
  4. Where are control and metacognition? How does the system recognize that it doesn’t know enough? When does it correct itself? This is a question of feedback mechanisms such as Self-RAG, Corrective RAG, and Active RAG.

In the ACT-R system, declarative memory and procedural memory operate as separate modules; the attentional system allocates capacity, and metacognitive control regulates when to seek new information. In SOAR, working memory, long-term memory, and the decision cycle perform exactly the same functions that the indexes, context window, and routing perform in RAG. The parallelism is not a metaphor—it is a design framework.


Why is the RAG architecture a cognitive parallel?

You build a cognitive architecture when you optimize not the answer, but the order of retrieval, the quality of evidence, and the loops of correction. Naive RAG is baseline memory. The layers are attention, problem decomposition, and self-checking. This results in decision support—not narrative generation.

Cognitive function     | RAG counterpart            | Failure mode if missing
Long-term memory       | Corpus + indexes + graph   | Poor source, domain confusion
Working memory         | Context window + reranking | Noise, irrelevant context
Attention selection    | Reranking + compression    | Model infers from noise
Goal selection         | Routing + query rewriting  | Poor search strategy
Problem decomposition  | Multihop + iteration       | Complex question in a single step
Metacognition          | Self-RAG + Corrective RAG  | Confident error
Executive control      | Governance + audit         | Wrong information ends up in the wrong place

The relationship between RAG and cognitive architecture is therefore not a poetic metaphor, but a design context. The context window is working memory. Retrieval and reranking are the attention system. Multi-step search is the problem-solving sequence of operations. The metacognitive layers are the error detection and correction mechanisms.

And there is another corporate, very important component that is implicit in cognitive architectures as well, but here it is explicitly stated: governance (regulatory and control framework). Without authorization, data protection, auditing, and measurability, the system will not be “smart,” but unpredictable—because the correct answer to the same question varies by role and time, and the most costly error is often not the mistake itself, but the placement of the wrong information in the wrong place.


I. Memory and Access — The Layers of the Corpus

The first layer is the system’s “memory.” All other cognitive functions stand or fall on this: if memory is poor, attention and control are useless, no matter how excellent they may be. In cognitive science, this is declarative memory—the facts, documents, and rules that form the system’s “knowledge base.”

1. Naive RAG — the baseline approach

This is the minimal working solution. An “embedding” is created from the question, the system retrieves a few text fragments that seem relevant, and then generates an answer based on them. Its value lies not in its sophistication, but in its role as a diagnostic tool: it shows how viable your corpus, your chunking (text segmentation), and the underlying relevance are.

Corporate example. An internal IT helpdesk assistant that answers questions about VPN settings, password reset processes, and basic permissions based on the wiki—and this alone reduces the ticket load.

Troubleshooting. If the naive RAG doesn’t work, it’s not the architecture that’s at fault—it’s the corpus, the indexing, or the chunking. This is the layer where the problem must first be diagnosed.

2. Hybrid RAG — combining semantic and keyword search

The system searches not only “by meaning” (dense, semantic) but also for “exact matches” (sparse, keyword-based). The combination of the two is important because corporate knowledge is full of identifiers, error codes, product codes, names, and versions. This layer comes in handy when the semantics are good—but lead to the wrong document.

Corporate example. During an incident, someone types in “invalid_grant 401,” and the system finds the exact runbook (operational guide) based on the error code, not a general “SSO issues” page.

Error mode. A purely semantic search returns a document with “similar meaning”—not the exact one. Without the sparse layer, the system “interprets” the error code instead of looking it up.
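A minimal sketch of this dense-plus-sparse blend, with toy stand-ins as assumptions: `sparse_score` approximates a BM25-style keyword match with raw term overlap, and `dense_score` is plain cosine similarity over precomputed embedding vectors. A real system would use an actual BM25 index and an embedding model.

```python
from collections import Counter
import math

def sparse_score(query, doc):
    """Keyword-overlap score (a crude stand-in for BM25)."""
    q = Counter(query.lower().split())
    d = Counter(doc.lower().split())
    return sum((q & d).values())

def dense_score(q_vec, d_vec):
    """Cosine similarity between embedding vectors."""
    dot = sum(a * b for a, b in zip(q_vec, d_vec))
    norm = math.sqrt(sum(a * a for a in q_vec)) * math.sqrt(sum(b * b for b in d_vec))
    return dot / norm if norm else 0.0

def hybrid_rank(query, q_vec, docs, alpha=0.5):
    """Rank docs by a weighted blend of min-max-normalized scores.

    docs: list of (text, embedding_vector); alpha weights the dense side.
    """
    sparse = [sparse_score(query, text) for text, _ in docs]
    dense = [dense_score(q_vec, vec) for _, vec in docs]

    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]

    s, d = norm(sparse), norm(dense)
    scored = [(alpha * dv + (1 - alpha) * sv, text)
              for sv, dv, (text, _) in zip(s, d, docs)]
    return [text for _, text in sorted(scored, reverse=True)]
```

With this blend, the exact-match signal for “invalid_grant 401” pulls the specific runbook above the semantically similar but generic “SSO issues” page.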

3. Multi-index RAG — separating domains

You don’t search for everything in a single shared index, but across multiple knowledge domains: separate HR policy, separate legal, separate ops, separate sales. This is important because the vast majority of relevance is often determined by which “shelf” you reach for—and only then by what you find on that shelf.

Corporate example. The same word—“termination”—means “contract terms” in the legal index and “termination of employment” in the HR index. Choosing the wrong index is the most common cause of incorrect answers.

How it goes wrong. A single “mixed” index causes domain confusion. The system doesn’t know which shelf to retrieve from—and neither does the user.


II. Attention and Noise Reduction—Maintaining Context

The second layer is the system’s “attention system.” The question isn’t what it finds—but what it allows into the model’s context window. In cognitive science, this is selective attention and working memory management: the deliberate allocation of finite capacity.

4. Query routing — selecting the correct path

The system first decides whether to search for a document, retrieve data, or whether no external source is needed at all. This is important because “always retrieve” often produces noise, while “never retrieve” produces hallucinations.

Corporate example. “How many vacation days do I have left?” — not a document, but an HR system query. “What are the rules for granting vacation?” — a policy, i.e., RAG.

Error mode. Without routing, the system searches for the vacation policy when a single database query is needed — or conversely, tries to generate data when it should be searching for a rule.
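The routing decision can be sketched as follows; the hand-written keyword rules are illustrative assumptions, standing in for the LLM-based classifier a production router would typically use.

```python
def route(query):
    """Rule-based router sketch: structured-data query, document retrieval,
    or no external source at all."""
    q = query.lower()
    data_signals = ("how many", "how much", "my balance", "remaining", "status of")
    policy_signals = ("rules", "policy", "process", "allowed", "procedure")
    if any(s in q for s in data_signals):
        return "data_query"      # hit the HR/CRM/ticket system, not the wiki
    if any(s in q for s in policy_signals):
        return "retrieve"        # fetch policy documents (RAG)
    return "no_retrieval"        # let the model answer from general knowledge
```

Even this crude gate separates “How many vacation days do I have left?” (a database query) from “What are the rules for granting vacation?” (a policy retrieval).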

5. Query rewriting — reformulating the query

Queries written by users are often too short, too vague, or use the wrong words. Rewriting reformulates them into a form more suitable for searching—adding key terms and clarifying them. This is important because good retrieval often isn’t about a “smarter index,” but a “better query.”

Corporate example. “Can’t log in” is rewritten into searches such as “SSO token expired,” “password reset loop,” “account locked”—and suddenly the results become relevant.

Error mode. There is a semantic gap between the question written in the user’s language and the language of the document. Without rewriting, the system does not find what the user is looking for—but rather what most closely resembles the text of the query.

6. Multi-query — multiple perspectives, multiple searches

You don’t create a single rewritten query, but multiple ones from different perspectives. This is important because a single query searches from a single perspective—and the relevant document often describes the answer using different vocabulary and from a different angle.

Corporate example. “Why is retention declining?” (customer retention) is run as separate queries on support tickets, NPS comments, and release notes—and together, these paint the full picture.

The problem. A single query yields a single list of results. If the answer lives in multiple silos, you’ll never find it all at once.

7. RAG fusion — merging result lists

If multiple searches yield multiple result lists, fusion merges them so you don’t rely solely on the strongest list. This is important because the “best answer” often requires merging several moderately good results.

Corporate example. Separate result lists for tender documents, internal architecture documentation, and risk logs are merged—and this makes the answer “complete.”

Failure mode. Without fusion, the system relies on the top results from the strongest list. Moderately strong but critically important results are left out.
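Reciprocal Rank Fusion (RRF) is a common way to implement this merge; a compact sketch:

```python
def rrf(result_lists, k=60):
    """Reciprocal Rank Fusion: merge several ranked lists into one.

    A document earns 1/(k + rank) from each list it appears in; the
    constant k dampens the advantage of a single top rank, so documents
    that are moderately high in several lists can beat a one-list winner.
    """
    scores = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Note how a document ranked second in all three lists outranks documents that topped only one list.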

8. Reranking — Re-evaluating Relevance

The first round of retrieval is a rough filter—it returns many results, not all of which are relevant. The reranker is a second model that reorders the results based on their actual relevance to the query. The top k results become concrete evidence.

Corporate example. When searching for “HR home office policy,” reranking prioritizes exceptions and the approval chain, not the introductory narrative.

Error mode. Without reranking, the model generates based on the very first result—which isn’t necessarily the most relevant, just the most similar in the embedding space.
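A minimal second-stage sketch; the lexical-overlap `scorer` here is a placeholder assumption, where a production system would call a cross-encoder model that scores the query and candidate together.

```python
def rerank(query, candidates, top_k=3, scorer=None):
    """Rescore first-pass candidates with a stronger relevance function
    and keep only top_k as evidence for the context window."""
    if scorer is None:
        def scorer(q, doc):
            # naive term-overlap placeholder for a cross-encoder
            q_terms = set(q.lower().split())
            d_terms = set(doc.lower().split())
            return len(q_terms & d_terms) / max(len(q_terms), 1)
    return sorted(candidates, key=lambda d: scorer(query, d), reverse=True)[:top_k]
```

The design point is the two-stage shape: a cheap recall-oriented retriever first, then an expensive precision-oriented scorer on the short list only.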

9. Context Compression — Protecting Working Memory

You don’t feed the entire chunk to the model; instead, you extract the sentences relevant to the question. This is crucial because the context window is a narrow attention channel, and the model can very easily draw the wrong conclusions from the noise.

Corporate example. During an audit, you only need the specific steps from a control description—the rest is “fluff” that leads you astray, and compression cuts this out.

Error mode. A context window filled with entire chunks = attention overload. The model does not focus on relevant sentences, but on the most recent or loudest information—recency bias (the tendency to rely on the last impression) and noise cause the most hidden errors here.
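An extractive compression sketch, assuming a crude term-overlap relevance signal; embedding-based sentence scoring would be the more robust choice in practice.

```python
def compress_context(query, chunk, max_sentences=2):
    """Keep only the sentences of a chunk that overlap most with the
    query, instead of passing the whole chunk into the context window."""
    q_terms = set(query.lower().split())
    sentences = [s.strip() for s in chunk.split(".") if s.strip()]
    # score by term overlap; stable sort keeps document order on ties
    scored = sorted(sentences,
                    key=lambda s: len(q_terms & set(s.lower().split())),
                    reverse=True)
    kept = scored[:max_sentences]
    # re-emit the kept sentences in their original order
    return ". ".join(s for s in sentences if s in kept) + "."
```

The irrelevant trailing sentences never reach the model, so they cannot become the “loudest” information in the window.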


III. Execution and Problem Decomposition — Multi-Step Search

The third layer is the system’s executive function, as it is known in cognitive science. It does not ask, “What did I find?” but rather, “How should I search, in how many steps, and in what order?” A single search is sufficient for simple questions. Complex questions require multiple steps, multiple sources, and multiple strategies.

10. Citation-Oriented RAG — Verifiability

The system works by attaching specific source excerpts and citations to important statements. This is essential because, in a corporate environment, an answer is only valuable if it is verifiable and open to discussion.

Corporate example. A legal assistant states the terms of termination and includes the relevant clause and paragraph from the contract.

How it goes wrong. Without a citation, the system “makes a statement”—and the user has no way of knowing where the claim comes from. Trust erodes, and the system loses its credibility.

11. Multihop RAG — problem decomposition

The system breaks the question down into sub-steps, retrieves information from multiple sources separately, and then assembles the answer. This is important because many business questions are actually multiple questions in a single sentence.

Corporate example. “Which suppliers had SLA violations, and how much in penalties did we pay?” — It searches separately in the SLA incident log and financial payments, then compiles the answer.

Failure mode. A single search attempts to answer the complex question from a single source — and returns either the SLA data or the financial data, never both.

12. Iterative RAG — search-write-correction cycle

The system doesn’t just search and write once; it works in a cycle: search, write, check, search again, refine. This is important because missing details often only come to light while formulating the answer.

Corporate example. An incident report is being generated; the model writes the narrative, then realizes that the rollback time is missing—it searches for it again and only then finishes the report.

Failure mode. Single-pass RAG is locked into the context loaded at the beginning of the response. If it turns out during the writing of the response that something is missing, there is no way to correct it.


IV. Search Strategies — Expanding the Semantic Space

This section focuses on how to broaden the scope of a search and how to bring the hidden knowledge of the corpus to the surface. The cognitive analogy: associative thinking and creative problem-solving — when the system does not search along the obvious path, but tries detours.

13. Active RAG — retrieval triggered by uncertainty

During generation, the system searches again at signs of uncertainty—typically where it would start to “guess.” This is important because it means you’re not locked into the context loaded at the beginning of the response, but instead seek evidence where risk actually arises.

Corporate example. In a financial summary, it is uncertain about a number or date—so it goes back to the source before stating it.

Error mode. The passive system makes confident assertions even when it shouldn’t. Active search identifies precisely the moment of “guessing”—and requests new evidence there.

14. Self-RAG — Metacognitive Decisions

The system makes metacognitive decisions: whether retrieval is needed, whether the evidence is sufficient, whether the context is appropriate—and adjusts its steps accordingly. This is essential because it reduces confident errors and also optimizes costs. Metacognition (thinking about thinking) rises to the level of the system.

Corporate example. In the case of “Explain EBITDA,” it does not retrieve information because it is a well-known concept. In the case of “What are our company’s cost accounting rules?”, it does, because internal consistency is required.

Error mode. Without metacognition, the system either always searches (expensive and noisy) or never searches (hallucination). Self-RAG asks itself the question “Is it worth searching?”
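A deliberately simplified gate in this spirit; the substring markers are illustrative assumptions, whereas Self-RAG proper trains the model to emit reflection tokens that make this decision.

```python
def retrieval_decision(query, internal_markers=("our", "internal", "company", "policy")):
    """Metacognitive gate sketch: retrieve only when the question likely
    depends on internal knowledge; answer directly for general concepts.

    Substring matching is crude (e.g. 'hour' contains 'our'); a real
    system would use reflection tokens or a trained classifier.
    """
    q = query.lower()
    if any(marker in q for marker in internal_markers):
        return "retrieve"
    return "generate_directly"
```

The payoff is twofold: fewer confident errors on internal questions, and no retrieval cost or noise on questions the model can safely answer alone.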

15. Corrective RAG — Error Detection and Correction

There is a layer that recognizes when retrieval goes wrong—and corrects it when that happens. It searches again, chooses a different index, or asks a question. This is essential because retrieval errors will always occur—the question is whether you notice them.

Corporate example. A support bot returns an article for the wrong product version; the corrective layer spots the version number discrepancy and searches again based on the version tag.

Error mode. Without correction, the system generates results from the incorrect hit. The user receives the wrong answer and doesn’t know why—because the system didn’t know either.

16. HyDE — searching with a hypothetical document

HyDE (Hypothetical Document Embeddings) is a surprising strategy: the system first generates a “hypothetical ideal document” and uses it to search the corpus. This is important because it can provide a better “compass” for the search even with weak, layperson’s, or overly brief queries.

Corporate example. New internal concept, little documentation—yet HyDE still finds the relevant presentations and meeting notes because the idealized document is semantically closer to the target than the user’s brief query.

Failure mode. The hypothetical document can mislead the search if the generated “ideal answer” itself is fabricated. This technique is therefore not suitable for everything—but in specific cases, it improves performance dramatically.


V. Structure and Knowledge Representation — The Architecture of Memory

The fifth layer is no longer about individual searches, but about how you organize knowledge as a whole. The cognitive analogy: the semantic network and the conceptual hierarchy—knowledge is not a flat list, but an organized system in which levels of abstraction, relationships, and perspectives build upon one another.

17. Hierarchical RAG — Multi-level Abstraction

Knowledge does not exist at a single chunk level, but at multiple levels of abstraction: sentence, paragraph, chapter, document summary. This is important because, when dealing with long texts, you need both the big picture and specific details at the same time.

Corporate example. For a tender spanning hundreds of pages, you first generate a chapter-level summary, then drill down to the specific requirement.

Common pitfalls. Searching at a single level of abstraction either yields overly general results (if searching at too high a level) or gets lost in the details (if searching at too low a level).

18. Summary Tree — A Structural Map of the Corpus

You build a “summary tree” from the corpus, and retrieval fetches content either from the leaves (specific text segments) or from internal nodes (summaries). This is important because the question “what’s the point” is often a structural question, not a sentence-level question.

Corporate example. A management brief doesn’t require a single paragraph, but rather a “master framework” composed of multiple documents—which the tree’s nodes represent well.

Failure mode. Without a summary tree, the executive’s question (“What’s the status of the Q3 projects?”) yields specific sentences instead of constructing the answer from overview-level nodes.
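The summary-tree descent can be sketched as follows, with a hypothetical `Node` as an assumption: internal nodes carry summaries, leaves carry the concrete text segments, and the search only descends into branches whose summary overlaps the query.

```python
class Node:
    """Hypothetical summary-tree node: internal nodes hold summaries,
    leaves hold the concrete text segments."""
    def __init__(self, summary, text=None, children=None):
        self.summary = summary
        self.text = text
        self.children = children or []

def tree_search(node, query_terms):
    """Descend only into children whose summary overlaps the query;
    return the leaf texts reached this way."""
    if not node.children:          # leaf: concrete text segment
        return [node.text]
    hits = []
    for child in node.children:
        if query_terms & set(child.summary.lower().split()):
            hits.extend(tree_search(child, query_terms))
    return hits
```

For executive briefs, the same walk can stop at a chosen depth and return the internal nodes’ summaries instead of leaf texts.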

19. GraphRAG — a network of entities and relationships

You treat knowledge as a network of entities and relationships, and retrieval involves both graph traversal and focused extraction. This is important because many corporate questions are relationship-based: who is connected to whom, what is the consequence of what—and it’s difficult to reliably retrieve this information from plain text proximity alone.

The knowledge graph is one of the most promising forms of corporate knowledge management because it answers not only the question “what is in it,” but also the question “what is connected to what.”

Corporate example. In the case of “Why is the project behind schedule?”, meeting minutes, Jira dependencies, decision logs, and risk lists together reveal the causal chain.

Failure mode. Vector search alone does not find causal relationships. “The project is behind schedule because person X was removed from task Y because decision Z was made”—only a graph reveals this.
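That causal-chain retrieval can be sketched as a breadth-first search over (subject, relation, object) triples; the example triples in the test are invented for illustration.

```python
from collections import deque

def causal_chain(triples, start, goal):
    """Find the chain of relation edges linking two entities via BFS.

    Returns the list of (subject, relation, object) hops from start to
    goal, or None if no directed path exists. This is exactly the kind
    of connection that text-proximity retrieval cannot surface.
    """
    adjacency = {}
    for subj, rel, obj in triples:
        adjacency.setdefault(subj, []).append((rel, obj))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None
```

The returned hop list can then be verbalized and cited as the “why” behind the answer.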


VI. Context and Adaptation — The System’s Sense of Time

The sixth layer concerns how the system handles time, user context, and multiple data sources. The cognitive analogy: episodic memory and context-dependent recall—the brain does not retrieve the same information in every situation, but selects based on the situation, time, and person.

20. Data-augmented RAG — not just documents, but data too

The system not only searches for documents but also retrieves data: CRM, ERP, ticket systems, databases, and calculations. This is essential because many questions are not about “knowledge” but about “current status.”

Corporate example. “What’s the next step with which client?” — not a wiki, but the CRM pipeline and the latest email thread combined.

Failure mode. If the system answers only from documents, it provides outdated or irrelevant answers to “current status” type questions.

21. Memory-oriented RAG — integration of short- and long-term memory

You manage the short-term conversation memory (session context), the user profile (user context), and the long-term document knowledge (corpus) separately—and control which one matters when. This is important because a good answer is often the intersection of stable knowledge and the current situation, and the system can easily confuse the two.

Corporate example. The same policy may vary by country, business unit, or role—without the profile and session context, you’ll apply the wrong rule.

Error mode. The system “forgot” who it’s talking to. The response is technically correct, but meaningless in the user’s context.

22. Temporal RAG — Time as a Dimension

You retrieve data in a time-sensitive manner: you handle versions, expiration dates, recency, and the distinction between “then” and “now.” This is important because the most common mistake in companies is returning an old rule as if it were currently valid.

Corporate example. When asked, “What is the travel policy?” the system returns the latest, valid version—not last year’s PDF.

Error mode. Without temporality, the system cannot distinguish between information that was “valid at some point” and information that is “currently valid.” The user makes a decision in 2025 based on a 2024 rule—and does not realize they have received outdated information.
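A sketch of version selection with validity windows; the `valid_from`/`valid_to` fields are an assumed metadata schema, not a standard.

```python
from datetime import date

def current_version(documents, today=None):
    """Pick the document version valid now: filter by validity window,
    then take the most recently effective one.

    Each document is assumed to carry valid_from (date) and valid_to
    (date or None for still-in-force).
    """
    today = today or date.today()
    valid = [d for d in documents
             if d["valid_from"] <= today
             and (d["valid_to"] is None or today <= d["valid_to"])]
    return max(valid, key=lambda d: d["valid_from"]) if valid else None
```

The same function answers “what was valid then” by passing a historical date, which is exactly the then-versus-now distinction the pattern demands.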

23. Multimodal RAG — Evidence Beyond Text

Evidence can include not only text, but also images, diagrams, tables, PDF figures, and even audio. This is important because a great deal of “real-world information” is visual or structured, and becomes distorted when converted to plain text.

Corporate example. In manufacturing, the system searches for similar cases and the corresponding corrective actions based on a photo of a defect.

Failure mode. If the system searches only for text, visual evidence—such as photos of components, process diagrams, and dashboard screenshots—remains invisible.


VII. Governance — the system’s immune system

The final layer is the “immune system” of the enterprise RAG system. This is not an optional add-on—it is this layer that distinguishes a hobby project from a production system. In cognitive architectures, this is executive control and impulse inhibition: what not to say, who not to show it to, when to stop.

24. Security and Governance RAG

Authorization, data protection, logging, auditing, prompt injection (malicious input) protection, source authentication—all in a single integrated layer. This is important because in a company, the most costly mistake is often not the error itself, but information ending up in the wrong place.

Corporate example. Two people receive two different answers to the same question because they have access to different documents — and the system doesn’t “hide” this afterward, but retrieves the results this way from the start. Access rights are not a filter on the answer, but a filter on the search.

Error mode. Without governance:

  • Someone without access can obtain financial data
  • The audit does not show who asked what and from which source they received the answer
  • Prompt injection bypasses security, and the system reveals internal documents
  • The system is “smart”—but unpredictable and uncontrollable

This layer in an enterprise RAG system is not a “nice to have”—it is the license to operate. Without it, the system must not be deployed in a production environment.
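The search-time permission filtering described above can be sketched as follows; `allowed_roles` is an assumed metadata field, and the point is that filtering happens before retrieval, not on the finished answer.

```python
def retrieve(query_terms, documents, user_roles):
    """Access rights as a filter on the search, not on the answer.

    Documents the user cannot access never enter the candidate set,
    so they cannot leak into the context window or the response.
    """
    visible = [d for d in documents if d["allowed_roles"] & user_roles]
    return [d["text"] for d in visible
            if query_terms & set(d["text"].lower().split())]
```

This is why two people legitimately get different answers to the same question: the retrieval corpus itself differs per role.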


How do you combine the 24 patterns in practice?

The 24 patterns are not “RAG types” from which you choose. They are combinable cognitive capabilities, each of which addresses a typical failure mode: bad source, noise, multiple steps, time and version, confident error, access risk.

The logic of development is not fine-tuning, but adding cognitive functions:

  1. First: Access and relevance — hybrid search and multiple indexes to draw from the right knowledge base and avoid mixing domains
  2. Next: Attention and noise reduction — reranking and compression to turn top candidates into actual evidence
  3. Then: Problem decomposition and multi-step processing — routing, rewriting, fusion, and multihop to prevent the system from attempting to solve complex questions in a single step
  4. Finally: Control and self-correction — when the system knows when to search again, when to stop, and when to conclude with a citation

Enterprise AI doesn’t fail where the models do, but where memory and governance intersect. RAG is an external nervous system that must know what to admit into working memory, when to request new evidence, and to whom it can reveal it. The future lies not in “bigger models,” but in a better-designed cognitive stack.


Key Ideas

  • RAG is not a pipeline—it is a layered cognitive architecture in which memory, attention, execution, and metacognition appear as separate functions
  • The 24 patterns are not mutually exclusive “types,” but combinable capabilities that you choose based on error modes
  • The cognitive architecture parallel (ACT-R, SOAR) is not a metaphor—it is a design framework that brings decades of research experience to RAG design
  • Governance is not optional—this layer is what distinguishes a hobby project from a production system
  • The model is not the system—the layers are the system. The model is just neon lighting. The stack is consciousness.

FAQ

What is RAG, and why isn’t an LLM enough on its own?

RAG (Retrieval-Augmented Generation) means that the language model searches a knowledge base for relevant documents before responding. On its own, an LLM responds “from memory”—which means it knows nothing beyond its training data and is prone to hallucinations (convincingly presenting non-existent facts).

Which pattern should you start with?

Always with naive RAG (Pattern 1). This is a diagnostic tool—it shows whether your corpus, chunking, and base search are working. If this doesn’t work, nothing built on top of it will work either.

Do you need all 24 patterns?

No. You choose the patterns based on error types. If your system doesn’t struggle with temporal errors (old vs. current rules), you don’t need to implement the temporal RAG. But if the need for auditing is strong, the governance layer is not optional.

What is the difference between Self-RAG and Corrective RAG?

Self-RAG decides in advance: “Should I even search?”—it evaluates the need at a metacognitive level. Corrective RAG detects the issue after the fact: “The search yielded a bad result”—and searches again to correct it. The former is proactive; the latter is reactive. You need both.

How do I start building an enterprise RAG?

Naive RAG → hybrid search → routing → reranking → citation → governance. In that order. Measure the error rate at every step—and make the next layer the one that eliminates the most common error mode.



Zoltán Varga - LinkedIn Neural • Knowledge Systems Architect | Enterprise RAG architect PKM • AI Ecosystems | Neural Awareness • Consciousness & Leadership The model is neon. The stack is the mind. Build the layers — or hallucinate.

Strategic Synthesis

  • Identify which current workflow this insight should upgrade first.
  • Set a lightweight review loop to detect drift early.
  • Close the loop with one retrospective and one execution adjustment.
