Field Note · April 26, 2026 · 8 min read

Enterprise RAG is Not Demo RAG

RAG is not a feature you add. It is a system you operate. Here are the five things that separate the prototype that wowed the executive team from the production deployment that's still working in eighteen months.

Every mid-market AI initiative we engage with has a RAG component now. The pattern is consistent: the team built a demo, indexed some documents, asked it questions, watched it answer plausibly, and tried to ship it. Within months, the deployment has plateaued. Adoption is dropping. Legal has questions. The team running it is afraid to change anything because they have no way to detect whether their changes will make it better or worse.

The deeper point is this: RAG is not a feature you add. It is a system you operate. Treating it as a feature produces the demos that got companies excited about AI in the first place. Treating it as a system is what produces the deployments that are still working in eighteen months.

This is not a small distinction. Industry analysis of 2026 enterprise deployments puts the RAG failure rate at over 70%. Most failures are not technology problems. They are operating model problems — the prototype shipped without the architecture around it that enterprise reality demands. Five things separate enterprise RAG from demo RAG, and each is independently capable of breaking a deployment.

1. Permissions are not a query-time afterthought

The first place demo RAG meets enterprise reality is permissions. The demo retrieves over a single index where every user sees every document. This is fine in a sandbox. It is catastrophic in production, where a finance VP must not see HR documents, where the West region must not see East region pricing, where a contractor must not see anything beyond their scope.

The instinct is to add a permissions filter at query time. This works for the simplest cases. It fails the moment your source systems have permissions models more complex than a flat access list — which is to say, immediately. SharePoint has site-level, library-level, item-level, and inherited permissions. Google Drive has shared-with, shared-with-group, organization-wide, and link-with-permission sharing. Confluence has space restrictions, page restrictions, and inherited overrides. Each of these has to be captured during indexing, refreshed when the source changes, and applied at retrieval time.
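In practice that looks something like the sketch below: permissions are flattened onto each chunk at indexing time and trimmed before ranking at retrieval time. This is a minimal illustration, assuming the connector can resolve each item's effective users and groups; the field names and ACL shape are placeholders, not any particular product's API.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    doc_id: str
    text: str
    # Effective principals (users + groups) flattened from the source ACL
    # at indexing time, and refreshed whenever the source permissions change.
    allowed_principals: set[str] = field(default_factory=set)

def index_chunk(doc_id: str, text: str, source_acl: dict) -> Chunk:
    # `source_acl` stands in for whatever the connector returns; the point is
    # that permissions are captured here, not reconstructed at query time.
    principals = set(source_acl.get("users", [])) | set(source_acl.get("groups", []))
    return Chunk(doc_id=doc_id, text=text, allowed_principals=principals)

def permission_trim(candidates: list[Chunk], user: str, groups: set[str]) -> list[Chunk]:
    # Applied at retrieval time, before ranking, so a finance VP never sees
    # HR chunks no matter how semantically relevant they are.
    principals = {user} | groups
    return [c for c in candidates if c.allowed_principals & principals]
```

The trim runs on every query, which is why group membership has to stay in sync with the directory rather than being snapshotted once at launch.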

This is the single most under-budgeted layer of every RAG project we evaluate. It is also the most legally consequential. The demo does not show you this problem. The first audit does. Building it correctly from the start is dramatically cheaper than retrofitting it after a permissions incident.

2. Chunking is a content-type decision, not a parameter

In the demo, chunking is a single setting: split each document into 500-token pieces with some overlap. In production, chunking is a content-type-aware decision that varies by source.

A policy document with structured sections chunks differently than a meeting transcript. A wiki page with heading hierarchy chunks differently than a PDF with embedded tables. A code file should not be chunked by token count at all — it should be chunked by function or module boundary, or it produces useless retrievals.
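One way to express that decision is a small dispatcher keyed on content type. This is a sketch with illustrative defaults, not tuned values: headings for structured documents, speaker turns for transcripts, top-level definitions for code.

```python
import re

def chunk_by_headings(text: str) -> list[str]:
    # Keep each section of a structured document (policy, wiki page) intact.
    sections = re.split(r"\n(?=#{1,3} )", text)
    return [s.strip() for s in sections if s.strip()]

def chunk_by_speaker_turns(text: str, turns_per_chunk: int = 8) -> list[str]:
    # Transcripts group by speaker turns, not raw token count.
    turns = [line for line in text.splitlines() if line.strip()]
    return ["\n".join(turns[i:i + turns_per_chunk])
            for i in range(0, len(turns), turns_per_chunk)]

def chunk_by_definition(text: str) -> list[str]:
    # Code splits on function/class boundaries, never mid-function.
    parts = re.split(r"\n(?=(?:def |class ))", text)
    return [p for p in parts if p.strip()]

CHUNKERS = {
    "policy": chunk_by_headings,
    "wiki": chunk_by_headings,
    "transcript": chunk_by_speaker_turns,
    "code": chunk_by_definition,
}

def chunk(text: str, content_type: str) -> list[str]:
    return CHUNKERS.get(content_type, chunk_by_headings)(text)
```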

A clinical decision support study published in 2025 found that adaptive chunking achieved 87% accuracy versus 13% for fixed-size baselines on the same corpus. That is not a marginal difference. It is the difference between a system that works and one that does not.

Get chunking wrong and the symptom is subtle: retrieval returns the right document but the wrong chunk, or the answer is spread across three chunks and only one of them gets retrieved. The user gets a confidently-worded wrong answer. The demo never shows this because the demo chunks five PDFs and asks questions about specific paragraphs.

3. Hybrid search is not optional

Pure vector search — the default in most demo systems — is wrong often enough that it is not a production strategy.

Vector search excels at semantic similarity. It struggles with proper nouns, product codes, version numbers, error codes, and any content where the user's query is the exact string they are looking for. A user asking "what is the policy on EXP-447 returns" wants documents containing the literal string EXP-447. Vector search will return documents that are semantically similar to "policy on returns" and may rank the document containing EXP-447 below documents that do not contain it. This is wrong. It is also unfixable by tuning the embedding model.

The fix is hybrid search: semantic vector retrieval combined with traditional keyword search, then merged and ranked by a dedicated reranking model. Both retrievers run in parallel; the union is reranked by a model purpose-built to assess relevance against the original query. This is now standard practice across enterprise deployments, and the reranking step matters at least as much as the retrieval step. Skipping it is a common shortcut that produces a system returning plausibly-relevant documents instead of the right document.
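The shape of that pipeline, sketched below with the vector store, keyword index, and reranker left as placeholder interfaces; the candidate counts and names are assumptions for illustration, not a specific product's API.

```python
from typing import Protocol

class Hit(Protocol):
    chunk_id: str
    text: str

class Retriever(Protocol):
    def search(self, query: str, top_k: int) -> list[Hit]: ...

class Reranker(Protocol):
    def score(self, query: str, hits: list[Hit]) -> list[tuple[Hit, float]]: ...

def hybrid_retrieve(query: str, vector_store: Retriever, keyword_index: Retriever,
                    reranker: Reranker, k: int = 8) -> list[Hit]:
    semantic = vector_store.search(query, top_k=50)   # embedding similarity
    keyword = keyword_index.search(query, top_k=50)   # BM25 / literal strings like "EXP-447"
    # Union the two candidate sets, deduplicated by chunk id.
    pool = {h.chunk_id: h for h in semantic + keyword}
    # A purpose-built reranking model scores every candidate against the
    # original query; this step matters as much as retrieval itself.
    scored = reranker.score(query, list(pool.values()))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [hit for hit, _ in scored[:k]]
```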

4. Evaluation is the production system

This is the layer that most distinguishes serious RAG operators from everyone else.

Demo RAG is evaluated by asking it questions and reading the answers. This is not evaluation. This is vibes. Production RAG requires a measurable evaluation harness with curated test sets, quality metrics, and regression detection — and that harness has to run continuously, not just once at launch.

The metrics that matter for enterprise deployments are well established at this point. Faithfulness measures whether the answer reflects the retrieved context or hallucinates beyond it. Answer relevance measures whether the answer actually addresses the question asked. Retrieval precision measures whether the chunks returned were actually relevant. Each of these can be measured automatically and tracked as a time series. Industry guidance now sets faithfulness thresholds at 0.85 or higher for customer-facing applications and 0.9 or higher for regulated industries — meaning answers should be supported by retrieved context at least 85 to 90 percent of the time.
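A minimal harness along those lines can be a few dozen lines of Python. In the sketch below, judge is a stand-in for whatever LLM-as-judge or classifier call the deployment actually uses, and the test-case fields are illustrative.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class EvalCase:
    question: str
    answer: str
    retrieved_chunks: list[str]
    retrieved_chunk_ids: list[str]
    relevant_chunk_ids: set[str]   # curated ground truth for this test case

def judge(prompt: str) -> float:
    # Placeholder: swap in an LLM-as-judge call or trained classifier
    # that returns a score in [0, 1].
    return 1.0

def faithfulness(case: EvalCase) -> float:
    # Is the answer supported by the retrieved context, or does it go beyond it?
    return judge(f"Context: {case.retrieved_chunks}\nAnswer: {case.answer}\nSupported?")

def answer_relevance(case: EvalCase) -> float:
    # Does the answer address the question that was actually asked?
    return judge(f"Question: {case.question}\nAnswer: {case.answer}\nOn topic?")

def retrieval_precision(case: EvalCase) -> float:
    # What fraction of the returned chunks were actually relevant?
    hits = sum(1 for cid in case.retrieved_chunk_ids if cid in case.relevant_chunk_ids)
    return hits / max(len(case.retrieved_chunk_ids), 1)

def run_suite(cases: list[EvalCase]) -> dict[str, float]:
    # One datapoint per metric per run, appended to a time series.
    return {
        "faithfulness": mean(faithfulness(c) for c in cases),
        "answer_relevance": mean(answer_relevance(c) for c in cases),
        "retrieval_precision": mean(retrieval_precision(c) for c in cases),
    }
```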

The first time a model upgrade, a prompt change, a corpus update, or a chunking strategy revision degrades any of these metrics, the team should know within a day. Without continuous evaluation, model drift goes undetected for weeks. By the time users complain, trust is already gone — and trust is asymmetric. Lose it once, and it takes ten right answers to rebuild.
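The regression check itself is simple once the metrics exist. In this sketch, the faithfulness floor follows the guidance above; the other floors and the regression margin are illustrative placeholders a team would set for itself.

```python
# Faithfulness floor per the guidance above (0.9 for regulated industries);
# the other floors and the 0.05 regression margin are illustrative choices.
FLOORS = {"faithfulness": 0.85, "answer_relevance": 0.80, "retrieval_precision": 0.75}
REGRESSION_MARGIN = 0.05

def check_run(current: dict[str, float], baseline: dict[str, float]) -> list[str]:
    # Run after every model upgrade, prompt change, corpus update,
    # or chunking revision; any alert should reach the team the same day.
    alerts = []
    for metric, value in current.items():
        floor = FLOORS.get(metric)
        if floor is not None and value < floor:
            alerts.append(f"{metric} {value:.2f} is below its floor of {floor:.2f}")
        previous = baseline.get(metric)
        if previous is not None and value < previous - REGRESSION_MARGIN:
            alerts.append(f"{metric} regressed from {previous:.2f} to {value:.2f}")
    return alerts
```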

The honest version: most of the RAG systems mid-market teams describe to us in conversation have no evaluation harness. The team is afraid to change anything because they have no way to detect regression. The system has been frozen at its launch state for nine months while the world has moved on.

5. The corpus is alive

Demo RAG indexes a snapshot. Production RAG indexes a moving target.

Documents change. New documents arrive. Old documents get archived. Permissions shift as people change roles. A policy document gets superseded by a new one, but the old version is still in the corpus, still gets retrieved, still produces answers based on outdated rules.

The architecture has to handle this: continuous ingestion with change detection at the source rather than periodic full re-indexes; soft delete of removed documents with audit retention; versioning awareness so superseded content is demoted in favor of its replacement; deduplication so near-identical documents do not all come back and dilute the result; and freshness signals in the metadata so the model can prefer recent content when the question is time-sensitive.
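A minimal sketch of the change-detection and soft-delete side, using a content hash to decide whether anything needs re-chunking and re-embedding; the field names and return values are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib

@dataclass
class IndexedDoc:
    doc_id: str
    content_hash: str
    last_seen: datetime                # freshness signal for time-sensitive queries
    superseded_by: str | None = None   # versioning awareness
    deleted: bool = False              # soft delete, retained for audit

def sync_document(index: dict[str, IndexedDoc], doc_id: str, raw_text: str) -> str:
    # Incremental sync driven by change detection, not periodic full re-indexes.
    digest = hashlib.sha256(raw_text.encode("utf-8")).hexdigest()
    now = datetime.now(timezone.utc)
    existing = index.get(doc_id)
    if existing is None:
        index[doc_id] = IndexedDoc(doc_id, digest, now)
        return "indexed"        # new document: chunk, embed, store
    if existing.content_hash != digest:
        existing.content_hash = digest
        existing.last_seen = now
        return "updated"        # content changed: re-chunk and re-embed
    existing.last_seen = now
    return "unchanged"

def mark_removed(index: dict[str, IndexedDoc], doc_id: str) -> None:
    # Soft delete: excluded from retrieval but retained for audit.
    if doc_id in index:
        index[doc_id].deleted = True
```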

Each of these is its own engineering problem. None is solved by the typical demo. All of them surface in production within months and produce the same symptom: the AI is confidently citing outdated information.

What this means for an integration assessment

When we run a Phase 1 Integration Assessment with a client who has an existing RAG deployment, the assessment is largely an audit against these five patterns. We map the permissions architecture against the source systems, audit the chunking strategy against the content mix, look at whether hybrid retrieval is in place, ask to see the evaluation harness, and check how the corpus stays current.

The findings are usually consistent. The system was built as a demo, shipped as production, and has degraded steadily since launch. Adoption is dropping because the answers are getting worse. No one has a measurement framework that tells them how much worse, or why. The fix is rarely a tooling change. It is an architectural one, and it requires rebuilding the system around the production patterns from the start.

The build layer of RAG is becoming commoditized — Cursor and Claude Code can scaffold a basic RAG pipeline in a couple of hours. The integration layer, the operations layer, the governance layer — none of those are commoditizing. That is where the work is. That is where we focus.


Iron Pine helps mid-market companies integrate AI into how they actually operate — grounded in your data, embedded in your workflows, adopted by your people, and operated with production discipline.

Talk to us about an Integration Assessment · Try the AI Health Check
