The naive pipeline that everyone builds first
A team gets a new AI feature on the roadmap. The first version of the pipeline is almost always the same shape. Take whatever the user is asking about — a transcript, a document, a customer history, a codebase — drop the whole thing into the prompt, append the question, send it to the model, return whatever comes back. It works in the demo. It works on the first few real inputs. Then it stops working, and the team cannot tell why.
The failure modes are predictable once you have seen them a few times. The context window fills with material that has nothing to do with the question, and the model latches onto whatever sounds most relevant — which is not always what is actually relevant. Latency climbs because every request ships tens of thousands of tokens. Cost climbs faster than usage. Hallucinations appear not because the model is bad, but because the model is being asked to find a needle in a haystack while pretending the haystack is the answer.
The fix is architectural, not a matter of prompt engineering. The systems that hold up in production retrieve the relevant context first, and only then ask the model to reason over what was retrieved. The principle is short enough to put on a wall: retrieve first, reason second.
Why naive prompting fails at scale
The temptation to stuff everything into a prompt is rational on day one. Context windows keep growing. A million-token window feels like enough room for any document. Why not just send the whole thing?
Three reasons, and they compound.
Token economics. Every input token is billed, and output tokens are billed on top of that. A pipeline that sends a 400,000-token transcript to answer a thirty-token question is paying for a more than ten-thousand-fold mismatch between the work and the bill. On low-volume internal tools, nobody notices. On a product with real traffic, the AWS console becomes a daily anxiety.
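To put rough numbers on that mismatch, here is a back-of-the-envelope sketch. The token counts come from the example above; the price per million input tokens and the daily request volume are placeholder assumptions chosen for illustration, not actual Bedrock pricing.

```python
# Back-of-the-envelope input cost for the naive pipeline.
# The price and request volume below are assumed placeholders, not real pricing.
TRANSCRIPT_TOKENS = 400_000
QUESTION_TOKENS = 30
PRICE_PER_MILLION_INPUT_TOKENS = 3.00  # USD, hypothetical
REQUESTS_PER_DAY = 1_000               # hypothetical traffic


def input_cost(tokens: int, price_per_million: float) -> float:
    """Cost in USD of sending `tokens` input tokens at the given price."""
    return tokens / 1_000_000 * price_per_million


# How many context tokens are shipped per token of actual question.
mismatch = TRANSCRIPT_TOKENS / QUESTION_TOKENS

cost_per_request = input_cost(
    TRANSCRIPT_TOKENS + QUESTION_TOKENS, PRICE_PER_MILLION_INPUT_TOKENS
)
cost_per_day = cost_per_request * REQUESTS_PER_DAY

print(f"context-to-question ratio: {mismatch:,.0f}x")
print(f"input cost per request:    ${cost_per_request:.2f}")
print(f"input cost per day:        ${cost_per_day:,.2f}")
```

Swap in your own model's price and your own traffic; the shape of the result matters more than the exact figures.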
Attention dilution. A model can technically process a million tokens. That does not mean it weights them well. Long-context models are demonstrably worse at recall in the middle of the context than at the edges — the well-known "lost in the middle" problem. The more irrelevant content surrounds the answer, the more the model's effective attention is spread, and the lower the probability that it pulls the right detail out.
Hallucination surface area. A model asked a precise question over a broad corpus will sometimes invent a precise answer. It is not lying — it is pattern-matching across loosely related material and synthesizing what feels coherent. Narrowing the input is the cheapest, most reliable way to narrow the surface area for invention.
The naive pipeline trades engineering effort for compute. That trade looks favorable when traffic is in the dozens of requests per day. It inverts hard once usage grows, accuracy starts to matter, and someone in finance asks why the inference bill doubled.
The architecture: a retrieval layer between the user and the model
The pattern is two layers separated by a clean interface.
The retrieval layer owns "what is relevant to this question." It receives the user's query, expands it into something useful for search (an embedding, a set of filters, sometimes a rewritten version of the question), and pulls back a small number of high-quality chunks from a vector store. It is dumb in the right way: it does not try to answer, it tries to find.
The reasoning layer owns "what does this material say about the question." It receives the user's original query plus the retrieved chunks, and produces a structured answer grounded in that material. It is smart in the right way: it does not need to remember the entire corpus, because the retrieval layer has already done the remembering.
Concretely, on AWS, the components line up like this:
- Embeddings come from Titan or Cohere via Bedrock. The transcript or document is chunked once at ingestion, each chunk is embedded, and the vectors are stored.
- Vector storage and search live in OpenSearch (or pgvector for smaller corpora). Each vector carries metadata: source ID, timestamps, document type, section, access permissions.
- Reasoning is Claude on Bedrock, receiving the retrieved chunks and a prompt that says, in effect, "answer the question using only the material below, and cite where each claim comes from."
The interface between the layers is a list of chunks, each carrying its metadata. Nothing else crosses the boundary. This matters because it lets the two layers evolve independently: you can swap embedding models, change reranking strategies, or upgrade the reasoning model without breaking the contract.
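One way to picture that contract is as two small interfaces with a list of chunks passed between them. The sketch below is illustrative only: `Chunk`, `Retriever`, `Reasoner`, `KeywordRetriever`, and `EchoReasoner` are hypothetical names, the retriever is a toy word-overlap ranker standing in for a vector store, and the reasoner is a stub standing in for the model call.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class Chunk:
    chunk_id: str
    text: str
    metadata: dict = field(default_factory=dict)


class Retriever(Protocol):
    def retrieve(self, query: str, top_k: int) -> list[Chunk]: ...


class Reasoner(Protocol):
    def answer(self, query: str, chunks: list[Chunk]) -> str: ...


class KeywordRetriever:
    """Toy stand-in for a vector store: ranks chunks by words shared with the query."""

    def __init__(self, chunks: list[Chunk]) -> None:
        self._chunks = chunks

    def retrieve(self, query: str, top_k: int) -> list[Chunk]:
        words = set(query.lower().split())
        ranked = sorted(
            self._chunks,
            key=lambda c: len(words & set(c.text.lower().split())),
            reverse=True,
        )
        return ranked[:top_k]


class EchoReasoner:
    """Toy stand-in for the model call: reports which chunks it was given."""

    def answer(self, query: str, chunks: list[Chunk]) -> str:
        ids = ", ".join(c.chunk_id for c in chunks)
        return f"Answer to {query!r} grounded in: {ids}"


def run_pipeline(query: str, retriever: Retriever, reasoner: Reasoner, top_k: int = 2) -> str:
    # The only thing passed from retrieval to reasoning is the list of chunks.
    chunks = retriever.retrieve(query, top_k)
    return reasoner.answer(query, chunks)


corpus = [
    Chunk("c1", "refund policy allows returns within 30 days"),
    Chunk("c2", "shipping takes five business days"),
    Chunk("c3", "refund requests need an order number"),
]
print(run_pipeline("what is the refund policy", KeywordRetriever(corpus), EchoReasoner()))
```

Because `run_pipeline` depends only on the two protocols, either side can be replaced (a real vector search for the retriever, a real model call for the reasoner) without touching the other.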
What this buys you in practice
The shift from "send everything" to "retrieve, then reason" produces effects that are easy to underestimate in advance.
Costs collapse. A retrieval-based pipeline typically sends 2,000 to 8,000 tokens to the model per request, regardless of how large the underlying corpus grows. The bill scales with the number of questions asked, not with the size of the knowledge base. We have seen features re-architected from naive to retrieval-based, with no change in functionality, drop 80 to 95 percent of their per-request cost.
Accuracy goes up, not down. This surprises people. The intuition is that giving the model less context should produce worse answers. The opposite happens, because what the model receives is dense with signal rather than diluted with noise. The retrieval layer's job is to filter, and filtering is exactly what improves attention.
Latency stops scaling with corpus size. Vector search returns results in milliseconds, even at hundreds of thousands of chunks. Model inference time is dominated by output length, not by the size of the corpus the retrieval layer searched against. A pipeline that takes 1.4 seconds at 10,000 documents takes 1.6 seconds at 500,000.
Citations become possible. Because the reasoning layer only sees retrieved chunks, you can require it to cite which chunk each claim came from. The retrieval metadata travels with the result, so the final answer can carry source IDs, page numbers, or video timestamps. This is the difference between an AI feature that legal accepts and one they block.
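A minimal sketch of how the citation requirement might be enforced, assuming a convention where each chunk is labelled with its ID in square brackets and the model is asked to repeat that label after each claim. The function names and the bracket format are illustrative choices, not a fixed standard.

```python
import re


def build_grounded_prompt(question: str, chunks: dict[str, str]) -> str:
    """Assemble a prompt that exposes only the retrieved chunks, labelled by ID."""
    material = "\n".join(f"[{cid}] {text}" for cid, text in chunks.items())
    return (
        "Answer the question using only the material below, and cite the "
        "chunk ID in square brackets after each claim.\n\n"
        f"Material:\n{material}\n\n"
        f"Question: {question}"
    )


def cited_ids(answer: str) -> set[str]:
    """Extract every bracketed chunk ID that appears in the model's answer."""
    return set(re.findall(r"\[([A-Za-z0-9_-]+)\]", answer))


def unknown_citations(answer: str, chunks: dict[str, str]) -> set[str]:
    """Return cited IDs that were never retrieved, a sign of an invented source."""
    return cited_ids(answer) - set(chunks)


retrieved = {
    "c1": "Returns are accepted within 30 days of delivery.",
    "c2": "Refunds are issued to the original payment method.",
}
prompt = build_grounded_prompt("What is the return window?", retrieved)
model_answer = "Returns are accepted within 30 days [c1]. Shipping is free [c9]."
print(unknown_citations(model_answer, retrieved))
```

An answer that cites an ID outside the retrieved set can be rejected or flagged before it reaches the user, and the valid IDs can be mapped back to page numbers or timestamps through the chunk metadata.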
Debugging gets dramatically easier. When a naive pipeline gives a wrong answer, the team has nowhere to look except the prompt. When a retrieval-based pipeline gives a wrong answer, you can inspect the chunks that were retrieved. If the right chunk was retrieved and the model still answered wrong, the problem is in reasoning. If the right chunk was not retrieved, the problem is in the index or the embedding. Two failure modes, two different fixes — and you can tell which is which in seconds.
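The triage described above is simple enough to write down. This sketch assumes you have a labelled evaluation case: you know which chunk should have been retrieved and whether the final answer was correct. The function name and return labels are hypothetical.

```python
def triage(retrieved_ids: list[str], expected_id: str, answer_correct: bool) -> str:
    """Classify a single evaluation case by where the failure occurred."""
    if answer_correct:
        return "ok"
    if expected_id in retrieved_ids:
        # The right material reached the model and it still answered wrong.
        return "reasoning"
    # The right material never reached the model: look at the index,
    # the embedding, or the chunking.
    return "retrieval"


print(triage(["c4", "c7"], expected_id="c7", answer_correct=False))
print(triage(["c4", "c5"], expected_id="c7", answer_correct=False))
```

Running this over a small set of known question and chunk pairs gives a quick count of how many failures belong to each layer.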
The chunking decision that breaks most implementations
The retrieval layer is only as good as its chunks, and chunking is where most implementations quietly fail.
Fixed-size chunking (split every 512 tokens) is the easiest thing to build and the worst thing to live with. It will split a single thought across a chunk boundary, embed half of an explanation in one vector and half in another, and then retrieve only one half, or neither, when the question arrives. The model gets a fragment, asks for context that is not there, and the failure looks like a model problem when it is actually an indexing problem.
The patterns that hold up in production share three properties.
Chunks respect semantic boundaries. A paragraph, a section, a clause, a transcript segment that covers one topic. The chunker uses the structure of the source — headings, paragraph breaks, speaker turns, table rows — rather than counting tokens.
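As a rough illustration of boundary-aware chunking, the sketch below packs whole paragraphs into chunks and never cuts inside one. It assumes paragraphs are separated by blank lines and uses word count as a crude stand-in for token count; a real chunker would use the tokenizer of the embedding model and richer structure such as headings or speaker turns.

```python
def chunk_by_paragraph(text: str, max_words: int = 120) -> list[str]:
    """Pack whole paragraphs into chunks of at most roughly `max_words` words.

    A paragraph is never split: one that exceeds the budget on its own
    becomes a single oversized chunk rather than being cut mid-thought.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        n_words = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_len + n_words > max_words:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += n_words
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "a b c\n\nd e f\n\ng h i j"
print(chunk_by_paragraph(sample, max_words=6))
```

The budget here is a soft ceiling: the chunk boundary always falls on a paragraph break, which is the property that matters.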
Chunks overlap, modestly. A small overlap (15 to 25 percent) between adjacent chunks gives the retrieval layer redundancy. A concept mentioned at the boundary of one chunk is still present at the start of the next. This is the cheapest hedge against bad boundary decisions you will ever buy.
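A sliding window is one common way to get that overlap. The sketch below works over pre-split units (sentences or transcript segments, for example) rather than raw tokens; the function name and the unit granularity are assumptions made for illustration.

```python
def overlapping_windows(units: list[str], size: int, overlap: int) -> list[list[str]]:
    """Group units into windows of `size`, each sharing `overlap` units with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    windows: list[list[str]] = []
    for start in range(0, len(units), step):
        windows.append(units[start:start + size])
        # Stop once a window has reached the end, to avoid a redundant tail window.
        if start + size >= len(units):
            break
    return windows


sentences = [f"s{i}" for i in range(10)]
# One shared unit out of five is a 20 percent overlap.
for window in overlapping_windows(sentences, size=5, overlap=1):
    print(window)
```

With ten units, a window of five, and an overlap of one, the last unit of each window reappears as the first unit of the next, so a concept sitting on a boundary is embedded in both.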
Chunks carry metadata that the model never sees but the retrieval layer uses. Source document, page number, section heading, document type, video timestamp, last modified date, access permissions. The retrieval layer filters on this metadata before doing vector search — an HR question only searches HR documents, a question about a specific video only searches that video's chunks — and the reduction in candidate space is where retrieval precision is actually won.
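The filter-then-search order can be shown with a small in-memory sketch. This is not how OpenSearch or pgvector are called; it is a pure-Python stand-in with toy two-dimensional vectors, and the index layout and the `search` function are hypothetical. In a real vector store the same idea is usually expressed as a filter attached to the similarity query.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(index: list[dict], query_vector: list[float], filters: dict, top_k: int = 3) -> list[dict]:
    """Apply exact-match metadata filters first, then rank the survivors by similarity."""
    candidates = [
        item for item in index
        if all(item["metadata"].get(key) == value for key, value in filters.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda item: cosine(item["vector"], query_vector),
        reverse=True,
    )
    return ranked[:top_k]


index = [
    {"id": "hr-1", "vector": [1.0, 0.0], "metadata": {"doc_type": "hr"}},
    {"id": "eng-1", "vector": [0.9, 0.1], "metadata": {"doc_type": "engineering"}},
    {"id": "hr-2", "vector": [0.0, 1.0], "metadata": {"doc_type": "hr"}},
]

# eng-1 is the closest vector overall, but the HR filter removes it before ranking.
hits = search(index, query_vector=[0.9, 0.1], filters={"doc_type": "hr"}, top_k=1)
print([hit["id"] for hit in hits])
```

The same mechanism carries access permissions: a chunk the user is not allowed to see is excluded from the candidate set before similarity is ever computed.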
For video and audio, the chunk schema is just an extension of this idea. A transcript chunk carries start and end timestamps alongside its text. The retrieval layer returns the chunks; the reasoning layer reads them; the response carries the timestamps through to the user as a clip range. The shape of the chunk is doing more work than the shape of the prompt.
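A sketch of what such a chunk schema might look like, with hypothetical field names. Merging the retrieved chunks into a single clip range is one possible design choice; returning one clip per chunk would be equally valid.

```python
from dataclasses import dataclass


@dataclass
class TranscriptChunk:
    video_id: str
    start_s: float  # start of the segment, in seconds
    end_s: float    # end of the segment, in seconds
    text: str


def clip_range(chunks: list[TranscriptChunk]) -> tuple[float, float]:
    """Merge retrieved chunks from one video into a single (start, end) clip."""
    if not chunks:
        raise ValueError("no chunks to build a clip from")
    if len({c.video_id for c in chunks}) != 1:
        raise ValueError("chunks must all come from the same video")
    return min(c.start_s for c in chunks), max(c.end_s for c in chunks)


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS for display."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


retrieved = [
    TranscriptChunk("v1", 95.0, 130.0, "the pricing change takes effect in March"),
    TranscriptChunk("v1", 125.0, 160.0, "existing customers keep their current rate"),
]
start, end = clip_range(retrieved)
print(f"clip: {format_timestamp(start)} to {format_timestamp(end)}")
```

The timestamps never need to pass through the model; they ride along with the chunks and are attached to the answer after reasoning is done.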
When retrieve-first does not apply
Like any pattern, retrieve-first has a domain of usefulness. There are cases where dumping everything into the prompt is correct.
The input is small and bounded. A single email, a short customer message, a 300-line code diff. There is no haystack. Retrieval would be ceremony.
The task is generative, not extractive. Writing marketing copy, brainstorming names, summarizing a single document the user uploaded for this specific request. There is no corpus to retrieve from.
The model needs to reason across the whole input simultaneously. Cross-document comparison where the question is "find the contradictions between these five documents." Retrieval picks pieces — but the question is about the whole.
The rule of thumb is the same one architects use for caching: the moment your input grows faster than the question, you want retrieval. The moment retrieval becomes more code than it saves, you do not.
The principle that survives every model upgrade
What makes retrieve-first durable is that it survives changes in the underlying model. When Claude's context window grew from 100K to 200K to 1M, retrieval-based systems did not need to change. The retrieval layer kept doing its job, the reasoning layer got a more capable model, costs stayed flat, and accuracy went up.
The teams that bet on "the next model will have a big enough window to just send everything" have been right about the window and wrong about the economics. The window grew. The cost per token did not fall fast enough to make naive prompting cheap. The latency profile of large-context inference stayed worse than the latency profile of retrieve-then-reason. The bet that the architecture would become unnecessary did not pay off.
LLMs are reasoning engines, not databases. The minute you start treating them like databases — by stuffing context into prompts and hoping the model remembers — you are paying for one thing and getting another. The architecture that respects that distinction is the one that scales. Retrieve first. Reason second. Cite always. That is the shape of an AI feature that ships, holds up, and pays for itself.
