Series · Amazon Bedrock for Production AI · Part 2 of 8 · RAG with Bedrock Knowledge Bases
Most production Bedrock work is RAG
If you build five production Bedrock workloads in 2026, four of them will be retrieval-augmented generation in some shape. The agent that answers customer questions from a policy archive. The internal copilot that reasons over the company's documentation. The compliance assistant that retrieves regulatory text. The data-room agent for due-diligence work. The technical-support agent that grounds answers in product documentation. In every case, the same pipeline: documents on one end, an answer with citations on the other.
The pipeline has five architectural gates: source storage, chunking, embedding, vector storage, retrieval. Each gate is a real engineering decision with real cost and quality consequences. Most production RAG failures we see in the field are not failures of the model — they are failures of one of these five gates upstream of the model. The model is asked to reason over the wrong context, retrieved by the wrong method, indexed at the wrong granularity, with embeddings produced by the wrong model. The Claude session looks competent; the answer is wrong.
This piece is the working playbook for those five gates on Amazon Bedrock, and the multi-model topology that makes the whole pipeline cost-defensible in production.
The reference architecture
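The shape of a Bedrock-backed RAG pipeline, in order: source documents in S3, chunking, embedding, vector storage, retrieval, and finally synthesis by Claude over the retrieved chunks, with citations.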
Two things the diagram makes explicit:
- The pipeline is multi-model by design. The embedding model is not the reasoning model. Cohere Embed v3 or Amazon Titan Embeddings v2 produces the vectors at ingestion and at query time; Claude 4.x reasons over the retrieved chunks. The cost ratio is roughly 20-50× per token between embedding and reasoning. Claude does not expose an embeddings API, and even if it did, paying reasoning-model rates to embed an entire corpus would be economically indefensible.
- The vector store is the architectural decision with the longest tail. Switching embedding models means re-embedding the corpus (real cost, manageable). Switching vector stores means re-ingesting everything, re-indexing, re-tuning hybrid search, re-validating the retrieval harness (real cost, often months of work). Choose deliberately.
Gate 1 — Source storage and document preparation
Bedrock Knowledge Bases ingest from Amazon S3 (and increasingly from Web Crawlers, Confluence, Salesforce, SharePoint, and custom data sources). The S3 path is the production default.
What lives in the bucket matters as much as what's in the documents. The defensible posture:
- One bucket per domain, not per agent. Mixing customer-facing FAQs and internal engineering runbooks in the same bucket is the simplest way to surface internal documents to external users. Per-domain buckets with per-domain IAM policies are the perimeter.
- KMS Customer Managed Keys, Object Lock, versioning. The corpus is sensitive by default. CMK encryption gives per-bucket key rotation; Object Lock gives auditable retention; versioning gives rollback when a bad ingestion replaces a good document.
- Source format discipline. PDFs are the worst format for RAG (poor layout reconstruction, OCR errors on scans, tables badly parsed). Markdown, HTML, and DOCX retrieve better. Where PDFs are unavoidable, pre-process through Amazon Textract or Unstructured.io before landing in the source bucket — turn the PDF into clean markdown plus a JSON metadata sidecar.
- Metadata sidecars per document. A `<filename>.metadata.json` file alongside each document declares attributes: `source`, `published_date`, `classification`, `language`, `owner_team`. The Knowledge Base ingestion respects these and exposes them as filterable retrieval metadata. Without metadata you cannot do filtered retrieval, which means you cannot do per-tenant or per-language or per-recency queries.
The work at this gate is one-time per corpus, but it determines the ceiling on every downstream gate's quality.
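A sidecar for one document could be generated along these lines. This is a minimal sketch, not the series' own tooling: the helper names are hypothetical, the top-level `metadataAttributes` key reflects the sidecar format as I understand it from the Bedrock documentation, and storing `published_date` as a `YYYYMMDD` integer is an assumption made so that range filters can compare it numerically.

```python
import json
import tempfile
from pathlib import Path


def build_sidecar_payload(**attributes) -> dict:
    """Wrap document attributes in the top-level key the sidecar file uses."""
    return {"metadataAttributes": dict(attributes)}


def write_metadata_sidecar(document_path: Path, **attributes) -> Path:
    """Write `<filename>.metadata.json` next to the document and return its path."""
    sidecar_path = document_path.with_name(document_path.name + ".metadata.json")
    payload = build_sidecar_payload(**attributes)
    sidecar_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return sidecar_path


if __name__ == "__main__":
    # Demonstration against a throwaway directory instead of a real bucket sync.
    with tempfile.TemporaryDirectory() as tmp:
        doc = Path(tmp) / "refund-policy.md"
        doc.write_text("# Refund policy\n", encoding="utf-8")
        sidecar = write_metadata_sidecar(
            doc,
            source="policy-archive",      # hypothetical attribute values
            published_date=20240115,      # assumption: YYYYMMDD integer
            classification="public",
            language="en",
            owner_team="customer-support",
        )
        print(sidecar.name)
        print(sidecar.read_text(encoding="utf-8"))
```

Generating sidecars in the same pre-processing step that lands the document keeps the two files from drifting apart.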
Gate 2 — Chunking strategy
A 200-page document is not a unit of retrieval. The chunking strategy splits documents into the units the embedding model can vectorise and the retrieval layer can return. Bedrock Knowledge Bases support four strategies; the choice matters enormously for retrieval quality.
| Strategy | What it does | When it's right |
|---|---|---|
| Fixed-size | Splits into N tokens (default 300) with M token overlap (default 60) | Default for most homogeneous text corpora. Predictable, fast, well-understood. |
| Hierarchical | Two-level chunking: small chunks for retrieval, large parent chunks returned to the model | Documents with deep hierarchy (legal contracts, regulatory text, long-form documentation). Retrieval is precise; context is rich. |
| Semantic | Splits at semantic boundaries detected by an embedding model | Heterogeneous corpora where fixed-size cuts mid-thought. Higher ingestion cost; better answer quality. |
| None (raw) | Treats each document as one chunk | Only for tiny documents where the whole thing fits in the model's context. Rare in production. |
The defensible default for unfamiliar corpora is fixed-size 300/60. Tune from there based on the retrieval harness output (more on the harness below). Hierarchical chunking is the right move for legal, regulatory, or policy-heavy corpora where the parent chunk's context disambiguates the child chunk's meaning. Semantic chunking is worth the ingestion cost when fixed-size obviously fragments meaning (e.g., a knowledge base of cooking recipes where instructions and ingredients run together).
Two operational rules:
- Chunk size is not a "set it and forget it" parameter. Re-chunk and re-embed when the retrieval harness's hit rate falls below the threshold the use case requires. Quarterly re-evaluation against the harness is the cadence.
- Overlap matters for boundary continuity. Zero overlap loses meaning at chunk boundaries; excessive overlap inflates the corpus size linearly. 15-25% overlap is the working band.
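To make the 300/60 default concrete, here is a small sketch of fixed-size chunking with overlap. It is illustrative only: it operates on a pre-tokenised list (whitespace splitting is a stand-in for the embedding model's real tokenizer), and `chunk_tokens` is a hypothetical helper, not what the managed service runs internally.

```python
def chunk_tokens(tokens: list, max_tokens: int = 300, overlap: int = 60) -> list:
    """Split a token list into windows of max_tokens that share `overlap` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must satisfy 0 <= overlap < max_tokens")

    step = max_tokens - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


if __name__ == "__main__":
    # Whitespace tokens stand in for real tokenizer output.
    document = " ".join(f"w{i}" for i in range(1000)).split()
    chunks = chunk_tokens(document)
    print(len(chunks), [len(c) for c in chunks])
    print(f"overlap ratio: {60 / 300:.0%}")
```

With these defaults each window advances 240 tokens, so 60 of every 300 tokens (20%) are stored twice, which sits inside the 15-25% band.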
Gate 3 — Embedding model selection
The embedding model converts each chunk into a vector. The choice has three real consequences: vector dimension (which determines storage cost), per-token embedding cost (which determines ingestion cost), and retrieval quality (which determines whether the right chunks come back for the model to reason over).
The three credible options on Bedrock:
| Model | Dimension | Strengths | Trade-off |
|---|---|---|---|
| Cohere Embed English v3 | 1024 | Strongest English retrieval quality in the catalogue; supports query-vs-document distinction (different vectors for queries vs documents); built-in re-ranker available | English-first; multilingual variant exists but is weaker than the English version |
| Cohere Embed Multilingual v3 | 1024 | Strong across 100+ languages; right default for Nigerian / pan-African / EU multilingual corpora | Slightly weaker than English-v3 on English-only corpora |
| Amazon Titan Embeddings v2 | 256 / 512 / 1024 (configurable) | AWS-native; flexible dimensions for cost tuning; multilingual; competitive on most benchmarks | Marginal quality gap vs Cohere on English retrieval benchmarks |
The defensible choice: Cohere Embed v3 for production retrieval quality, Titan Embeddings v2 for cost-sensitive workloads where the dimension flexibility justifies the marginal quality trade. Where Nigerian / African / European multilingual support matters, Cohere Multilingual v3 is the option.
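The storage side of the dimension trade-off is simple arithmetic. The sketch below assumes vectors are stored as 32-bit floats and counts raw vector bytes only; index structures such as HNSW graphs, replicas, and metadata add overhead on top, so treat the result as a lower bound. The function name and the corpus size are hypothetical.

```python
def raw_vector_storage_gib(num_chunks: int, dimension: int, bytes_per_value: int = 4) -> float:
    """Raw storage for the vectors alone, in GiB (float32 assumed by default)."""
    return num_chunks * dimension * bytes_per_value / 2**30


if __name__ == "__main__":
    corpus_chunks = 1_000_000  # hypothetical corpus size
    for dim in (256, 512, 1024):
        size = raw_vector_storage_gib(corpus_chunks, dim)
        print(f"{dim:>4} dims: {size:.2f} GiB")
```

Raw storage scales linearly with dimension, so moving from 1024 to 256 dimensions cuts it by a factor of four. Whether retrieval quality holds at the smaller size is a question for the retrieval harness.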
The reasoning model is not the embedding model. Worth restating: Claude does not provide an embeddings API on Bedrock. The reasoning step is Claude; the embedding step is Cohere or Titan. This separation is what makes the pipeline cost-defensible.
Gate 4 — Vector store selection
Vector store choice has the longest reversal cost of any decision in the pipeline. The matrix:
| Store | Best for | What it costs you |
|---|---|---|
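| Amazon OpenSearch Serverless | Default for AWS-native RAG; managed; integrates natively with Bedrock KB; hybrid search built-in (BM25 + k-NN); indexing and search capacity scale independently | Higher per-OCU pricing than self-managed; a minimum OCU floor that bills even when the collection is idle (a baseline of several hundred dollars per month; verify the current minimum and per-OCU rate for your Region) |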
| pgvector on Aurora / RDS Postgres | When the application already has Postgres; when SQL filtering of vectors matters; when scale is moderate (< 10M vectors) | More operational overhead than serverless; manual tuning of HNSW or IVF indexes; query latency increases above ~5M vectors |
| Pinecone | When the team has Pinecone expertise; serverless tier is cost-effective for small-to-medium workloads; strong managed UX | External vendor (not AWS-native); data egress considerations for sensitive workloads; vendor lock-in |
| MongoDB Atlas Vector Search | When the application already uses MongoDB; native integration with document data | Atlas-only (not local MongoDB); per-cluster pricing model |
| Redis Enterprise Cloud | When sub-millisecond retrieval matters (real-time recommendation, conversational latency-critical apps) | Higher per-vector cost than alternatives; not the default for typical RAG workloads |
The defensible AWS-native default is OpenSearch Serverless, with pgvector on Aurora as the cheaper alternative for moderate-scale workloads where the application is already on Postgres. Pinecone is the right answer when the team has existing Pinecone investment and is willing to accept the cross-cloud data flow.
The honest rule: choose the store the team can actually operate. A perfectly-tuned Pinecone deployment beats a misconfigured OpenSearch every time; a well-operated OpenSearch Serverless beats a mismanaged pgvector cluster every time. Operational competence dominates theoretical fit.
Gate 5 — Retrieval mechanics
Retrieval is more than nearest-neighbour vector search. Three layers stack:
Hybrid search — lexical plus vector
Pure vector search misses queries where the right answer has a specific named entity, code identifier, or product SKU. Pure lexical (BM25) search misses queries where the answer is phrased semantically differently from the question. Hybrid search runs both, weighted, and merges results.
Bedrock Knowledge Bases on OpenSearch Serverless do hybrid out of the box — the default retrieval call returns BM25 + vector merged with a configurable weight. The right weight is corpus-dependent: 0.5/0.5 is the default; corpora with rich named entities (technical docs, legal text) skew toward 0.6/0.4 in favour of lexical; conversational corpora skew the other way. Tune via the retrieval harness.
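To show what a lexical/vector weight means in practice, here is a sketch of one common fusion approach: min-max normalise each score list, then take a weighted sum. This illustrates the idea only. It is not a description of how OpenSearch or Bedrock merges results internally, and the function names and sample scores are hypothetical.

```python
def min_max_normalise(scores: dict) -> dict:
    """Rescale scores to [0, 1] so BM25 and cosine scores become comparable."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        return {chunk_id: 1.0 for chunk_id in scores}
    return {chunk_id: (s - low) / (high - low) for chunk_id, s in scores.items()}


def hybrid_merge(lexical: dict, vector: dict, lexical_weight: float = 0.5, top_k: int = 5) -> list:
    """Merge two {chunk_id: score} maps with a weighted sum of normalised scores."""
    if not 0.0 <= lexical_weight <= 1.0:
        raise ValueError("lexical_weight must be between 0 and 1")
    lex, vec = min_max_normalise(lexical), min_max_normalise(vector)
    combined = {
        chunk_id: lexical_weight * lex.get(chunk_id, 0.0)
        + (1.0 - lexical_weight) * vec.get(chunk_id, 0.0)
        for chunk_id in set(lex) | set(vec)
    }
    # Highest combined score first; chunk ID breaks ties deterministically.
    ranked = sorted(combined.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


if __name__ == "__main__":
    bm25 = {"sku-table": 10.0, "returns-faq": 2.0}        # hypothetical scores
    cosine = {"returns-faq": 0.9, "shipping-faq": 0.5}
    print(hybrid_merge(bm25, cosine, lexical_weight=0.6))
    print(hybrid_merge(bm25, cosine, lexical_weight=0.4))
```

Shifting the weight from 0.6 to 0.4 flips which chunk ranks first in this toy example, which is why the setting should be tuned against the harness and not by intuition.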
Re-ranking
After the first-pass retrieval (top-k chunks), a re-ranker scores each candidate against the query more precisely. Cohere Rerank is the production-grade option available on Bedrock; it runs over the top-N candidates (typically 25-100) and returns the top-K (typically 3-10) in a refined order. The re-ranker is a different model from the embedder — it sees the query and document together and produces a relevance score, which is more accurate than the cosine similarity the first-pass returns.
The cost: a Cohere Rerank call costs roughly the same as 25-100 embedding calls. The benefit: retrieval precision often jumps 10-30% on hard queries. For most production RAG, re-ranking is worth the cost; the workloads where it isn't are those where retrieval is already near-perfect or those where latency budget makes the extra call infeasible.
Filtered retrieval
The metadata sidecars from Gate 1 are the filter inputs. A query like "what's our refund policy for 2024 purchases" benefits from a filter like `published_date >= 2024-01-01 AND classification = 'public'`. Filters apply before the vector search, so they shrink the candidate set the vector index has to score, which both lowers cost and improves precision.
Three filters that are nearly always worth having: classification (public / internal / confidential — drives per-tenant access control at retrieval time), language (cuts the multilingual corpus to the user's language), and freshness (last-modified date for time-sensitive queries).
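The refund-policy filter above could be expressed as a retrieval configuration roughly like this. The sketch only builds the dictionary. The operator names (`andAll`, `equals`, `greaterThanOrEquals`) and the `vectorSearchConfiguration` shape follow the Knowledge Bases Retrieve API as I understand it and should be checked against the current API reference. The helper name is hypothetical, and the date is assumed to be stored as a `YYYYMMDD` integer.

```python
def build_retrieval_configuration(min_published_date: int, classification: str, top_k: int = 5) -> dict:
    """Build a filtered, hybrid retrieval configuration for a Knowledge Base query."""
    return {
        "vectorSearchConfiguration": {
            "numberOfResults": top_k,
            "overrideSearchType": "HYBRID",
            "filter": {
                "andAll": [
                    # Assumption: published_date is stored as a YYYYMMDD number.
                    {"greaterThanOrEquals": {"key": "published_date", "value": min_published_date}},
                    {"equals": {"key": "classification", "value": classification}},
                ]
            },
        }
    }


if __name__ == "__main__":
    config = build_retrieval_configuration(20240101, "public")
    print(config)
    # With boto3 this would be passed along the lines of:
    #   client = boto3.client("bedrock-agent-runtime")
    #   client.retrieve(knowledgeBaseId="<your-kb-id>",
    #                   retrievalQuery={"text": "refund policy for 2024 purchases"},
    #                   retrievalConfiguration=config)
```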
The reasoning step — where Claude lives
After retrieval, the top-K chunks are stitched into the model's context window along with the user's query and the system prompt. This is where Claude does the actual reasoning. Three operational notes:
- Citations are mandatory. The prompt instructs Claude to return each factual claim with the chunk ID it came from. The Knowledge Base API returns `retrievalResults[].location` per chunk; pass these IDs into the prompt and require them in the response. Users see citations; auditors see traceability. Hallucinations become detectable because they have no citation.
- Context window discipline. Claude 4.x has very long context windows (200K+ tokens). The temptation is to stuff 50 chunks. The right move is the smallest defensible context: the top 3-5 chunks after re-ranking. More context dilutes attention and inflates per-query cost. Use the long context for the single hard document, not for piles of mediocre ones.
- Synthesis prompt structure. A working pattern: brief system prompt with the role; explicit instruction to cite per claim; instruction to say "I don't know" when retrieved context doesn't support an answer; retrieved chunks as numbered references; user query last. The exact prompt is tunable; the structure is durable.
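The prompt structure and the citation check described above can be sketched as follows. The wording of the system prompt, the `[n]` citation convention, and both function names are assumptions for illustration. The chunk dictionaries are a simplified stand-in for whatever the retrieval call actually returns.

```python
import re

SYSTEM_PROMPT = (
    "You answer questions using only the numbered references provided. "
    "Cite the supporting reference after every factual claim, like [1]. "
    "If the references do not support an answer, say \"I don't know\"."
)


def build_synthesis_prompt(query: str, chunks: list) -> tuple:
    """Return (system, user) prompts: numbered references first, the question last."""
    references = "\n\n".join(
        f"[{number}] (source: {chunk['source']})\n{chunk['text']}"
        for number, chunk in enumerate(chunks, start=1)
    )
    user_prompt = f"References:\n{references}\n\nQuestion: {query}"
    return SYSTEM_PROMPT, user_prompt


def invalid_citations(answer: str, num_chunks: int) -> list:
    """Return cited reference numbers that do not match any supplied chunk."""
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_chunks)


if __name__ == "__main__":
    chunks = [  # hypothetical retrieved chunks
        {"source": "s3://policies/refunds.md", "text": "Refunds are issued within 14 days."},
        {"source": "s3://faqs/returns.md", "text": "Return shipping is free."},
    ]
    system, user = build_synthesis_prompt("How long do refunds take?", chunks)
    print(user)
    print(invalid_citations("Refunds take 14 days [1]. Returns are free [4].", len(chunks)))
```

A citation number that points at no supplied chunk is a cheap first signal of a fabricated claim. Checking that a valid citation actually supports the claim still needs the faithfulness evaluation.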
The multi-model topology — why this works
The pipeline is genuinely multi-model. Per query, four steps, three of which are model invocations:
- Query embedding — Cohere Embed v3 or Titan v2 — converts the user's query into a vector for retrieval. Cheap, fast.
- Document retrieval — vector store internal — no model invocation here; pure vector arithmetic.
- Re-ranking — Cohere Rerank — scores top-N candidates. Mid-cost.
- Synthesis — Claude 4.x Sonnet (default) — reasons over the retrieved chunks and produces the answer. The expensive step.
For most production RAG, Claude Sonnet is the right tier. Claude Haiku is sufficient for simple factual-retrieval workloads (FAQ lookup, single-fact answers). Claude Opus is overkill except for legal, financial, or scientific synthesis where the cost is justified by stakes.
The cascade pattern, applied to RAG:
| Query type | Routing decision |
|---|---|
| Simple factual lookup | Haiku · single retrieval call, no re-ranking |
| Standard knowledge query | Sonnet · re-ranking on · top 3 chunks |
| Hard synthesis (multi-document, comparative analysis) | Opus · re-ranking on · top 5 chunks · extended thinking enabled |
A router function (often a small LLM call to Haiku itself) classifies the incoming query and routes accordingly. The savings over running every query through Opus are an order of magnitude on the per-query cost without measurable quality regression on the simple-query class. This is the pattern Part 7 of this series unpacks in depth.
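The routing table can be sketched as a small lookup. In this sketch a keyword heuristic stands in for the classifier (in production that step would typically be the small LLM call mentioned above), the marker lists are hypothetical, and `top_k=3` for the simple route is an assumption because the table does not specify it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model_tier: str
    rerank: bool
    top_k: int
    extended_thinking: bool


ROUTES = {
    # top_k for "simple" is an assumption; the routing table leaves it open.
    "simple": Route("haiku", rerank=False, top_k=3, extended_thinking=False),
    "standard": Route("sonnet", rerank=True, top_k=3, extended_thinking=False),
    "hard": Route("opus", rerank=True, top_k=5, extended_thinking=True),
}

# Hypothetical heuristics standing in for an LLM-based classifier.
HARD_MARKERS = ("compare", "versus", " vs ", "difference between", "trade-off")
SIMPLE_OPENERS = ("what is", "what's", "when", "who", "where")


def classify_query(query: str) -> str:
    """Classify a query as 'simple', 'standard', or 'hard'."""
    text = query.lower().strip()
    if any(marker in text for marker in HARD_MARKERS):
        return "hard"
    if len(text.split()) <= 8 and text.startswith(SIMPLE_OPENERS):
        return "simple"
    return "standard"


def route_for(query: str) -> Route:
    """Look up the retrieval and model settings for a query."""
    return ROUTES[classify_query(query)]


if __name__ == "__main__":
    for q in ("What is the refund window?",
              "How should agents handle a partial refund on a bundled order?",
              "Compare the 2023 and 2024 refund policies."):
        print(q, "->", route_for(q))
```

Keeping the routes in data, separate from the classifier, means the classifier can be swapped without touching the retrieval code.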
Evaluation — the RAG harness
You cannot tune any of the above without a retrieval-quality harness. The shape of one:
- A golden set of 100-500 query/answer pairs from real users (or curated to look like real users). Each pair includes the question, the expected answer, and the source documents the answer should come from.
- Per-query metrics: hit rate (did the right document appear in top-K retrieval?), rank (where in the top-K did the right document rank?), faithfulness (does the model's answer match the source documents — measured by an LLM-as-judge over a sample), citation accuracy (do the citations in the answer point to the chunks that actually support each claim?).
- Aggregated weekly during the tuning phase, monthly in production steady-state. The trendline matters more than any single number — retrieval quality should improve as the harness drives the tuning, and degradations should trigger alerts.
Without this harness, every RAG decision (chunking, embedding model, vector store, weights, top-K, re-ranking on/off) is guesswork. With it, decisions are evidence-driven and the model behind the agent is doing the job it's good at.
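The two retrieval-side metrics, hit rate and rank, are straightforward to compute once the golden set exists. The sketch below reports rank as mean reciprocal rank (MRR), which is one common way to aggregate it and is my choice here, not something the harness description prescribes. The function name and the input shape are hypothetical.

```python
def hit_rate_and_mrr(results: list, k: int = 5) -> tuple:
    """Compute hit rate@k and mean reciprocal rank@k.

    `results` holds one (retrieved_ids, expected_ids) pair per golden query:
    retrieved_ids is the ranked retrieval output and expected_ids is the set
    of source documents the answer should come from.
    """
    if not results:
        raise ValueError("results must not be empty")
    hits = 0
    reciprocal_rank_sum = 0.0
    for retrieved_ids, expected_ids in results:
        for rank, doc_id in enumerate(retrieved_ids[:k], start=1):
            if doc_id in expected_ids:
                hits += 1
                reciprocal_rank_sum += 1.0 / rank
                break  # only the first relevant document counts
    return hits / len(results), reciprocal_rank_sum / len(results)


if __name__ == "__main__":
    golden_run = [  # hypothetical harness output for three queries
        (["d1", "d2", "d3"], {"d1"}),
        (["d4", "d5", "d6"], {"d6"}),
        (["d7", "d8"], {"d9"}),
    ]
    hit_rate, mrr = hit_rate_and_mrr(golden_run, k=3)
    print(f"hit rate@3 = {hit_rate:.2f}, MRR@3 = {mrr:.2f}")
```

Faithfulness and citation accuracy need a judge model and are not covered by this sketch.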
Production failure modes worth knowing
Five patterns we see often enough to call out:
- The "looks right but isn't" answer. Retrieval brought back chunks that are topically adjacent but factually wrong for the specific question. The answer reads fluently because Claude is fluent; the answer is wrong because the chunk was wrong. Fix: re-ranking on, citations required, faithfulness eval.
- The empty corpus on a fresh query. A new product or new policy lands in the corpus but the embeddings haven't been refreshed. The agent confidently retrieves the old policy and answers from it. Fix: scheduled re-ingestion + change-detection sync from the source system + freshness metadata as a default filter.
- The cross-tenant leak. Customer A's data appears in Customer B's retrieved chunks because the metadata filter wasn't applied. Fix: classification metadata as a mandatory filter at retrieval time — enforced at the API layer, not the prompt layer.
- The long-document fragmentation. A 50-page contract is chunked at fixed 300-token boundaries; retrieval brings back a chunk from page 17 with no context. Fix: hierarchical chunking with parent-chunk return.
- The PDF-OCR garbage. A scanned PDF's text is OCR'd badly, the embeddings are nonsense, retrieval returns nothing relevant. Fix: pre-process PDFs through Textract or Unstructured.io before ingestion; metadata-flag scanned documents for manual review.
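For the cross-tenant leak, "enforced at the API layer" can be as simple as a wrapper that always ANDs a mandatory classification filter onto whatever the caller supplies. This is a sketch under assumptions: the `in` and `andAll` operators follow the Knowledge Bases retrieval filter shape as I understand it, and the function name and the idea of passing the caller's allowed classifications from the auth layer are hypothetical.

```python
from typing import Optional


def enforce_mandatory_filter(caller_filter: Optional[dict], allowed_classifications: list) -> dict:
    """Return a filter that always restricts results to the caller's classifications."""
    if not allowed_classifications:
        # Fail closed: no entitlement means no retrieval, never an unfiltered query.
        raise ValueError("caller has no allowed classifications")
    mandatory = {"in": {"key": "classification", "value": list(allowed_classifications)}}
    if caller_filter is None:
        return mandatory
    # The mandatory clause is combined with AND, so caller input cannot widen access.
    return {"andAll": [mandatory, caller_filter]}


if __name__ == "__main__":
    requested = {"equals": {"key": "language", "value": "en"}}
    print(enforce_mandatory_filter(requested, ["public"]))
    print(enforce_mandatory_filter(None, ["public", "internal"]))
```

Because the entitlement comes from the authenticated session and not from the prompt, a prompt-injection attempt cannot remove it.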
A working Knowledge Base configuration
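The shape of a defensible production KB, written as schematic YAML. The keys are illustrative and map onto whichever Terraform or CDK constructs the team uses: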

    knowledgeBase:
      name: company-policy-kb
      embeddingModel: cohere.embed-english-v3
      vectorStore:
        type: opensearch-serverless
        collectionName: policy-vectors
        encryptionPolicy: aws/kms-cmk-policy-vectors
        networkPolicy: aws/private-vpc-only
      dataSource:
        s3Configuration:
          bucketArn: arn:aws:s3:::company-policy-source-prod
          inclusionPrefixes: ["policies/", "faqs/"]
        chunkingConfiguration:
          strategy: HIERARCHICAL
          hierarchical:
            levelConfigurations:
              - level: 1
                maxTokens: 1500  # parent chunk returned to the model
              - level: 2
                maxTokens: 300   # child chunk used for retrieval
        ingestionSchedule:
          type: scheduled
          cron: "cron(0 2 * * ? *)"  # 02:00 UTC daily
      retrieval:
        hybridSearch: true
        bm25Weight: 0.4
        vectorWeight: 0.6
        rerankingModel: cohere.rerank-v3
        defaultTopK: 5

The pieces that matter:
- Cohere Embed v3 for embeddings (production-quality English) with hierarchical chunking (right for policy text)
- OpenSearch Serverless with KMS CMK and private-VPC-only network policy
- Hybrid search with re-ranking on, top-K = 5 (small, after re-rank)
- Daily ingestion schedule for freshness; manual re-ingestion on demand for change events
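The hierarchical chunking settings translate into the data-source ingestion configuration roughly as below. The sketch builds the dictionary only. The field names follow the Bedrock `CreateDataSource` API as I understand it and should be verified against the current reference. The helper name is hypothetical, and `overlap_tokens=60` is an assumed value that the configuration above does not specify.

```python
def hierarchical_ingestion_config(parent_max_tokens: int = 1500,
                                  child_max_tokens: int = 300,
                                  overlap_tokens: int = 60) -> dict:
    """Build a vector ingestion configuration using hierarchical chunking."""
    if child_max_tokens >= parent_max_tokens:
        raise ValueError("child chunks must be smaller than parent chunks")
    return {
        "chunkingConfiguration": {
            "chunkingStrategy": "HIERARCHICAL",
            "hierarchicalChunkingConfiguration": {
                # First entry is the parent level, second the child level.
                "levelConfigurations": [
                    {"maxTokens": parent_max_tokens},
                    {"maxTokens": child_max_tokens},
                ],
                "overlapTokens": overlap_tokens,  # assumption: not set in the config above
            },
        }
    }


if __name__ == "__main__":
    config = hierarchical_ingestion_config()
    print(config)
    # With boto3 this would be passed along the lines of:
    #   client = boto3.client("bedrock-agent")
    #   client.create_data_source(knowledgeBaseId="<your-kb-id>", name="policy-docs",
    #                             dataSourceConfiguration={...},
    #                             vectorIngestionConfiguration=config)
```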
This Knowledge Base then wires into a Bedrock Agent (as documented in Part 1) by referenced ID. The agent gets the retrieval capability without separate retrieval code; the KB can be updated independently of the agent.
What's next
Part 2 documented the retrieval layer. Part 3 picks up the orchestration layer: when Bedrock Agents' managed loop is the right answer, when AgentCore + Strands is, and when LangChain or LlamaIndex earns its complexity. Each path has its own RAG integration pattern; the KB built here works against all of them.
The full series:
- Part 1 — Foundations: Building AI Agents on Amazon Bedrock
- Part 2 — RAG with Bedrock Knowledge Bases (this piece)
- Part 3 — Open-source Agent Frameworks on Bedrock
- Part 4 — Model Customization on Amazon Bedrock
- Part 5 — Multi-step AI Workflows with Step Functions and Bedrock
- Part 6 — Security Guardrails and Observability for Bedrock
- Part 7 — Cost Optimization on Bedrock (deepest treatment of multi-model routing)
- Part 8 — Case Study: An SRE AI Agent on Bedrock for CloudWatch Log Triage
The Amazon Bedrock series accompanies the Hardening-before-AWS series and the AWS-for-banks architecture series. Both substrates assume the security and identity foundations are in place; this series builds the AI workload on top.
