Series · Amazon Bedrock for Production AI · Part 7 of 8: Cost Optimization
The cost surface is opaque by default
A team's first Bedrock bill at the end of the first month usually carries one of two flavours of surprise. Either it's smaller than expected — the team has not yet scaled the agent to real production volume and the bill reflects pilot traffic. Or it's larger than expected — the agent went into production, a small percentage of queries hit the longest context windows, and the bill is dominated by the heavy tail rather than the median query.
Either way, the team doesn't know which features, which agents, which prompts, or which workflows are driving the spend. The cost surface is opaque by default. Making it transparent is the first move in cost discipline; the optimisations that follow only matter if the team can attribute spend.
This piece is the deepest treatment of the Claude-first / multi-model routing thread that runs through Parts 1–6. The patterns: tag-driven attribution, the cascade routing pattern in operational depth, prompt and response caching, batch inference for non-real-time workloads, and the provisioned-throughput-vs-on-demand decision tree.
Token economics — what each model costs you, structurally
The first thing to internalise is that Bedrock pricing is per-token, with input tokens and output tokens priced differently (output is typically 4–5× more expensive than input), and per model. The rough relationships across the catalogue, as of mid-2026 pricing:
| Model | Relative input cost | Relative output cost | When it's the right tier |
|---|---|---|---|
| Claude Haiku 4.5 | 1× (baseline) | 1× (baseline) | High-volume / low-complexity: routing, simple classification, factual lookup, summarisation, structured extraction |
| Claude Sonnet 4.6 | ~3-4× Haiku | ~3-4× Haiku | Default production reasoning tier: agent loops, RAG synthesis, code generation, multi-step planning |
| Claude Opus 4.7 | ~12-15× Haiku | ~12-15× Haiku | Hardest tasks only: complex multi-step analysis, novel synthesis, legal / financial / scientific where stakes justify the cost |
| Llama 4 Maverick | ~0.6× Haiku | ~0.6× Haiku | Cost-sensitive workloads where Haiku's quality isn't required |
| Llama 4 Scout | ~0.3× Haiku | ~0.3× Haiku | Very high volume, simple classification |
| Titan Embeddings v2 | ~0.05× Haiku per 1M tokens | (no output) | Embeddings (no output cost — embeddings are vectors, not generated text) |
| Cohere Embed v3 | ~0.05× Haiku per 1M tokens | (no output) | Embeddings — production default per Part 2 |
| Cohere Rerank v3 | ~0.5× Haiku per call | (per-document scored) | Re-ranking the top-N RAG candidates |
(The numbers are illustrative ratios. Pull the current official Bedrock pricing page for the absolute values before quoting them in a proposal.)
What this matrix forces you to internalise: the difference between routing a query to Haiku versus Opus is roughly 12-15×. A workload that runs everything through Opus is paying an order of magnitude more than a workload that routes intelligently. The architectural skill is matching tier to task.
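To make the tier gap concrete, here is a minimal sketch that prices one query in Haiku-equivalent units. The multipliers are midpoints of the illustrative ratios in the table, not real prices, and `OUTPUT_PREMIUM` is an assumed midpoint of the 4–5× output premium. All names are hypothetical.

```python
# Illustrative multipliers relative to Haiku (midpoints of the table's ranges).
TIER_MULTIPLIER = {"haiku": 1.0, "sonnet": 3.5, "opus": 13.5}
OUTPUT_PREMIUM = 4.5  # assumption: output tokens cost ~4-5x input tokens


def relative_query_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one query, expressed in 'Haiku input token' units."""
    weighted_tokens = input_tokens + OUTPUT_PREMIUM * output_tokens
    return TIER_MULTIPLIER[tier] * weighted_tokens


# Same query shape, two different tiers.
haiku_cost = relative_query_cost("haiku", 2_000, 500)
opus_cost = relative_query_cost("opus", 2_000, 500)
print(f"Opus / Haiku cost ratio: {opus_cost / haiku_cost:.1f}x")
```

Because the token counts are identical, the ratio collapses to the tier multiplier. That is the point: for a fixed query shape, the tier choice alone sets the cost.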
The cascade routing pattern, in depth
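The cascade pattern (introduced in Part 2 and threaded through every subsequent piece) is the single highest-leverage cost optimisation available. The mechanics: a cheap classifier labels each incoming query, a routing table maps that label to a model tier and its parameters, and low-confidence responses escalate to the next tier up.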
The three operational decisions inside the cascade:
1. The router itself
The classifier is typically a small model — Claude Haiku is the default. Cost per routing decision: roughly $0.001 per query. The classifier's task is narrow: take the incoming query and return one of three labels (simple / standard / complex). A 50-word system prompt and a structured output format is sufficient.
```python
def classify_complexity(query: str) -> str:
    # Agent comes from the agent framework in use (import not shown here).
    classifier = Agent(
        model="anthropic.claude-haiku-4-5-20251001",
        instruction=(
            "Classify the query complexity. Reply with exactly one word: "
            "'simple' (factual lookup), 'standard' (single-domain reasoning), "
            "or 'complex' (multi-domain synthesis, comparative analysis)."
        ),
    )
    # Normalise the one-word reply so it can be used as a routing-table key.
    return classifier(query).output.strip().lower()
```
The classifier's accuracy doesn't need to be perfect. Even a 70% accurate router on a workload with 60% simple, 30% standard, 10% complex queries can still deliver a meaningful average cost reduction over a "Sonnet for everything" baseline, because most of the simple queries land on the cheapest tier.
2. The routing table
Mapping classifier output to model and parameters:
| Classifier label | Model | Other parameters |
|---|---|---|
| simple | Haiku 4.5 | max_tokens=500, no extended thinking |
| standard | Sonnet 4.6 | max_tokens=2000, re-ranking on for RAG |
| complex | Sonnet 4.6 with extended thinking, or Opus 4.7 if complexity_score > threshold | max_tokens=4000, extended thinking, top-5 RAG chunks |
Most production routing tables have three tiers. Some workloads benefit from a fourth: a "trivial" tier handled by Llama 4 Scout, or by a rules-based response that invokes no model at all (FAQ matching, intent detection from a small set of intents). For workloads with very high simple-query volume, the trivial tier removes most of the remaining cost on those queries.
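One way to keep the routing table reviewable is to hold it as data rather than branching logic. The sketch below assumes hypothetical tier names (`haiku-4.5`, `sonnet-4.6`, `opus-4.7`) in place of real Bedrock model IDs and an `opus_threshold` of 0.8. Falling back to the standard tier on an unrecognised label is a design choice of this example, not something the series prescribes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str
    max_tokens: int
    extended_thinking: bool


# Hypothetical tier names; substitute the real Bedrock model IDs.
ROUTING_TABLE = {
    "simple": Route("haiku-4.5", 500, False),
    "standard": Route("sonnet-4.6", 2000, False),
    "complex": Route("sonnet-4.6", 4000, True),
}


def select_route(label: str, complexity_score: float = 0.0,
                 opus_threshold: float = 0.8) -> Route:
    """Map a classifier label to a route; unknown labels fall back to 'standard'."""
    route = ROUTING_TABLE.get(label, ROUTING_TABLE["standard"])
    # Only the hardest 'complex' queries are promoted to the top tier.
    if label == "complex" and complexity_score > opus_threshold:
        return Route("opus-4.7", route.max_tokens, True)
    return route


print(select_route("simple"))
print(select_route("complex", complexity_score=0.9))
```

The fallback matters because the classifier returns free text: a malformed label should degrade to the default reasoning tier rather than raise in the request path.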
3. Escalation on low confidence
The router can be wrong. The defensive pattern: if the lower-tier model's response carries a confidence signal below threshold, escalate to the next tier. Confidence signals include:
- The model itself returning "I don't know" or "I'm not confident"
- A grounding score from Guardrails below threshold (the response isn't well-supported by retrieved context)
- A self-evaluation step that scores the response on a quality rubric (cheap, on Haiku)
Escalation is a small percentage of total queries but recovers the quality on the cases where the router was wrong. The combined effect: router accuracy doesn't need to be high because escalation catches the misses.
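The escalation step can be sketched as a loop over tiers ordered cheapest-first. This example assumes each tier is a callable returning an answer together with a confidence score between 0 and 1. How that score is produced (refusal detection, grounding score, self-evaluation) is left to the caller, and the 0.7 threshold is a placeholder.

```python
from typing import Callable, Sequence, Tuple

# Each tier is a callable returning (answer, confidence in [0, 1]).
Tier = Callable[[str], Tuple[str, float]]


def answer_with_escalation(query: str, tiers: Sequence[Tier],
                           threshold: float = 0.7) -> Tuple[str, int]:
    """Try tiers cheapest-first; return (answer, index of the tier that answered)."""
    if not tiers:
        raise ValueError("at least one tier is required")
    answer = ""
    for index, tier in enumerate(tiers):
        answer, confidence = tier(query)
        if confidence >= threshold:
            return answer, index
    # No tier cleared the threshold: keep the top tier's answer.
    return answer, len(tiers) - 1


# Stub tiers standing in for real model calls.
def cheap(query: str) -> Tuple[str, float]:
    return "I'm not confident", 0.2


def strong(query: str) -> Tuple[str, float]:
    return "Detailed answer", 0.9


print(answer_with_escalation("why did latency spike?", [cheap, strong]))
```

Returning the tier index alongside the answer makes the escalation rate easy to log, which is the number that shows whether the router or the threshold needs tuning.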
What the cascade actually delivers
For a typical agent workload with the distribution above (60% simple / 30% standard / 10% complex):
- Baseline (Sonnet for everything): 100% × 4× Haiku cost = 4× Haiku-equivalent per query
- Cascade routed: 60% × 1× + 30% × 4× + 10% × 12× = 3.0× Haiku-equivalent per query, plus a small classifier overhead
- Savings: ~25% across the average query
For a workload skewed more heavily toward simple queries (80% simple, 15% standard, 5% complex):
- Baseline: 4× Haiku cost
- Cascade: 80% × 1× + 15% × 4× + 5% × 12× ≈ 2× Haiku-equivalent per query
- Savings: ~50% across the average query
These are conservative — many production workloads see steeper distributions and correspondingly larger savings. The router pays for itself within hours of deployment on any workload with a long tail of simple queries.
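The blended-cost arithmetic is simple enough to keep as a small script and rerun whenever the traffic mix changes. This sketch uses the same illustrative tier costs (1×, 4×, 12× Haiku) and ignores classifier overhead; the function and variable names are hypothetical.

```python
def blended_cost(mix: dict[str, float], tier_cost: dict[str, float]) -> float:
    """Average per-query cost for a traffic mix, in Haiku-equivalent units."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * tier_cost[label] for label, share in mix.items())


TIER_COST = {"simple": 1.0, "standard": 4.0, "complex": 12.0}
BASELINE = 4.0  # Sonnet for everything

for mix in (
    {"simple": 0.60, "standard": 0.30, "complex": 0.10},
    {"simple": 0.80, "standard": 0.15, "complex": 0.05},
):
    cost = blended_cost(mix, TIER_COST)
    print(f"{cost:.2f}x Haiku-equivalent, saving {1 - cost / BASELINE:.0%}")
```

Feeding it the measured label distribution from the router's logs, rather than an estimate, turns the savings claim into a number that can be checked against the bill.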
Prompt caching — the second-largest lever
Bedrock prompt caching (introduced 2024, expanded throughout 2025-26) caches portions of the prompt across invocations. On subsequent invocations the cached portion is read from the cache and charged at a fraction of the standard input rate (typically ~10% of normal input cost).
The use cases where prompt caching dominates:
- Large system prompts: agent system prompts with detailed instructions, long tool descriptions, embedded examples — the entire prompt prefix is identical across every invocation.
- Long context documents: a RAG workflow where the same retrieved chunks are reused across multiple turns; an agent that holds a long-running conversation about a single document.
- Few-shot example sets: when the same set of training examples is included in every prompt for in-context learning.
Configure caching on the Bedrock invocation:
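```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")

# LARGE_SYSTEM_PROMPT and user_query are defined by the calling code.
response = bedrock_runtime.converse(
    modelId="anthropic.claude-sonnet-4-6-20251022",
    system=[
        {"text": LARGE_SYSTEM_PROMPT},
        # cachePoint is its own content block: it caches everything before it
        {"cachePoint": {"type": "default"}},
    ],
    messages=[
        {
            "role": "user",
            "content": [
                {"text": user_query}  # uncached, varies per call
            ],
        }
    ],
)
```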
The savings: for an agent with a 5,000-token system prompt invoked 100,000 times a day, caching converts that recurring 500M tokens/day of input from full input rate to ~10% of input rate. At Sonnet pricing, that's a material per-day saving — easily four-figure-USD-per-day on high-volume workloads.
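A rough sketch of that arithmetic follows. The input price of 3.00 USD per million tokens is a placeholder, not a quoted Bedrock rate, and the calculation assumes every call is a cache hit. It ignores cache-write charges and misses, so treat the result as an upper bound.

```python
def daily_caching_saving(prompt_tokens: int, calls_per_day: int,
                         input_price_per_million: float,
                         cached_rate: float = 0.10) -> float:
    """Daily USD saved when the prompt prefix is read from cache on every call."""
    tokens_per_day = prompt_tokens * calls_per_day
    full_cost = tokens_per_day / 1_000_000 * input_price_per_million
    # Cached reads are billed at `cached_rate` of the normal input price.
    return full_cost * (1 - cached_rate)


# 5,000-token system prompt, 100,000 calls per day, placeholder price.
saving = daily_caching_saving(5_000, 100_000, input_price_per_million=3.00)
print(f"Upper-bound saving: ${saving:,.0f} per day")
```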
Two operational notes:
- Cache TTL is bounded (typically 5 minutes by default; longer-lived caches are configurable per region). High-volume workloads keep the cache warm naturally; low-volume workloads see frequent cache misses and lower hit rates.
- Cache invalidation is by content hash. Any change to the cached prefix invalidates the cache. This is why agent system prompts that change frequently see lower cache hit rates — version-control the system prompt and treat changes as a deliberate event.
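Both notes show up in the cache hit rate, which can be estimated from per-call token usage. The sketch below assumes each response's usage block reports `inputTokens`, `cacheReadInputTokens` and `cacheWriteInputTokens`, and that `inputTokens` counts only uncached input. Verify both assumptions against the current Converse API reference before relying on the number.

```python
def cache_hit_rate(usages: list[dict]) -> float:
    """Share of prompt tokens served from cache across a set of responses."""
    read = sum(u.get("cacheReadInputTokens", 0) for u in usages)
    written = sum(u.get("cacheWriteInputTokens", 0) for u in usages)
    uncached = sum(u.get("inputTokens", 0) for u in usages)
    total = read + written + uncached
    return read / total if total else 0.0


# Hypothetical usage blocks: one cache write followed by three cache reads.
sample = [
    {"inputTokens": 100, "cacheWriteInputTokens": 5000},
    {"inputTokens": 100, "cacheReadInputTokens": 5000},
    {"inputTokens": 100, "cacheReadInputTokens": 5000},
    {"inputTokens": 100, "cacheReadInputTokens": 5000},
]
print(f"cache hit rate: {cache_hit_rate(sample):.0%}")
```

A falling rate right after a deployment usually points at a changed prompt prefix; a low rate on a quiet workload points at the TTL.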
Batch inference — for non-real-time work
For workloads where latency doesn't matter (overnight batch processing, large-corpus enrichment, scheduled analytics), Bedrock Batch Inference prices the same calls at roughly 50% of the on-demand rate. The trade-off: response latency is hours-to-days instead of seconds.
Use cases that fit batch:
- Embedding ingestion for Knowledge Bases — embedding a 10GB corpus of documents is a batch job, not a real-time call
- Document classification at scale — overnight processing of a day's worth of inbound documents
- Synthetic data generation for fine-tuning (per Part 4)
- Periodic re-summarisation of long-running content
- Eval-harness runs against test sets
Batch inference is dispatched via S3 input/output, with the input being a JSONL file of prompts and the output being a JSONL file of completions. The integration with Step Functions (Part 5) is clean — a batch job is a single state in the workflow with a poll-and-wait pattern.
The cost savings are roughly 50% on the model calls themselves. For batch-eligible workloads, the savings are immediate and structural.
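Preparing the JSONL input is mostly serialisation. This sketch assumes one JSON object per line with a `recordId` and a `modelInput` that follows the Anthropic messages request body; confirm that layout against the current Batch Inference documentation. The prompt text, record ID scheme and function name are hypothetical.

```python
import json


def to_batch_record(record_id: str, prompt: str, max_tokens: int = 500) -> str:
    """Serialise one prompt as a single JSONL line for a batch input file."""
    record = {
        "recordId": record_id,
        "modelInput": {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "messages": [{"role": "user", "content": prompt}],
        },
    }
    return json.dumps(record)


documents = ["Invoice 1041 overdue by 30 days", "Password reset request"]
lines = [
    to_batch_record(f"doc-{i:05d}", f"Classify this document: {text}")
    for i, text in enumerate(documents)
]
jsonl_body = "\n".join(lines)  # upload this to the job's S3 input location
print(jsonl_body)
```

Stable record IDs are worth the effort: they let the workflow join completions back to source documents without depending on output order.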
Provisioned throughput vs on-demand — when each makes sense
On-demand pricing is the default — you pay per token, no commitment, scales with demand. Provisioned throughput reserves capacity for a model: you pay a fixed hourly rate for guaranteed throughput, regardless of usage.
The break-even decision tree:
| Workload characteristic | Right pricing model |
|---|---|
| Unpredictable / bursty traffic | On-demand |
| Sustained high-volume traffic with predictable load | Provisioned throughput if the utilisation breakeven is achievable |
| Latency-sensitive (need to bypass on-demand throttling) | Provisioned throughput |
| Custom models (fine-tuned or imported via Custom Model Import) | Provisioned throughput required for some |
| Steady embedding ingestion or batch processing | On-demand (use Batch Inference rates) |
| Real-time agent workloads at predictable volume | Calculate breakeven; often Provisioned throughput above ~50% utilisation |
The breakeven calculation: divide the hourly provisioned-throughput cost by the per-token on-demand cost to get the break-even token volume per hour, then divide that volume by the model units' token throughput per hour to get the break-even utilisation. If sustained traffic is above that threshold for the contract duration, provisioned throughput is cheaper. Below it, on-demand wins.
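As a minimal sketch of that calculation: the figures below are placeholders chosen to make the arithmetic easy to follow, not Bedrock prices or published throughput numbers. A single blended on-demand price is assumed in place of separate input and output rates.

```python
def breakeven_utilisation(hourly_commit_usd: float,
                          on_demand_usd_per_million: float,
                          unit_tokens_per_hour: float) -> float:
    """Fraction of provisioned capacity that must be used for the commitment to pay off."""
    # Tokens per hour at which on-demand spend equals the hourly commitment.
    breakeven_tokens_per_hour = hourly_commit_usd / on_demand_usd_per_million * 1_000_000
    return breakeven_tokens_per_hour / unit_tokens_per_hour


# Placeholder inputs: 40 USD/hour commitment, 5 USD per million tokens on-demand,
# and 16M tokens/hour of provisioned capacity.
threshold = breakeven_utilisation(40.0, 5.0, 16_000_000)
print(f"Provisioned throughput wins above {threshold:.0%} sustained utilisation")
```

A result above 1.0 means the commitment can never pay for itself on cost alone, which leaves latency or custom-model requirements as the only reasons to buy it.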
Provisioned throughput is sold in 1-month or 6-month commitments, with the 6-month carrying a discount. For production workloads with predictable load, 6-month provisioned throughput on Sonnet often makes sense for the cost-stability and latency-guarantee reasons alone.
Cost allocation tags — the prerequisite to optimisation
You cannot optimise what you cannot measure. Tag every Bedrock-consuming resource with at least:
- Workload: the application or product the spend belongs to
- Team: the team responsible for the workload
- Environment: dev / staging / prod
- CostCenter: for chargeback to internal cost centres
- Model: which model tier (this one is the Bedrock-specific one)
For Bedrock specifically, model invocations themselves don't carry tags directly — tags are applied to the consuming resource (the Lambda function, the Step Functions state machine, the API Gateway). The pattern that works:
- Tag the Lambda that calls Bedrock with Workload=customer-support and Model=sonnet
- Tag a different Lambda calling Haiku with Workload=customer-support and Model=haiku
- Cost Explorer then attributes Bedrock spend per tag combination
For workloads using the direct Bedrock task in Step Functions (per Part 5), tag the state machine and use distinct state-machine names per workload, so that the per-state-machine Bedrock charge is attributable.
For workloads using AgentCore, the AgentCore runtime emits per-agent cost metrics natively, making per-agent attribution work without manual tagging.
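Tagging only helps attribution if it is enforced. A deploy-time check along these lines can reject resources with an incomplete tag set. The required keys are the five listed earlier in this section; the function name and the allowed Environment values are assumptions of this sketch.

```python
REQUIRED_TAGS = ("Workload", "Team", "Environment", "CostCenter", "Model")
ALLOWED_ENVIRONMENTS = {"dev", "staging", "prod"}


def tag_problems(tags: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the tag set is complete."""
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    env = tags.get("Environment")
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid Environment: {env}")
    return problems


print(tag_problems({"Workload": "customer-support", "Model": "sonnet"}))
```

Running the check in the deployment pipeline means an untagged resource never reaches the bill in the first place.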
The cost dashboard — what to instrument
The cost-dashboard fields that matter, refreshed daily:
- Per-workload spend, current month and trended over 3 months
- Per-model-tier spend, with the cascade router's effectiveness measurable as Haiku-to-Sonnet ratio
- Cache hit rate, per workload — track it; below 60% is usually a system-prompt-stability problem
- Token volume, broken out by input/output and by model — output token volume often surprises teams when an agent generates verbose responses
- Cost per query, per workload — the unit-economics metric that ties spend to business value
- Anomaly alerts on per-workload spend — a 3× day-over-day jump triggers a page
CloudWatch dashboards built from the metrics in Part 6 cover most of this. Cost Explorer covers the rest, with daily granularity and tag filters.
For finance-grade attribution, AWS Cost and Usage Reports (CUR) deliver per-resource per-hour spend with tags into S3, queryable from Athena or QuickSight. This is the level of detail a CFO will eventually ask for; standing it up early is cheaper than rebuilding it under deadline.
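The day-over-day anomaly rule from the dashboard list is small enough to sketch directly. This example assumes daily spend per workload has already been pulled into a dictionary of ordered lists, oldest first. The data source, the workload names and the function name are hypothetical.

```python
def spend_anomalies(daily_spend: dict[str, list[float]], ratio: float = 3.0) -> list[str]:
    """Workloads whose latest daily spend is at least `ratio` times the previous day's."""
    flagged = []
    for workload, series in daily_spend.items():
        if len(series) < 2 or series[-2] <= 0:
            continue  # not enough history to compare
        if series[-1] >= ratio * series[-2]:
            flagged.append(workload)
    return flagged


history = {
    "customer-support": [120.0, 118.0, 410.0],  # jumped more than 3x
    "doc-classifier": [40.0, 42.0, 45.0],
    "new-pilot": [5.0],                          # too new to evaluate
}
print(spend_anomalies(history))
```

A pure ratio is noisy for very small workloads, so in practice it is usually paired with a minimum absolute spend before anyone is paged.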
What surprises teams in production
Five cost patterns that show up in real Bedrock deployments:
- The verbose-response tax. Output tokens cost 4-5× input tokens. An agent that generates 3,000-token responses when 500 would do is paying 6× what it needs to on the output side. max_tokens is the easiest single dial.
- The runaway agent loop. An agent in a tool-call loop that doesn't terminate cleanly can eat 20-50× a normal turn's tokens. Set hard turn limits; monitor for outliers.
- The dev-traffic-in-prod problem. A development workload using prod model IDs and prod tags inflates the prod cost dashboard. Strict environment tagging is the fix.
- The forgotten provisioned throughput. A team provisions throughput for a workload that never reaches utilisation. Set a quarterly review of all provisioned commitments against actual usage.
- The unbounded context window. Claude's 200K-token context window is a feature, not a target. A RAG workflow that stuffs 50 chunks into context when 5 would do is paying 10× on input cost for diluted attention. Top-K discipline (Part 2) directly drives cost discipline.
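For the runaway-loop case, the hard turn limit can be as simple as a bounded loop around the agent step. This sketch assumes a hypothetical `step` callable that returns a final answer, or `None` while the agent still wants to call tools. The limit of 8 turns is a placeholder.

```python
from typing import Callable, Optional


class TurnLimitExceeded(RuntimeError):
    """Raised when the agent fails to produce a final answer in time."""


def run_agent(step: Callable[[int], Optional[str]], max_turns: int = 8) -> str:
    """Call `step` until it returns a final answer, or stop after max_turns."""
    for turn in range(max_turns):
        result = step(turn)
        if result is not None:
            return result
    raise TurnLimitExceeded(f"no final answer after {max_turns} turns")


# A stub step that never finishes, standing in for a stuck tool-call loop.
try:
    run_agent(lambda turn: None, max_turns=3)
except TurnLimitExceeded as exc:
    print(f"stopped: {exc}")
```

Raising a dedicated exception instead of returning a partial answer keeps the failure visible to monitoring, which is where the outlier detection has to happen.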
The combined picture
When the patterns from the prior pieces compose, the cost structure for a representative production workload becomes:
| Optimisation | Multiplicative effect |
|---|---|
| Baseline (Sonnet for everything, no caching, on-demand) | 1.0× |
| Cascade routing applied | ~0.5–0.75× |
| Prompt caching on system prompt | ~0.85× of the input-token portion (overall ~0.95×) |
| RAG top-K discipline (5 vs 50 chunks) | ~0.6× of the input-token portion (overall ~0.8×) |
| Batch inference where applicable | ~0.5× of the batched portion |
| Provisioned throughput where utilisation justifies | ~0.7× of the steady-state portion |
The combined effect of all of the above on a workload that fits them: roughly 25–35% of the baseline cost. Three-to-four-times cheaper, with no measurable quality regression and often improved latency.
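How the rows compose depends on how much of the workload each one touches. The sketch below multiplies the overall factors from the table. It assumes, purely for illustration, a cascade factor of 0.6, that 30% of spend is batch-eligible, and that 50% sits on provisioned throughput. Those shares are hypothetical, and treating the factors as independent is a simplification.

```python
def portion_factor(multiplier: float, share: float) -> float:
    """Overall cost factor when `multiplier` applies to only `share` of spend."""
    return 1 - share * (1 - multiplier)


factors = {
    "cascade routing": 0.60,                                 # assumed mid-range value
    "prompt caching (overall)": 0.95,
    "RAG top-K (overall)": 0.80,
    "batch on 30% of spend": portion_factor(0.5, 0.30),      # hypothetical share
    "provisioned on 50% of spend": portion_factor(0.7, 0.50),  # hypothetical share
}

combined = 1.0
for name, factor in factors.items():
    combined *= factor
    print(f"{name:<30} {factor:.2f}  running total {combined:.2f}")
```

Under these assumptions the running total lands at roughly a third of baseline; changing the shares to match a real workload is the useful exercise.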
The Claude-first / multi-model routing rule that runs through this series is, at the cost layer, the difference between Bedrock workloads that scale economically and ones that don't.
What's next
Part 8 closes the series with the case study — an SRE AI Agent on Bedrock that observes CloudWatch logs, diagnoses incidents, and executes remediation actions. Every architectural decision documented across Parts 1–7 lands in one worked implementation. The agent is built on Strands + AgentCore, uses Claude Sonnet for the reasoning step and Haiku for the routing layer, reaches into Cohere embeddings for log retrieval, executes through Lambda action tools, runs under Guardrails with event.interrupt() gates for destructive actions, and is fully observable end-to-end.
The full series:
- Part 1 — Foundations: Building AI Agents on Amazon Bedrock
- Part 2 — RAG with Bedrock Knowledge Bases
- Part 3 — Open-source Agent Frameworks on Bedrock
- Part 4 — Model Customization on Amazon Bedrock
- Part 5 — Multi-step AI Workflows with Step Functions and Bedrock
- Part 6 — Security Guardrails and Observability for Bedrock
- Part 7 — Cost Optimization on Bedrock (this piece)
- Part 8 — Case Study: An SRE AI Agent on Bedrock for CloudWatch Log Triage
The cost discipline documented here applies across every prior piece in the series. Without it, the architecture from Parts 1–6 is technically sound but economically vulnerable.
