
Model Customization on Amazon Bedrock: When Prompt Engineering Stops Being Enough

cmdev · 11 min read

Series: Amazon Bedrock for Production AI, Part 4 of 8 (previous: Part 3, Open-source Agent Frameworks; next: Part 5, Step Functions Orchestration)

Most teams reach for fine-tuning two phases too early

The first reflex when a Bedrock-backed agent underperforms is to assume the model needs customization. It usually doesn't. The first three things to try, in order:

  1. Better prompts. Tighten the system prompt, add few-shot examples, restructure the task into smaller steps, use Claude's extended-thinking mode where reasoning depth matters. Prompt iteration is days of work; fine-tuning is weeks of work.
  2. Better RAG. Re-evaluate chunking, embedding model, vector store choice, hybrid search weights, re-ranking. The retrieval failures from Part 2 account for the majority of "the model is bad" diagnoses we see.
  3. Better tool design. Narrower tools with sharper OpenAPI descriptions, fewer redundant tools, more explicit input/output schemas. The agent's tool-call quality is mostly a function of tool design, not model capability.

After those three, customization enters the conversation. The decision is not "do we customize?" — it's "which customization path, on which model, for which task?" — and the honest answer for many use cases is still "none of them."

The honest constraint: Claude is not customisable on Bedrock

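Worth stating at the top, because it shapes everything below: Anthropic does not expose fine-tuning of the current Claude Sonnet and Opus models on Bedrock. (Fine-tuning of Claude 3 Haiku has been offered in limited regions, so check the current model-support list before ruling it out entirely.) Claude is available for inference (and as the production-default reasoning model per Part 1), but for practical purposes it is not a customization target.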

The Bedrock customization options apply to:

  • Llama (Meta) — supervised fine-tuning, continued pre-training
  • Amazon Titan family — supervised fine-tuning, continued pre-training, distillation
  • Cohere Command — supervised fine-tuning
  • Mistral — supervised fine-tuning
  • Your own model via Custom Model Import — any compatible architecture (Llama, Mistral, Falcon, Mixtral variants)

This is not a bug; it is the architectural reality, and it shapes the combined-models pattern this piece builds toward: pinned Claude for the hard reasoning step, and a customised smaller model for the narrow recurring task that doesn't need Claude's depth. Each does what it's good at.

The four customization paths

Customization decision tree:

  1. Don't have enough labelled data → Continued Pre-Training (CPT, unlabelled).
  2. Have hundreds-to-thousands of input/output pairs for a specific task → Supervised Fine-Tuning.
  3. Want Claude-quality on a narrow task at smaller-model cost → Distillation (Claude teacher → Llama/Titan student).
  4. Have an external model not in the Bedrock catalogue → Custom Model Import.

Each path includes an evaluation harness and a rollback gate.
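The decision tree can be sketched as a small function. This is an illustration only: the `Situation` fields, the order of the checks, and the 500-pair threshold are assumptions made for the example, not rules taken from Bedrock.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    """Hypothetical summary of what a team has available."""
    has_external_model: bool = False             # model trained outside the Bedrock catalogue
    wants_teacher_quality_cheaper: bool = False  # large-model quality validated, cost is the issue
    labelled_pairs: int = 0                      # number of input/output pairs on hand
    has_large_unlabelled_corpus: bool = False    # large body of domain text without labels


def choose_path(s: Situation, min_pairs: int = 500) -> str:
    """Return the customization path suggested by the decision tree, or 'none'."""
    if s.has_external_model:
        return "custom-model-import"
    if s.wants_teacher_quality_cheaper:
        return "distillation"
    if s.labelled_pairs >= min_pairs:
        return "supervised-fine-tuning"
    if s.has_large_unlabelled_corpus:
        return "continued-pre-training"
    # No path fits: stay with prompts, RAG, and tool design.
    return "none"
```

Returning "none" as the fallback is deliberate: it keeps "don't customize" as a first-class outcome rather than an error case.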

Path 1 — Continued Pre-Training (CPT)

CPT extends a foundation model's training on a large corpus of unlabelled domain text. The output is a model that "speaks the domain" — knows the terminology, the conventions, the typical patterns — without being trained for any specific task.

When to use: When you have a large domain corpus (typically 100MB+ of high-quality text), you don't have labelled task data, and the foundation model's responses sound like a generalist who hasn't read the domain literature.

When not to use: When you have specific task-completion problems. CPT changes the model's familiarity with the domain; it does not teach a specific task.

Cost shape: Multi-day training run. Bedrock prices CPT by training hours and final model size; typical CPT runs land in the low thousands of USD for Titan-class models, and higher for larger models.

Production reality: Rare. Most teams that think they need CPT actually need better RAG — the corpus belongs in a Knowledge Base, not baked into a model. CPT is right for closed domains where retrieval-augmented patterns don't fit (highly specialised scientific or medical workflows, deeply technical engineering domains).

Path 2 — Supervised Fine-Tuning (SFT)

SFT trains a model on input/output pairs to produce specific responses. The output is a model that does one specific task well.

When to use: When you have hundreds-to-thousands of high-quality input/output pairs, the task is narrow and well-defined, and prompt engineering has hit a quality ceiling.

When not to use: When the task is broad (general reasoning, multi-step planning, tool use across many domains). SFT excels at narrow specialisation; it does not excel at making a model more capable in general.

Cost shape: Hours-to-days of training. Materially cheaper than CPT for the same model size. The bigger cost is the labelled dataset — getting 1,000 high-quality input/output pairs is non-trivial.

Production sweet spots: Document classification, structured extraction (resume parsing, invoice line-item extraction, log-format normalisation), style transfer (formal-to-casual rewriting, brand-voice generation), narrow generation (specific report formats, regulatory-filing language).

Working SFT input format on Bedrock (JSONL):

{"prompt": "Extract invoice fields from: ACME LTD #INV-2026-0042 ...", "completion": "{\"vendor\": \"ACME LTD\", \"invoice_id\": \"INV-2026-0042\", ...}"}
{"prompt": "Extract invoice fields from: Beta Co. Invoice 2026-0091 ...", "completion": "{\"vendor\": \"Beta Co.\", \"invoice_id\": \"2026-0091\", ...}"}
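Malformed records are a common reason a training job fails or quietly trains on bad data, so it can be worth checking the file before upload. A minimal sketch, assuming the `prompt`/`completion` record shape shown above; the function name `validate_sft_jsonl` is hypothetical.

```python
import json


def validate_sft_jsonl(lines):
    """Return a list of (line_number, problem) for records that would not train cleanly."""
    problems = []
    for n, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((n, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((n, "record is not a JSON object"))
            continue
        # Both fields must be present and be non-empty strings.
        for key in ("prompt", "completion"):
            value = record.get(key)
            if not isinstance(value, str) or not value.strip():
                problems.append((n, f"missing or empty '{key}'"))
    return problems
```

Usage: pass an open file object, for example `validate_sft_jsonl(open("train.jsonl"))`, and fix every reported line before starting a job.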

Train against Llama 4 Maverick or Titan Text Premier; deploy the resulting custom model via on-demand or provisioned throughput; invoke through the same Bedrock API as the foundation models.
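For orientation, a fine-tuning job is typically started with the `create_model_customization_job` call on the boto3 `bedrock` client. The sketch below only assembles the request and does not call AWS. The bucket, role ARN, and base model ID are placeholders, and the hyperparameter names and values are assumptions, since they differ per base model.

```python
def build_sft_job_request(job_name, base_model_id, role_arn, train_s3_uri, output_s3_uri):
    """Assemble keyword arguments for bedrock.create_model_customization_job."""
    return {
        "jobName": job_name,
        "customModelName": f"{job_name}-model",
        "roleArn": role_arn,
        "baseModelIdentifier": base_model_id,
        "customizationType": "FINE_TUNING",
        "trainingDataConfig": {"s3Uri": train_s3_uri},
        "outputDataConfig": {"s3Uri": output_s3_uri},
        # Assumed names and values; check the documentation for the chosen base model.
        "hyperParameters": {"epochCount": "2", "batchSize": "1", "learningRate": "0.00001"},
    }


# All identifiers below are placeholders for illustration.
request = build_sft_job_request(
    job_name="invoice-extractor-sft",
    base_model_id="BASE_MODEL_ID",
    role_arn="arn:aws:iam::123456789012:role/ExampleBedrockCustomizationRole",
    train_s3_uri="s3://example-bucket/train.jsonl",
    output_s3_uri="s3://example-bucket/output/",
)

# To submit (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("bedrock").create_model_customization_job(**request)
```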

Path 3 — Distillation

Distillation trains a smaller "student" model to mimic the behaviour of a larger "teacher" model on a specific task distribution. The output is a model that approaches the teacher's quality on the task while running at the student's cost.

When to use: When you've validated that a large model (Claude Sonnet or Opus, or Llama-large) produces the quality you need for a specific task, the task is high-volume enough that the per-call cost matters, and you want most of that quality on a smaller model that runs cheaper.

The Claude-to-smaller-model distillation pattern: Run Claude over the task corpus to generate high-quality input/output pairs; treat these as the training set for SFT on a smaller open model (Llama 4 Scout, Titan Text Lite, or similar). The distilled model captures much of Claude's task-specific quality at a fraction of the per-token cost.
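The data-generation half of that pattern can be sketched as follows. The teacher is passed in as a plain callable so the example runs without AWS access; in practice it would wrap a Claude invocation on Bedrock. `build_distillation_set` and `fake_teacher` are hypothetical names for illustration.

```python
import json
from typing import Callable, Iterable, List


def build_distillation_set(
    inputs: Iterable[str],
    teacher: Callable[[str], str],
    instruction: str,
) -> List[str]:
    """Run the teacher over raw inputs and return JSONL lines for student SFT."""
    lines = []
    for text in inputs:
        prompt = f"{instruction}{text}"
        completion = teacher(prompt)
        if not completion.strip():
            continue  # skip empty teacher outputs rather than train on them
        lines.append(json.dumps({"prompt": prompt, "completion": completion}))
    return lines


def fake_teacher(prompt: str) -> str:
    """Stand-in teacher for local testing only."""
    return '{"label": "billing"}' if "invoice" in prompt.lower() else ""


jsonl_lines = build_distillation_set(
    ["Invoice overdue", "hello"], fake_teacher, "Classify: "
)
```

A real pipeline would usually also filter teacher outputs for quality, for example by schema validation or spot checks, before they become training data.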

Production sweet spot: High-volume narrow tasks where Claude Sonnet is the quality bar but its per-token cost is prohibitive at scale. Examples: per-message classification across millions of daily messages, real-time content tagging, latency-critical extraction where Claude's latency is too high.

Cost shape: The expensive part is the teacher inference (paying Claude rates to generate training data). The training itself is SFT-class cost. The savings come at deployment: every production call goes to the smaller, cheaper student, not the teacher.
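That cost shape reduces to a simple payback calculation. The figures below are invented for illustration and are not Bedrock prices. The sketch also ignores any fixed hosting cost for the student model, which would lengthen the payback period.

```python
def payback_days(
    teacher_data_cost: float,
    training_cost: float,
    teacher_cost_per_call: float,
    student_cost_per_call: float,
    calls_per_day: int,
) -> float:
    """Days until the one-off distillation cost is recovered by cheaper inference."""
    saving_per_day = (teacher_cost_per_call - student_cost_per_call) * calls_per_day
    if saving_per_day <= 0:
        return float("inf")  # the student is not cheaper, so there is no payback
    return (teacher_data_cost + training_cost) / saving_per_day


# Illustrative numbers only: 3,000 USD one-off cost, 10,000 calls per day.
days = payback_days(
    teacher_data_cost=2000.0,
    training_cost=1000.0,
    teacher_cost_per_call=0.010,
    student_cost_per_call=0.002,
    calls_per_day=10_000,
)
```

With these made-up inputs the saving is about 80 USD per day, so the one-off cost is recovered in roughly 37.5 days.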

This is the most operationally impactful customization path for cmdev work in 2026 — it's the rare case where customization pays for itself in months, not years.

Path 4 — Custom Model Import

Custom Model Import (CMI) takes a model you've trained or obtained elsewhere — fine-tuned Llama, custom Mistral variants, your own pre-trained model — and makes it invokable through the standard Bedrock API.

When to use: When you have a model that isn't in the Bedrock catalogue but you want to consume it through the same API surface as Bedrock-native models. Compliance, billing consolidation, and operational simplicity argue for putting all model calls behind one API.

Constraints: The model architecture must be compatible (currently Llama, Mistral, Falcon, Mixtral variants and their derivatives — check the current compatibility list). Custom-trained transformers in novel architectures are not yet supported.

Cost shape: Custom Model Units (CMUs) for inference capacity. Different cost structure from on-demand foundation-model pricing. Right for steady-state workloads where the CMU reservation cost is amortised by sustained usage.

Production reality: Most commonly used by teams who have trained domain-specific Llama variants externally (HuggingFace, SageMaker training jobs, on-prem fine-tuning) and want them in the Bedrock invocation path for compliance and observability consistency.

The combined-models pattern

The right architectural conclusion from the above is not "customize one model" but "combine customised and frontier models per task." The pattern:

Combined-models topology: User query enters a router (small Haiku-class classifier). Router decides task type. Hard reasoning, complex synthesis, multi-step planning route to pinned Claude Sonnet/Opus. Narrow recurring tasks (invoice extraction, content tagging, log classification) route to custom-tuned Llama/Titan via Bedrock Custom Model. Both invocations pass through Guardrails. Both emit model-invocation logs to the same S3 audit bucket. Cost-tag per model identifies per-tier spend.

The reasoning:

  • Claude is irreplaceable for the hard step. Complex synthesis across retrieved context, novel multi-step planning, ambiguous tool selection, anything requiring the kind of reasoning Claude excels at. Pin a specific Claude version and use it without apology.
  • Custom-tuned smaller models win the narrow recurring step. The invoice extractor that runs ten thousand times a day does not need Claude's depth. It needs to do exactly that task, reliably, at a fraction of the per-call cost. A distilled or SFT'd Llama or Titan is the right tool.
  • A router decides which model gets the call. Often Claude Haiku (cheap, fast, good enough at classification) reads the incoming query and routes. Each routing decision costs on the order of $0.001; at high volume, the savings on queries routed to the cheaper model outweigh that overhead many times over.
  • Both paths share the same observability and audit infrastructure. Model-invocation logs in the same S3 bucket. Guardrails on both invocation paths. Cost tags on both. The operational layer is unified even when the model layer is split.
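The routing step described above can be sketched in a few lines. The model aliases and task-type labels are hypothetical, and `keyword_classifier` is a stand-in for the Haiku-class classifier so the example runs locally.

```python
from typing import Callable

# Hypothetical aliases; in practice these would be Bedrock model IDs or ARNs.
REASONING_MODEL = "pinned-claude-sonnet"
CUSTOM_MODELS = {
    "invoice_extraction": "custom-invoice-extractor",
    "content_tagging": "custom-content-tagger",
    "log_classification": "custom-log-classifier",
}


def pick_model(task_type: str) -> str:
    """Narrow recurring tasks go to a custom model; everything else goes to Claude."""
    return CUSTOM_MODELS.get(task_type, REASONING_MODEL)


def route(query: str, classify: Callable[[str], str]) -> str:
    """Classify the query with a small model, then choose the target model."""
    return pick_model(classify(query))


def keyword_classifier(query: str) -> str:
    """Stand-in classifier for local testing only."""
    return "invoice_extraction" if "invoice" in query.lower() else "other"
```

Defaulting unknown task types to the reasoning model is a design choice: a misclassified query costs more, but it does not land on a narrow model that cannot handle it.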

A working pattern in Strands (per Part 3):

import json

from strands import Agent, tool


@tool
def extract_invoice(text: str) -> dict:
    """Extract structured fields from an invoice."""
    # Narrow, high-volume step: runs on the custom-tuned Bedrock model.
    custom_extractor = Agent(
        model="arn:aws:bedrock:us-east-1:123:custom-model/invoice-extractor-v3",
        system_prompt="Extract: vendor, invoice_id, date, line_items, total. Return JSON.",
    )
    # str() of the agent result is the response text, which should be a JSON object.
    return json.loads(str(custom_extractor(text)))


reasoning_agent = Agent(
    model="anthropic.claude-sonnet-4-6-20251022",  # Claude for reasoning
    tools=[extract_invoice],  # register further tools here
    system_prompt="You are a finance ops agent. Analyse invoices and answer questions.",
)

reasoning_agent("What's our total spend with ACME this quarter, with anomalies highlighted?")

The reasoning agent runs Claude. The extraction tool runs a custom-tuned Bedrock model. Provided model-invocation logging, cost tags, and Guardrails are configured for both models, both invocations are observable, cost-attributable, and Guardrails-wrapped.

The evaluation harness — non-optional

You cannot ship a customised model to production without a regression-testing harness. The shape:

  • A held-out test set — 100-500 examples never used in training. Each example has the input, the expected output (or expected-output rubric for generative tasks), and metadata (difficulty, category, language).
  • Quality metrics per example:
    • Classification tasks → accuracy, precision/recall per class, confusion matrix
    • Structured extraction → field-level F1, schema-compliance rate
    • Generation tasks → BLEU/ROUGE plus LLM-as-judge, where a frontier model such as Claude Sonnet scores each output against an explicit quality rubric
    • Latency and cost per example
  • Comparison against the baseline — the foundation model without customization is the baseline. The customised model has to beat it on the quality metric and on the cost-or-latency dimension that motivated the customization. If it doesn't, ship the foundation model and call the customization experiment what it was: an experiment.
  • Drift detection in production — periodically run the harness against the customised model in production. Production performance can degrade as input distributions shift (new vendors in the invoice extractor, new log formats in the classifier). The drift signal is the trigger to re-train.
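Two of the structured-extraction metrics named above, field-level F1 and schema-compliance rate, can be sketched as follows. Exact string matching on field values is an assumption made for simplicity; a real harness may need normalisation for dates, amounts, and whitespace.

```python
from typing import Dict, List, Set


def field_f1(predicted: Dict, expected: Dict) -> float:
    """F1 over (field, value) pairs; a field is correct only if its value matches exactly."""
    pred_items = {(k, str(v)) for k, v in predicted.items()}
    exp_items = {(k, str(v)) for k, v in expected.items()}
    true_positives = len(pred_items & exp_items)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_items)
    recall = true_positives / len(exp_items)
    return 2 * precision * recall / (precision + recall)


def schema_compliance_rate(outputs: List[Dict], required: Set[str]) -> float:
    """Share of outputs that contain every required field."""
    if not outputs:
        return 0.0
    compliant = sum(1 for o in outputs if required.issubset(o.keys()))
    return compliant / len(outputs)
```

Averaging `field_f1` over the held-out set gives the headline quality number, and the same function can be rerun on sampled production traffic for drift detection.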

Without the harness, every customization decision is faith-based. With it, customization is engineering.
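The baseline comparison can be turned into an explicit ship gate. This is a sketch under assumptions: the `EvalSummary` fields and the `min_quality_gain` margin are hypothetical, and "beats on cost or latency" is interpreted as improving at least one of the two.

```python
from dataclasses import dataclass


@dataclass
class EvalSummary:
    """Aggregated harness results for one model (hypothetical shape)."""
    quality: float         # e.g. mean field-level F1 or accuracy on the held-out set
    cost_per_call: float   # average cost per example
    p95_latency_ms: float  # 95th percentile latency


def should_ship(custom: EvalSummary, baseline: EvalSummary, min_quality_gain: float = 0.0) -> bool:
    """Ship only if the custom model beats the baseline on quality AND on cost or latency."""
    beats_quality = custom.quality > baseline.quality + min_quality_gain
    beats_efficiency = (
        custom.cost_per_call < baseline.cost_per_call
        or custom.p95_latency_ms < baseline.p95_latency_ms
    )
    return beats_quality and beats_efficiency
```

Wiring this check into the deployment pipeline makes "ship the foundation model instead" the automatic outcome when the experiment does not pay off.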

Operational concerns

Five things that bite in production:

  • Training-data IP and copyright. The training corpus has to be data you have the right to train on. Customer-derived data needs explicit consent or contractual basis. Public-web data has uncertain legal status in some jurisdictions. The cleaner the data provenance, the cleaner the deployment.
  • Personally identifiable information (PII) in training data. Customised models can memorise specific examples and regenerate them under certain prompts. PII in training data is therefore PII at inference time, which is an NDPA / GDPR / NIS2 problem. Redact or synthesise before training.
  • Evaluation cost. Running a thousand-example harness against three candidate models is not free — it's a real LLM-as-judge bill. Budget it.
  • Deployment cost structure. On-demand pricing for custom models is competitive but assumes burst usage. Provisioned throughput is cheaper per-call for sustained workloads but commits to capacity. Calculate break-even against actual traffic before choosing.
  • Versioning and rollback. Customised models are versioned. Production should always have at least two versions deployable so a regression can be rolled back without retraining. Treat custom-model deployment like database migrations: forward-only is a trap.
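On the PII point above, the sketch below shows the general shape of a redaction pass over training text. It is deliberately crude: the two regular expressions are assumptions, they catch only simple email and phone formats, and the phone pattern will also match other long digit runs such as invoice numbers. A production pipeline would more likely rely on a dedicated PII-detection service.

```python
import re

# Simple patterns for illustration; they are neither exhaustive nor precise.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def redact_record(record: dict) -> dict:
    """Apply redaction to both sides of a prompt/completion training record."""
    return {key: redact(value) for key, value in record.items()}
```

Redacting both the prompt and the completion matters, because a model can memorise PII from either side of a training pair.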

When to keep going with prompts and RAG

The most underused move in 2026 production AI is to not customize the model at all. The signals it's the right call:

  • The task quality is improving with each prompt iteration. Keep iterating.
  • The RAG harness is improving with each retrieval tweak. Keep tuning.
  • The variance is dominated by edge cases that look idiosyncratic, not by systematic gaps. More training data won't fix idiosyncrasy.
  • The volume doesn't justify the customization cost. A task with 1,000 daily invocations on Claude Sonnet costs about as much as a junior analyst's morning coffee. Customisation pays back at higher volume, and not always even then.
  • The model catalogue moves faster than you do. By the time a customised model ships, the next-generation foundation model may match it without the customization overhead.

Customisation is a real lever. It is also a lever many teams pull when the actual problem is upstream. The combined-models pattern works because each model does what it's good at. The first move is to be honest about what the foundation model is good at, and whether you've actually exhausted that envelope.

What's next

Part 5 picks up the orchestration layer one level higher: when a single Bedrock Agent isn't the right shape and the work needs to span multiple model invocations, multiple AWS services, and conditional branches — the AWS Step Functions story. The combined-models pattern from this piece composes naturally into Step Functions workflows; Part 5 unpacks how.

Later in the series, customised models meet production-grade observability in Part 6 and cost-tier routing in Part 7.

