Engineering

Multi-Model AI on Amazon Bedrock: How We Deploy the Right Model for Every Task

cmdev10 min read
Multi-Model AI on Amazon Bedrock: How We Deploy the Right Model for Every Task
Share
~15 min

The model selection problem

A client in Lagos asked us to build a document processing system last year. The requirements were straightforward on paper: classify incoming documents by type, extract structured data from each one, generate embeddings for search, and flag anything containing personally identifiable information. Four distinct tasks. The instinct — and what most teams do — is to pick one powerful model and throw everything at it.

We tried that first. Claude handled all four tasks competently. But "competently" is not the same as "efficiently." We were spending $0.015 per document on classification alone — a task that required roughly 50 input tokens and a single-word response. That is the equivalent of hiring a senior architect to sort your mail. The classification accuracy was excellent, but the cost-per-document made the project uneconomical at the volumes our client needed (tens of thousands of documents per month).

This is the model selection problem in practice. No single model wins at everything. Large language models optimized for complex reasoning are wasteful on simple classification. Embedding models produce dense vector representations but cannot reason about content. Image generation models cannot read text. Each model family occupies a specific point on the capability-cost-latency curve, and production systems that ignore this reality either overpay or underperform.

The answer is not to find the perfect model. It is to build an architecture that routes each task to the model best suited for it.

Our Bedrock architecture

Amazon Bedrock gives us access to multiple foundation models through a single API surface. We do not manage GPU instances, handle model weights, or maintain inference infrastructure. We make API calls, and Bedrock handles the rest. For a team deploying AI solutions across African enterprises — where GPU availability is limited and infrastructure complexity is a liability — this tradeoff is worth it.

Our standard multi-model architecture assigns models to task categories based on empirical testing across client workloads.

Claude (Anthropic) handles complex reasoning, multi-step analysis, document understanding, and any task requiring nuanced interpretation. When a contract needs clause-by-clause analysis, when a financial report needs to be summarized with accurate numerical reasoning, when an ambiguous customer request needs to be understood in context — Claude is the model we route to. It is the most capable model in our stack and the most expensive per token. We use it only where that capability is required.

Amazon Titan serves two roles. Titan Embeddings generates vector representations of text for semantic search and retrieval-augmented generation (RAG) pipelines. Titan Text Generation handles straightforward text tasks where Claude's reasoning depth is unnecessary — template-based responses, simple reformatting, and structured data extraction from well-defined schemas.

Stability AI handles image generation when client applications require it. Product visualization, document layout generation, and marketing asset creation route through Stability's diffusion models on Bedrock.

Cohere powers semantic search and reranking. When a user queries a document corpus, Cohere's reranking model scores candidate results by semantic relevance, improving retrieval quality beyond what keyword matching or raw embedding similarity provides.

Mistral is our workhorse for fast, cheap classification and routing decisions. Mistral models respond in tens of milliseconds, cost a fraction of a Claude call, and handle binary or multi-class classification with high accuracy. In most of our pipelines, Mistral is the first model that touches an incoming request — it decides what happens next.

These models connect through a routing layer that examines each incoming task and dispatches it to the appropriate model. The routing layer itself is a lightweight Mistral call (or in some cases, a deterministic rule based on the API endpoint or document metadata). The key architectural principle: the router must be cheaper and faster than every model it routes to.

Practical patterns: prompt routing, fallback chains, cost optimization

The routing layer is the core of a multi-model architecture. We implement it as a classification step that runs before the primary model call.

An incoming request hits the router with minimal context — typically the first 200 tokens of the input plus metadata about the source and requested operation. The router classifies the request into a complexity tier: simple, moderate, or complex. Simple requests (template fills, yes/no classification, format conversion) route to Mistral or Titan. Moderate requests (structured extraction from semi-structured documents, summarization with specific constraints) route to Titan Text or a smaller Claude model. Complex requests (multi-document reasoning, nuanced analysis, ambiguous interpretation) route to Claude.

This tiered routing reduces costs substantially because the distribution of real-world requests is heavily skewed toward simplicity. In a typical document processing pipeline, 60-70% of incoming tasks are simple classification or extraction from well-structured forms. Only 10-15% require Claude-level reasoning. Routing everything through Claude means paying complex-task prices for simple-task work.

Fallback chains add resilience. If the primary model for a task returns an error, times out, or produces output that fails validation, the system automatically routes to a backup model. Our standard fallback chain for text tasks is: primary model -> Claude (as universal fallback) -> error queue for human review. The fallback adds latency but prevents task failure. In production, fallback activation rates run below 2% for most pipelines, so the cost impact is negligible.

Cost optimization also comes from caching. Embedding generation is deterministic — the same input always produces the same vector. We cache embeddings aggressively, which eliminates redundant Titan calls for documents that have already been processed. For classification, we maintain a lookup table of previously classified document types. If an incoming document's metadata matches a cached classification with high confidence, we skip the Mistral call entirely.

Bedrock Guardrails

Every model call in our pipelines passes through Bedrock Guardrails. This is not optional. When you process documents for financial institutions, insurance companies, and government agencies in Nigeria, every API response is a potential compliance event.

We configure guardrails as a cross-cutting concern — they apply uniformly across all model invocations regardless of which foundation model is handling the task. The guardrail configuration specifies content filtering policies (blocking harmful content generation), PII detection and redaction policies, and topic restrictions (preventing models from generating content outside their intended scope).

PII redaction is the guardrail we use most heavily. A typical document processing pipeline ingests contracts, invoices, KYC documents, and correspondence that contain names, phone numbers, Bank Verification Numbers (BVNs), and national identification numbers. Our guardrails are configured to detect and redact PII before model outputs are stored or returned to client applications. The model can reason about a document containing a customer's BVN, but the BVN itself never appears in logs, cached responses, or downstream outputs.

The practical implementation works like this: we define a guardrail in Bedrock with PII entity types specified (name, phone number, email, national ID, financial account number). We attach this guardrail to every model invocation via the guardrail identifier and version in the API call. Bedrock intercepts the model's response, scans for PII entities, and either redacts them (replacing with placeholder tokens) or blocks the response entirely if the PII density exceeds a threshold. This happens at the platform level — our application code does not need to implement its own PII scanning.

This matters for NDPA compliance. The Nigeria Data Protection Act requires data controllers to implement appropriate technical measures to protect personal data. Bedrock Guardrails give us an auditable, consistent mechanism for PII handling that applies across every model in our stack.

Real-world deployment: document processing pipeline

Here is a concrete pipeline we run in production for a client processing commercial insurance documents.

Step 1: Ingestion and classification. A document arrives via API upload or email integration. The system extracts text (using Amazon Textract for scanned documents, direct parsing for PDFs with embedded text). The first 500 tokens of extracted text, along with file metadata (filename, source, file type), are sent to Mistral for classification. Mistral returns a document type label: policy_application, claim_form, endorsement_request, premium_invoice, correspondence, or unknown. This call takes 40-80 milliseconds and costs roughly $0.0001. Documents classified as "unknown" are flagged for human review.

Step 2: Structured extraction. Based on the classification, the system selects an extraction prompt template specific to that document type. A policy application template instructs the model to extract: policyholder name, business type, coverage requested, sum insured, risk location, and proposed inception date. The document text and extraction template are sent to Claude, which returns a structured JSON object with the extracted fields plus confidence scores. This is where Claude's reasoning capability matters — insurance documents are inconsistently formatted, contain ambiguous language, and sometimes contradict themselves. Claude handles these edge cases; a simpler model would hallucinate or fail silently. This call takes 2-5 seconds and costs $0.005-0.02 depending on document length.

Step 3: Embedding and indexing. The extracted structured data and the original document text are sent to Titan Embeddings, which generates a 1,536-dimensional vector representation. This vector is stored in a vector database (Amazon OpenSearch Serverless with vector engine) alongside the structured metadata. The embedding enables semantic search across the document corpus — an underwriter can search for "construction projects in Lagos with flood risk" and retrieve relevant policy applications even if those exact words do not appear in the documents. The embedding call costs approximately $0.0001 per document.

Total cost per document: $0.005-0.02. Total processing time: 3-8 seconds. Three models, each handling the task it is best suited for.

Cost comparison: single large model vs multi-model routing

We ran both architectures in parallel for 30 days on the same document corpus to get real numbers. The corpus contained 12,400 commercial insurance documents of mixed types.

With a single-model approach (Claude handling classification, extraction, and embedding-equivalent summarization), the total cost was $412 for the month. Average per-document cost was $0.033. Classification accuracy was 94%. Extraction accuracy was 91%.

With multi-model routing (Mistral for classification, Claude for extraction, Titan for embeddings), the total cost was $178 for the month. Average per-document cost was $0.014. Classification accuracy was 92% — slightly lower than Claude but well within acceptable range. Extraction accuracy remained at 91% because Claude still handled extraction. Embedding quality improved because Titan's embedding model produces purpose-built vector representations rather than the text summaries we were generating as a proxy with Claude.

The multi-model approach reduced costs by 57% while maintaining extraction accuracy and improving search quality. Classification accuracy dropped 2 percentage points — a tradeoff we accepted because misclassified documents simply route to a generic extraction template, and the extraction model (Claude) is robust enough to handle the mismatch.

The cost savings compound at scale. At 50,000 documents per month, the single-model approach would cost roughly $1,650. The multi-model approach would cost approximately $700. That $950 monthly difference funds other improvements to the pipeline.

When to use Bedrock vs self-hosted

Bedrock is not always the right answer. We use a decision framework based on four factors.

Use Bedrock when you need access to multiple model families. Running Claude, Mistral, Titan, and Stability AI on your own infrastructure means managing four different model serving stacks, each with its own GPU requirements, scaling characteristics, and operational overhead. Bedrock abstracts all of this behind a unified API. If your architecture depends on model diversity, Bedrock's value proposition is strong.

Use Bedrock when you want managed guardrails and compliance features. Implementing PII detection, content filtering, and toxicity screening from scratch is months of engineering work. Bedrock provides these as configuration, not code. For regulated industries — which describes most of our client base in financial services and insurance — this matters.

Use Bedrock when your workload is variable or unpredictable. Bedrock charges per token with no minimum commitment (on-demand pricing). If your document volumes spike during regulatory filing periods and drop during holidays, you pay only for what you use. Self-hosted GPU instances charge by the hour whether they are processing documents or sitting idle.

Self-host when you have fine-tuned models that are central to your product. Bedrock supports custom model import, but the workflow is more constrained than running your own inference stack. If you have invested heavily in fine-tuning a model on proprietary data and that fine-tuned model is your competitive advantage, self-hosting gives you full control over the training and serving pipeline.

Self-host when you have predictable, high-volume workloads that would benefit from reserved GPU capacity. If you process one million documents per month with minimal variance, reserved GPU instances on EC2 (or a dedicated inference endpoint) will be cheaper than Bedrock's per-token pricing. The breakeven point depends on the specific models and instance types, but for sustained high-throughput workloads, self-hosting typically wins on unit economics.

Self-host when you need inference latency guarantees below what Bedrock provides. Bedrock's on-demand endpoints introduce variable latency based on service load. For real-time applications where every millisecond matters (live trading, interactive voice agents), a dedicated inference endpoint — whether self-managed or through Bedrock's provisioned throughput — gives you predictable performance.

For most of our client deployments, Bedrock is the right choice. The African enterprise market is characterized by variable workloads, regulatory complexity, limited in-house ML infrastructure teams, and a need for rapid deployment. Bedrock lets us ship multi-model AI systems in weeks rather than months, with compliance features built in rather than bolted on. When a client's workload grows to the point where self-hosting makes economic sense, we help them migrate — but that is an optimization for scale, not a starting requirement.

aiamazon-bedrockmulti-modelawsclaudearchitecture

Ready to strengthen your security posture?

We help organizations across Africa build resilient infrastructure, deploy AI at scale, and navigate complex regulatory environments.

Start a conversation