The demo works. Production does not.
Every AI proof-of-concept we have seen looks impressive. A product manager opens a notebook, feeds a document into Claude, gets a structured response, and declares the feature ready. The board deck gets a slide about AI transformation. Engineering gets a two-week deadline.
Then production happens. The first document a real user uploads is a scanned PDF with handwriting in the margins. The second is a WhatsApp screenshot. The third is a 47-page contract that exceeds the model's context window. The model returns malformed JSON that crashes the downstream parser. A traffic spike on Tuesday morning exhausts the API rate limit. Nobody notices until a customer calls.
The model is roughly 20% of the work in a production AI system. The other 80% is everything around it: input validation, document preprocessing, error handling, retry logic, output verification, state tracking, cost controls, monitoring, and integration with the systems your users already depend on. This is pipeline engineering, and it is the gap between a demo and a product.
We have built AI pipelines for document processing, contract analysis, customer support triage, and regulatory compliance workflows across African enterprises. Here is the architecture pattern that has held up.
The pipeline pattern
Every production AI pipeline we deploy follows the same six-stage pattern: intake, classify, route, process, validate, store. Each stage is a discrete unit with its own error handling, logging, and retry behavior.
Intake is where documents and requests enter the system. This is not a simple API endpoint. In production, intake means handling file uploads via presigned S3 URLs, accepting webhook payloads from third-party systems, polling email inboxes, and processing messages forwarded from WhatsApp Business API. Each intake channel normalizes the input into a standard envelope — a JSON object containing the raw content, metadata (source, timestamp, user ID), and a unique pipeline execution ID. The envelope lands in an SQS queue.
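As a rough illustration of the standard envelope, here is a minimal Python sketch. The key names (`execution_id`, `metadata`, `received_at`, `raw_content`) and the example source label are assumptions for illustration; the text above only specifies what kinds of data the envelope carries.

```python
import json
import uuid
from datetime import datetime, timezone


def build_envelope(raw_content: str, source: str, user_id: str) -> dict:
    """Wrap one inbound item in a channel-independent envelope (key names are illustrative)."""
    return {
        "execution_id": str(uuid.uuid4()),  # unique pipeline execution ID
        "metadata": {
            "source": source,  # e.g. "s3-upload", "webhook", "email", "whatsapp"
            "user_id": user_id,
            "received_at": datetime.now(timezone.utc).isoformat(),
        },
        "raw_content": raw_content,  # or a pointer to the stored object for large files
    }


envelope = build_envelope("s3://example-bucket/uploads/invoice.pdf", "s3-upload", "user-123")
message_body = json.dumps(envelope)  # the string that would be sent to the SQS queue
```

Because every channel produces the same shape, the later stages never need to know where a document came from.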
Classify is a lightweight first-pass inference. A smaller, faster model (or a rules engine, depending on volume) examines the input and determines what it is and how urgent it is. For a document processing pipeline, classification might output: { "type": "invoice", "language": "en", "urgency": "standard", "pageCount": 3, "quality": "clean" }. For a scanned image, it might flag "quality": "low" and trigger an OCR preprocessing step before anything else. Classification keeps the expensive models focused on work that matches their capabilities.
Route is decision logic, not inference. Based on the classification output, a Lambda function determines which model handles the request, what prompt template to use, and what processing parameters apply. An urgent contract review goes to Claude Sonnet on Bedrock with a specialized legal extraction prompt. A routine invoice goes to a lighter model. A document flagged as low quality gets routed through Textract for OCR before reaching any language model. Routing is where you implement cost control at the architectural level — not every request deserves your most expensive model.
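The routing decision can be sketched as a pure function over the classification output. The model labels and template names below are placeholders, not real Bedrock model IDs, and the rules are a simplified reading of the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str            # placeholder label, not a real Bedrock model ID
    prompt_template: str  # name of a versioned prompt template (hypothetical)
    needs_ocr: bool       # run OCR preprocessing before any language model


def route(classification: dict) -> Route:
    """Pure decision logic: no inference happens here."""
    needs_ocr = classification.get("quality") == "low"
    if classification.get("type") == "contract" and classification.get("urgency") == "urgent":
        return Route("primary-model", "legal-extraction-v1", needs_ocr)
    # Everything else defaults to the cheaper model.
    return Route("light-model", "generic-extraction-v1", needs_ocr)
```

Keeping this stage free of model calls makes it cheap to unit test and easy to change when pricing or model availability shifts.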
Process is the main inference call. The Lambda function assembles the prompt from a versioned template, attaches the preprocessed input, and calls Bedrock. This is the stage most people think of as "the AI part." In practice, it is the most straightforward step in the pipeline because Bedrock abstracts away infrastructure management. The complexity is in everything surrounding it.
Validate checks the model's output before it reaches any downstream system. This means parsing the response to confirm it matches the expected schema (Zod validation against a defined output type), running guardrails to check for hallucinated data or policy violations, and comparing extracted values against known reference data where available. If validation fails, the pipeline can retry with a modified prompt, escalate to a human reviewer, or reject the result with a clear error. Validation is not optional. Models produce confidently wrong output, and production systems must catch it.
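The text names Zod for schema validation. The sketch below shows the same idea in Python using only the standard library, so it is a stand-in rather than the validation code described above. The invoice schema and its field names are hypothetical.

```python
import json

# Hypothetical output schema: field name -> accepted Python type(s).
INVOICE_SCHEMA = {"invoice_number": str, "total": (int, float), "currency": str}


def validate_output(raw_text: str, schema: dict) -> tuple[bool, list[str]]:
    """Return (ok, errors) for a model response that should be a JSON object."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return False, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return False, ["top-level value is not an object"]
    errors = []
    for field, expected in schema.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            errors.append(f"wrong type for {field}")
    return not errors, errors
```

The error list is what a retry prompt or a human reviewer would see, so it is worth keeping the messages specific.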
Store persists results and updates pipeline state. Validated outputs go to the appropriate destination: a database record, a webhook to the client's system, a formatted report in S3. The pipeline execution record in DynamoDB is updated with the final status, processing time, token usage, and any validation flags. This record is what makes the pipeline observable and auditable.
Infrastructure
The stack is deliberately simple: Bedrock for inference, Lambda for orchestration, S3 for document storage, DynamoDB for pipeline state, and SQS for queuing between stages.
Serverless is the right choice for AI workloads because traffic is inherently bursty. A document processing pipeline might handle 50 requests on Monday morning and 3 on Saturday. Provisioning servers for peak load means paying for idle compute 90% of the time. Lambda scales to zero between invocations and scales up to handle bursts without manual intervention.
Each pipeline stage is a separate Lambda function. This is intentional. It means each stage can scale independently, fail independently, and be deployed independently. The classify stage might need to scale to 100 concurrent invocations during a batch upload while the validate stage handles results at a steady pace. Separate functions also mean separate timeout configurations — classification might complete in 2 seconds, but the process stage might need the full 15-minute Lambda maximum for large documents.
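SQS queues between stages provide backpressure. If the process stage is overwhelmed (hitting Bedrock rate limits), messages accumulate in the queue rather than failing. The queue's visibility timeout is set to the Lambda function's timeout plus a buffer (AWS guidance for SQS event sources is at least six times the function timeout), which keeps a message from reappearing while it is still being processed. This reduces duplicate processing but cannot eliminate it, because standard queues deliver at least once; each stage should tolerate a repeated message. Each queue has a dead letter queue (DLQ) configured to capture messages that fail after the maximum retry count.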
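A minimal sketch of the queue settings this paragraph describes, built as the `Attributes` map that SQS `CreateQueue` and `SetQueueAttributes` accept. The helper name, the example ARN, and the retry count of 3 are assumptions. The multiplier follows AWS's documented guidance of at least six times the function timeout for queues that feed Lambda.

```python
import json


def queue_attributes(lambda_timeout_s: int, dlq_arn: str, max_receive_count: int = 3) -> dict:
    """Build SQS queue attributes for a stage queue with a dead letter queue."""
    return {
        # SQS attribute values are strings; 6x the function timeout per AWS guidance.
        "VisibilityTimeout": str(6 * lambda_timeout_s),
        # After max_receive_count failed receives, SQS moves the message to the DLQ.
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": max_receive_count}
        ),
    }


attrs = queue_attributes(60, "arn:aws:sqs:eu-west-1:123456789012:process-dlq")
```

The resulting dictionary can be passed as `Attributes=attrs` when creating the queue with boto3.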
DynamoDB stores pipeline execution state with the execution ID as the partition key. Each stage writes its status, timing, and output metadata to the same record. This gives us a single query to see the full lifecycle of any request — when it entered, what was classified, which model processed it, whether validation passed, and where results were delivered.
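One way to have each stage write into the same record is a per-stage attribute set with `update_item`. The sketch below only builds the arguments; the attribute layout, helper name, and table name in the comment are assumptions, not the schema used in the pipelines described here.

```python
from datetime import datetime, timezone


def stage_update_args(execution_id: str, stage: str, status: str, duration_ms: int) -> dict:
    """Keyword arguments for boto3's DynamoDB Table.update_item, one attribute per stage."""
    return {
        "Key": {"execution_id": execution_id},  # partition key, as described in the text
        "UpdateExpression": "SET #stage = :record",
        # A name placeholder avoids clashes with DynamoDB reserved words.
        "ExpressionAttributeNames": {"#stage": stage},
        "ExpressionAttributeValues": {
            ":record": {
                "status": status,
                "duration_ms": duration_ms,
                "updated_at": datetime.now(timezone.utc).isoformat(),
            }
        },
    }


args = stage_update_args("exec-123", "classify", "succeeded", 1840)
# With a real table (name is hypothetical):
# boto3.resource("dynamodb").Table("pipeline-executions").update_item(**args)
```

Because every stage touches a different attribute of the same item, a single `get_item` on the execution ID returns the whole lifecycle.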
Error handling and retry strategies
Model calls fail. Bedrock returns throttling errors under load, timeouts on complex prompts, and occasionally malformed responses. Production pipelines must handle all three gracefully.
For throttling and transient errors, we implement exponential backoff with jitter directly in the Lambda function:
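import random
import time
import boto3

bedrock = boto3.client("bedrock-runtime")  # assumed setup, not shown in the original snippet

class PipelineRetryExhausted(Exception):
    """Raised when every attempt was throttled."""

def call_bedrock_with_retry(payload, max_retries=3):
    # execution_id is pipeline metadata, not an InvokeModel parameter, so strip it.
    request = {k: v for k, v in payload.items() if k != "execution_id"}
    for attempt in range(max_retries):
        try:
            return bedrock.invoke_model(**request)
        except bedrock.exceptions.ThrottlingException:
            # Exponential backoff capped at 30 s, plus up to 1 s of random jitter.
            wait = min(30, (2 ** attempt) + random.uniform(0, 1))
            time.sleep(wait)
    raise PipelineRetryExhausted(payload["execution_id"])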
The jitter component is critical. Without it, all concurrent retries fire at the same instant, creating a thundering herd that triggers more throttling.
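To make the thundering-herd point concrete, the short sketch below computes the second-attempt delay for five hypothetical clients, with and without jitter, using the same formula as the retry function. The helper name and client count are illustrative.

```python
import random


def backoff_delay(attempt: int, cap: float = 30.0, jitter: bool = True) -> float:
    """Delay in seconds before the next retry; same formula as the retry function."""
    base = 2 ** attempt
    return min(cap, base + random.uniform(0, 1)) if jitter else min(cap, base)


clients = 5
no_jitter = [backoff_delay(1, jitter=False) for _ in range(clients)]
with_jitter = [backoff_delay(1) for _ in range(clients)]
# no_jitter holds five identical delays, so all five clients retry together.
# with_jitter spreads the retries across a one-second window.
```

A wider jitter range spreads retries further at the cost of slightly longer average waits.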
For malformed output — the model returns text that does not parse as valid JSON, or JSON that does not match the expected schema — the pipeline retries with a stricter prompt that includes an explicit example of the expected format. If the second attempt also fails, the item moves to a human review queue rather than retrying indefinitely.
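The two-attempt policy can be sketched as follows. `call_model` is a hypothetical callable standing in for the Bedrock call, and the format example is illustrative; returning `None` is just one way to signal that the item belongs in the human review queue.

```python
import json
from typing import Callable, Optional

FORMAT_EXAMPLE = '{"invoice_number": "INV-001", "total": 100.0}'  # illustrative only


def extract_with_format_retry(call_model: Callable[[str], str], prompt: str) -> Optional[dict]:
    """Try once, retry once with a stricter prompt, then give up."""
    stricter = f"{prompt}\n\nRespond with JSON only, exactly in this shape:\n{FORMAT_EXAMPLE}"
    for attempt_prompt in (prompt, stricter):
        try:
            parsed = json.loads(call_model(attempt_prompt))
        except json.JSONDecodeError:
            continue  # malformed output: fall through to the stricter prompt
        if isinstance(parsed, dict):
            return parsed
    return None  # caller routes this item to the human review queue
```

Capping the loop at two prompts is what keeps a stubbornly malformed document from burning tokens indefinitely.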
Graceful degradation is built into the routing stage. If the primary model is consistently throttled (detected by a rolling error rate counter in DynamoDB), the router automatically falls back to a lighter model for non-critical requests. An invoice extraction that would normally use Claude Sonnet can fall back to a smaller model with acceptable accuracy. A contract analysis that requires the full model gets queued for retry rather than degraded.
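A minimal in-memory sketch of the fallback decision. The text describes a rolling counter stored in DynamoDB; the `deque` here is only a stand-in so the logic is easy to see, and the window size, 20% threshold, and return labels are assumptions.

```python
from collections import deque


class RollingErrorRate:
    """Throttle rate over the last `window` calls (stand-in for the DynamoDB counter)."""

    def __init__(self, window: int = 50):
        self.outcomes = deque(maxlen=window)

    def record(self, throttled: bool) -> None:
        self.outcomes.append(throttled)

    def rate(self) -> float:
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0


def choose_model(critical: bool, tracker: RollingErrorRate, threshold: float = 0.2) -> str:
    if tracker.rate() < threshold:
        return "primary"
    # Primary is degraded: critical work waits, everything else degrades gracefully.
    return "queue-for-retry" if critical else "fallback"
```

Critical requests never reach the fallback model in this sketch, which mirrors the contract-analysis example above.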
Dead letter queues capture everything that exhausts retries. A separate Lambda function processes the DLQ on a schedule, categorizing failures by type (throttle, timeout, validation, unknown) and either retrying during lower-traffic periods or escalating to an operator dashboard. Nothing is silently dropped.
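The categorization step could look roughly like this. The keyword matching and the split between "retry later" and "escalate" are assumptions for illustration; a real handler would more likely inspect structured error codes than message text.

```python
RETRY_LATER = {"throttle", "timeout"}  # assumed split; tune per pipeline


def categorize_failure(error_message: str) -> str:
    """Map a raw error string to one of the four categories named in the text."""
    text = error_message.lower()
    if "throttl" in text:
        return "throttle"
    if "timeout" in text or "timed out" in text:
        return "timeout"
    if "validation" in text or "schema" in text:
        return "validation"
    return "unknown"


def next_action(error_message: str) -> str:
    """Decide whether a dead-lettered message is retried off-peak or escalated."""
    category = categorize_failure(error_message)
    return "retry_off_peak" if category in RETRY_LATER else "escalate_to_operator"
```

Sending "unknown" to an operator by default keeps unfamiliar failures visible instead of retrying them blindly.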
Monitoring
An AI pipeline without monitoring is a cost center you cannot understand and a reliability risk you cannot quantify. We track five categories of metrics.
Latency per stage. Every Lambda function records its execution time as a CloudWatch custom metric with dimensions for pipeline name, stage, and model ID. This lets us identify which stage is the bottleneck and whether latency is increasing over time — an early signal of model performance degradation or growing input complexity.
Token usage per request. Bedrock returns input and output token counts in every response. We log these to CloudWatch and aggregate by pipeline, model, and customer. Token usage is the primary cost driver, and tracking it per-request lets us attribute costs accurately and detect anomalies — a prompt injection attack, for example, might manifest as a sudden spike in output tokens.
Model accuracy. This requires a human feedback loop. For document extraction pipelines, we sample a percentage of results and route them to human reviewers who confirm or correct the extraction. The correction rate is tracked over time. If accuracy drops below a threshold, we investigate — it usually means input quality has changed (a new document format the model has not seen) or the prompt needs updating.
Cost per request. Calculated from token usage and Bedrock pricing, aggregated by pipeline and customer. We set CloudWatch alarms for daily and weekly cost thresholds. A cost anomaly alert has caught runaway retry loops, unexpected traffic spikes, and one case where a client's system was submitting the same document repeatedly due to a webhook misconfiguration.
Pipeline throughput. Messages in flight, queue depth, completion rate, failure rate. These are the operational health metrics. A growing queue depth means processing is not keeping up with intake. A rising failure rate means something has changed — model behavior, input format, or infrastructure.
The CloudWatch dashboard for each pipeline shows these five categories on a single screen. The goal is that an operator can glance at the dashboard and know whether the pipeline is healthy without reading logs.
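The cost-per-request metric is simple arithmetic over the token counts. A sketch, with made-up prices per 1,000 tokens (check current Bedrock pricing for the model in use):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_1k: float, output_price_per_1k: float) -> float:
    """Cost of one request in the pricing currency, from token counts and per-1k prices."""
    return (input_tokens / 1000) * input_price_per_1k + (output_tokens / 1000) * output_price_per_1k


# Hypothetical prices: 0.003 per 1k input tokens, 0.015 per 1k output tokens.
cost = request_cost(input_tokens=2000, output_tokens=500,
                    input_price_per_1k=0.003, output_price_per_1k=0.015)
```

Summing this value by pipeline and customer gives the aggregate that the daily and weekly alarms watch.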
Scaling patterns
Bedrock offers two consumption models: on-demand and provisioned throughput. On-demand is the right default — you pay per token with no commitment, and Bedrock handles scaling. Provisioned throughput reserves model capacity at a fixed hourly rate, guaranteeing consistent latency and throughput.
The decision point is predictable volume. If a pipeline processes fewer than 10,000 requests per day with variable timing, on-demand is cheaper and simpler. If a pipeline consistently processes 50,000+ requests per day during business hours, provisioned throughput eliminates throttling risk and often reduces per-token cost. We have run both, and the crossover point depends on average prompt size and model choice.
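The crossover can be estimated with a back-of-the-envelope calculation. The hourly rate and per-request cost below are invented for illustration. The sketch also assumes provisioned capacity is billed for all 24 hours of the day, even when traffic only arrives during business hours.

```python
def break_even_requests_per_day(provisioned_hourly_rate: float,
                                on_demand_cost_per_request: float) -> float:
    """Daily request volume above which provisioned throughput costs less than on-demand."""
    provisioned_daily_cost = provisioned_hourly_rate * 24  # billed around the clock
    return provisioned_daily_cost / on_demand_cost_per_request


# Hypothetical inputs: 40.0 per hour provisioned, 0.0135 per request on-demand.
threshold = break_even_requests_per_day(40.0, 0.0135)
```

Larger prompts raise the on-demand cost per request and pull the break-even volume down, which is why the crossover moves with prompt size and model choice.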
For batch processing (nightly document processing runs, periodic compliance scans) we use a different pattern entirely. Documents are uploaded to an S3 bucket during the day. An EventBridge rule triggers a Step Functions workflow at 2 AM that fans out processing across concurrent Lambda invocations, throttled to stay within Bedrock limits. Batch processing avoids competing with real-time traffic for model capacity. With provisioned throughput, which is billed at a fixed hourly rate, it also puts capacity to work that is already paid for and would otherwise sit idle overnight.
SQS-based fan-out handles traffic spikes in real-time pipelines. When intake volume exceeds the processing rate, messages queue naturally. Lambda's SQS event source mapping scales up consumers automatically, up to a configured concurrency limit that matches our Bedrock rate limit. This prevents the failure mode where Lambda scales faster than Bedrock can serve, causing a cascade of throttling errors.
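One way to pick the concurrency limit is Little's law: concurrent calls are roughly the request rate multiplied by the time each call takes. The sketch below is an estimate under that assumption, not the sizing method used in the pipelines above, and the example numbers are invented. The clamp reflects the 2 to 1000 range that Lambda's SQS event source mapping accepts for `MaximumConcurrency`.

```python
import math


def max_concurrency(requests_per_minute_limit: int, avg_latency_s: float) -> int:
    """Estimate how many concurrent consumers a per-minute model rate limit can sustain."""
    concurrent = math.floor(requests_per_minute_limit / 60 * avg_latency_s)
    # Clamp to the range the event source mapping accepts. At the floor of 2 the
    # consumers may still exceed a very low rate limit, so retries remain necessary.
    return max(2, min(1000, concurrent))


limit = max_concurrency(requests_per_minute_limit=120, avg_latency_s=8.0)
```

With these example inputs the estimate is 16 concurrent consumers: two requests per second, each held for eight seconds.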
Lessons from deploying for African enterprise clients
The architecture above is infrastructure. Deploying it for real clients across African markets introduces a set of challenges that no AWS reference architecture addresses.
Document quality is the first problem. Enterprise clients in Nigeria send us scanned PDFs where the scan quality varies by page. They forward WhatsApp images of physical documents — photographed at angles, in poor lighting, with fingers visible in frame. They send emails with critical information buried in the signature block or in an attachment named doc(1)(final)(v2).pdf. The classification stage must handle all of this. We run a document quality assessment before any model inference: resolution check, orientation detection, OCR confidence scoring. Documents below quality thresholds get routed to a preprocessing pipeline (image enhancement, deskewing, OCR) before they reach a language model. This preprocessing adds latency but prevents the more expensive failure of feeding garbage to a model and getting garbage back.
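The quality gate can be sketched as a function that maps measured quality signals to preprocessing steps. The thresholds and field names are placeholders, and the measurements themselves (resolution, skew, OCR confidence) are assumed to come from an upstream image-analysis step that is not shown.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    dpi: int               # scan resolution
    skew_degrees: float    # detected rotation away from upright
    ocr_confidence: float  # mean OCR confidence, 0.0 to 1.0


def preprocessing_steps(report: QualityReport, min_dpi: int = 200,
                        max_skew: float = 2.0, min_confidence: float = 0.85) -> list[str]:
    """Return the preprocessing steps a document needs before any language model sees it."""
    steps = []
    if report.dpi < min_dpi:
        steps.append("image_enhancement")
    if abs(report.skew_degrees) > max_skew:
        steps.append("deskew")
    if report.ocr_confidence < min_confidence:
        steps.append("ocr")
    return steps
```

An empty list means the document can go straight to the model, so clean uploads pay no extra latency.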
Connectivity is not reliable. A pipeline that assumes always-on internet works in Lagos on a fiber connection. It does not work for a client's branch office in Abeokuta during a rainy season outage. We design intake to be resilient: mobile apps queue submissions locally and sync when connectivity returns. Webhook endpoints accept batched submissions. The pipeline itself is asynchronous by design — clients submit and receive a tracking ID, then poll or receive a callback when processing completes. No part of the pipeline requires a synchronous round-trip from the client.
Operators are not engineers. The person monitoring the pipeline at a client's insurance company is an operations manager, not a DevOps engineer. Dashboard design matters. Error messages must say "3 invoices failed processing and need manual review", not "DLQ message count exceeded threshold on queue arn:aws:sqs:eu-west-1:123456:process-dlq." We build operator-facing dashboards in the client's existing tools where possible: a simple web interface that shows pipeline status, flags items needing attention, and lets operators retry or escalate without touching the AWS console.
Trust is earned incrementally. Many clients come to AI with justified skepticism. They have been sold AI solutions before — chatbots that hallucinated, OCR that mangled Yoruba names, classification models that miscategorized their documents. We deploy with a human-in-the-loop by default. The pipeline processes documents and presents results for human confirmation. As accuracy proves out over weeks, the client gains confidence and we progressively automate — moving from "human confirms every result" to "human reviews flagged results" to "human reviews a random sample." The pipeline architecture supports all three modes without structural changes because the validate stage can route to a human review queue at any confidence threshold.
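The three review modes can share one decision function, which is roughly why moving between them needs no structural change. This is a sketch under assumptions: the mode names, the 0.9 confidence threshold, and the 5% sample rate are invented, and the random-sample mode here reviews only the sample.

```python
import random
from enum import Enum
from typing import Callable


class ReviewMode(Enum):
    CONFIRM_ALL = "confirm_all"      # human confirms every result
    FLAGGED_ONLY = "flagged_only"    # human reviews low-confidence results
    RANDOM_SAMPLE = "random_sample"  # human reviews a random sample


def needs_human_review(mode: ReviewMode, confidence: float, threshold: float = 0.9,
                       sample_rate: float = 0.05,
                       rng: Callable[[], float] = random.random) -> bool:
    """Decide in the validate stage whether a result goes to the human review queue."""
    if mode is ReviewMode.CONFIRM_ALL:
        return True
    if mode is ReviewMode.FLAGGED_ONLY:
        return confidence < threshold
    return rng() < sample_rate
```

Switching a client from one mode to the next is then a configuration change rather than a redeployment.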
Building AI systems that work in production is not about choosing the right model. It is about building the pipeline that makes the model's output reliable, observable, and useful inside the systems your clients already run. The model is a component. The pipeline is the product.
