From Discovery to Production: Building a Bedrock Document Pipeline for a London Insurance Broker

The call

The email came in on a Tuesday afternoon. A mid-sized insurance broker in London wanted to talk about AI. Their operations director had seen a demo of document extraction at an industry event and wanted to know if something similar could work for their claims department.

We got on a call. Within five minutes, the scope of the problem was clear.

The firm processes over 200 motor and property insurance claims per day. Each claim arrives as a bundle of documents: photos of the damage, a police incident report, medical records if there were injuries, the claimant's policy document, repair estimates, and sometimes receipts for towing or hospital visits. An adjuster picks up the bundle, reads every document, cross-references the claim against the policy terms, checks for fraud indicators, writes up an assessment, and passes it to a supervisor for sign-off. The supervisor either approves a payout, denies the claim, or sends it back for further investigation.

Each adjuster spends three to four hours per claim just on the reading and cross-referencing. At 200+ claims per day, the backlog was growing by the week. They had tried hiring more adjusters — experienced claims handlers are scarce and command high salaries in the London market. They had tried off-the-shelf OCR tools, but the error rates on real-world documents were unacceptable. Handwritten surveyor notes came back garbled. Low-quality phone photos of damage assessments produced unusable text. Documents arriving from EU partners in French, German, or Polish confused every tool they tested.

They wanted AI but did not know where to start. That is where we came in.

Discovery: mapping the workflow

We spent a full day onsite at the client's claims processing centre in East London. We did not bring laptops or pitch decks. We sat next to adjusters and watched them work.

The lead adjuster, a woman with eleven years in claims, pulled up a case file and walked us through it step by step. A rear-end collision on the A13 near Barking. The claimant had submitted seven documents: four photos taken on a cracked phone screen, a police incident report, a physiotherapy assessment, and the original policy document. She opened each one, read it, made notes in a spreadsheet, flipped back and forth between the police report and the policy terms, flagged a discrepancy in the reported date, checked the claimant's history in a separate system, and drafted a two-page assessment.

It took her two hours and forty minutes. She does this six to eight times per day.

We mapped the full workflow across the department:

Claim arrives (email, online portal, or post with physical documents)
Documents get scanned or uploaded
Adjuster reads every document in the bundle
Adjuster cross-references claim details against policy terms
Adjuster checks for fraud indicators — date mismatches, inflated estimates, repeat claimants
Adjuster writes a structured assessment
Supervisor reviews and approves, denies, or escalates

Steps 3 through 6 are where adjusters spend 80% of their time. Those are the steps where a machine can read faster, cross-reference more accurately, and flag inconsistencies that a human eye might miss after the fifth claim of the day. Steps 1, 2, and 7 stay human. The machine does not replace the adjuster — it gives them a first draft and a set of flags so they can focus on judgment instead of data entry.

Architecture decision: why Bedrock

We evaluated three approaches before recommending an architecture.

OpenAI API. Strong models, good at document understanding. But the client's claims contain UK citizens' personal data — National Insurance numbers, medical records, home addresses. The compliance team needed guarantees about where data is processed and stored. Under UK GDPR, sending PII to US-based endpoints without adequate safeguards raised concerns their DPO was not willing to sign off on.

Self-hosted open-source models. Full control over data residency and model behaviour. But the client has no MLOps team, no GPU infrastructure, and no appetite for maintaining model serving infrastructure. Running Llama or Mixtral on EC2 GPU instances would have introduced operational complexity that outweighed the benefits. We would have been building an MLOps practice, not a claims pipeline.

Amazon Bedrock. Managed multi-model access through a single API. Data stays within the AWS region — the client's existing infrastructure runs in eu-west-2 (London), and Bedrock is available there. Guardrails provide built-in PII redaction. Pay-per-use pricing matches their variable claim volume — higher during winter months and holiday travel season, lower in summer. Multi-model access means we can use different models for different stages of the pipeline — cheap and fast for classification, powerful and accurate for analysis.

Bedrock won on four criteria: data residency, managed infrastructure, PII guardrails, and multi-model flexibility. We presented the recommendation, the client's CTO approved it, and we moved to prototyping.

The 48-hour demo

Before committing to a five-week build, we needed to prove the concept would work on real insurance documents — not clean sample PDFs from a tutorial.

We built a working prototype in 48 hours. Three Lambda functions, one S3 bucket, one Bedrock model. The workflow was simple: upload a claim document to S3, Textract extracts the text, Claude analyses the text and generates a structured claim summary with extracted fields.

The client's compliance team provided 20 real claims with customer data redacted. We ran all 20 through the prototype. The lead adjuster — the same woman who had walked us through her workflow — reviewed every output. She confirmed that 17 out of 20 were accurate enough to be useful as a first draft. The three misses were all handwritten surveyor notes where Textract failed to produce usable text.

Seventeen out of twenty was enough. The project was greenlit.

Building the pipeline

The production pipeline took five weeks to build and deploy. Here is what it looks like.

Intake. Documents arrive through three channels: email attachments processed via Amazon SES, uploads through the client's online claims portal using presigned S3 URLs, and scanned post processed by the mailroom. Every document lands in an S3 bucket with metadata recorded in DynamoDB — source channel, claim reference number, upload timestamp, document count. Each claim gets a unique pipeline execution ID that tracks it through every subsequent stage.

Preprocessing. A Lambda function triggers on each S3 upload. Printed documents go through Textract for OCR. Handwritten documents and low-quality scans skip OCR entirely and go straight to Claude Vision, which reads directly from the image. A language detection step identifies the primary language of each document — English, French, German, or Polish — and flags documents that need translation before extraction. Cross-border claims from EU partners frequently arrive in languages other than English.

Classification. Mistral, accessed through Bedrock, classifies each document by type: police report, medical record, damage photograph, policy document, repair estimate, or receipt. This call costs roughly $0.002 per document and returns in under 200 milliseconds. The classification determines which processing template the document gets routed to next.

Extraction and analysis. Claude processes each document using a type-specific prompt template. A police report template extracts incident date, location, parties involved, and officer details. A medical record template extracts diagnosis, treatment, and prognosis. A damage photo template describes visible damage and estimates severity. Each extraction produces structured JSON. A separate Claude call then cross-references the extracted data against the client's policy database — checking coverage terms, excesses, exclusions, and policy validity dates — through an API integration with their existing policy management system.

Fraud scoring. A dedicated Claude call examines the full claim bundle after extraction is complete. It checks for date inconsistencies between documents, inflated repair estimates relative to damage severity, repeat claimant patterns, mismatched vehicle descriptions between the police report and the photos, and medical claims disproportionate to the reported incident. The output is a risk score from 1 to 10, accompanied by specific flags explaining each concern. Claims scoring above 7 are automatically routed to the client's fraud investigation unit.

Output. The pipeline generates a structured claim assessment: a plain-language summary, a coverage determination with specific policy clause references, a fraud risk score with flags, and a recommended action — approve, deny, or escalate. This assessment lands in the adjuster's dashboard. The adjuster reviews it, makes corrections if needed, and either signs off or sends it back for reprocessing. The supervisor sees the adjuster's final version alongside the AI-generated draft, with differences highlighted.

What went wrong

The first week in production was rough.

Handwritten surveyor notes failed at a 30% rate. Textract could not reliably read the handwriting — inconsistent letterforms, faded ink, photographs taken at odd angles. We had known this was a risk from the prototype phase, but 30% was worse than expected. The fix was straightforward: we stopped running handwritten documents through OCR entirely. Claude Vision reads directly from the image and produces usable text even from poor-quality photos of cursive handwriting. Accuracy on handwritten notes jumped from 70% to 91% after the switch.

Foreign-language documents from EU partners caused extraction failures. Our prompt templates assumed English-language input, and when the extraction model encountered French or German text, it either hallucinated translations or returned incomplete data. We added a translation step before extraction — Claude translates the document to English first, then a second call runs the extraction template on the translated text. Adding a step added latency and cost, but the accuracy improvement was significant.

The fraud scoring model flagged 40% of legitimate claims in the first week. Our initial prompt was too aggressive — it treated any minor inconsistency as a fraud signal. A date written as "12/01" on the police report and "12 January" on the medical record triggered a flag. A repair estimate that was 15% above the average for that vehicle model triggered a flag. Adjusters spent more time dismissing false positives than they saved on the rest of the pipeline. We tuned the prompt using 200 labelled examples from the client's historical claims data — 150 legitimate claims and 50 confirmed fraud cases. After tuning, the false positive rate dropped from 40% to 8%.

Cost spiked unexpectedly. We were sending full policy documents — some over 50 pages — to Claude for every cross-reference call. At $0.003 per 1,000 input tokens, a 50-page policy document costs roughly $0.45 per call, and each claim required multiple cross-reference passes. In the first week, the pipeline cost hit $2.10 per claim, five times our target. The fix was a RAG pipeline: we chunked and embedded every policy document using Titan Embeddings, stored the vectors in an OpenSearch index, and retrieved only the relevant sections per claim. A motor accident claim now pulls the three or four policy clauses that apply, not the entire document. Token usage dropped by 70%, and per-claim cost fell to $0.42.

Results

After eight weeks in production, the numbers told the story.

Claims processing time fell from an average of 5 days to 6 hours. The bottleneck shifted from adjuster capacity to supervisor review — a much better problem to have.

Each adjuster now handles three times more claims per day. They spend their time reviewing AI-generated assessments, applying judgment on edge cases, and handling escalations — not reading and transcribing documents.

Accuracy held up: 94% of AI-generated assessments were accepted by adjusters without modification. The remaining 6% required minor corrections, mostly on claims involving unusual policy terms or atypical damage descriptions.

Pipeline cost settled at $0.42 per claim. The client's internal estimate of adjuster labour cost for fully manual processing was $18 per claim. The maths is not subtle.

The fraud scoring module identified three previously undetected patterns in its first month — a cluster of claims from a single repair garage with inflated estimates, a repeat claimant using slight name variations across policies, and a set of staged accident reports with identical damage descriptions filed weeks apart. The client's fraud unit confirmed all three and opened investigations.

The pipeline was originally built for motor insurance claims only. Within two months of go-live, the client expanded it to property and health claims, each requiring new document type templates and extraction prompts but running on the same underlying architecture.

What we would do differently

Three lessons from this engagement that changed how we approach subsequent projects.

Start with RAG from day one. We bolted on the policy document retrieval pipeline after cost overruns forced the issue. If we had designed for retrieval-augmented generation from the beginning, we would have avoided the cost spike, the emergency refactoring, and the three days of downtime while we migrated to the new architecture. Every pipeline that touches long reference documents should assume RAG from the start.

Build the monitoring dashboard before go-live. For the first week in production, we had CloudWatch logs and nothing else. When the fraud model started flagging 40% of claims, we did not know until adjusters started complaining. When costs spiked, we did not catch it until the daily AWS billing alert. A real-time dashboard showing per-claim cost, processing time, error rates, and fraud flag rates would have caught both problems on day one. We now ship a Grafana dashboard as part of every pipeline deployment.

Involve domain experts earlier in prompt tuning. We wrote the initial extraction and fraud prompts based on our understanding of insurance claims processing — which, despite the onsite discovery day, was shallow compared to the adjusters' decade of experience. When we finally sat down with the lead adjuster to review prompt outputs line by line, she caught errors we had missed entirely: a medical term we were extracting incorrectly, a fraud indicator that was standard practice in the London repair market, a policy clause structure unique to UK motor insurance. The adjusters should have been in the room during prompt development, not brought in after deployment for quality review.

The pipeline runs today. The firm processes more claims, faster, with fewer errors. The adjusters still make every final decision. The AI handles the reading so they can focus on the thinking. That was always the point.