Custom LLM Evaluation Frameworks for Enterprise AI

An operator-grade pattern from the CreativeMinds Development (cmdev) AI engineering practice. Companion to the Amazon Bedrock for Production AI series and the prior piece on Air-Gapped LLM Deployments.

How LLM projects die at the three-month mark

The story is the same across the enterprise pilots that fail. Month one: the demo works. The model answers the test questions, the agent calls the right tools, the synthesis reads well. Month two: the team ships to a limited production audience. Most queries work, occasional weird answers get hand-corrected, and "we'll fix that in the next iteration" becomes the working motto. Month three: a customer-visible incident. The model produced a confidently wrong answer that landed in a regulated decision — a wrong refund, a missed compliance flag, a misclassified medical record, a synthesized summary that fabricated a critical line of context. The post-incident review finds no monitoring caught it. The team cannot tell the executive committee how many other queries this month went the same way. The pilot pauses. Sometimes it never restarts.

The failure mode is consistent because the cause is consistent: the team relied on vibe checks. Someone tried a dozen queries, the answers looked right, the project shipped. There was no harness, no golden set, no programmatic measurement of whether quality was holding as the model, the corpus, or the routing logic changed. The vibe-check approach scales linearly with engineering attention — and engineering attention is precisely the thing that becomes scarce when the project moves from build to operate.

The fix is eval-driven engineering: a measured loop where every change to the model, the prompts, the RAG pipeline, or the agent's tools is gated by a quality measurement against a stable test set. The harness runs in CI on every change. It runs in production continuously against a sample of live traffic. It runs as a regulatory-grade artefact for the operator's risk team. The cost of building it is one engineering sprint at the start; the cost of not building it shows up as a customer-visible incident at month three and never goes away.

This piece documents the evaluation architecture cmdev ships for regulated enterprise LLM workloads. It is the layer that converts the AI workload from a black box into a measured system. It is what makes the workload defensible to the CISO, the regulator, the audit committee, and the engineering team itself when something inevitably goes wrong.

The four kinds of evaluation that matter

A production LLM workload needs four distinct evaluation surfaces. Most teams build one of them, sometimes two. The deployments that survive 18 months in production have all four.

Type	What it measures	When it runs	Who consumes it
Offline eval	Quality on a stable golden set	On every code change (CI)	Engineering — gates merges
Online eval	Quality on live production traffic	Continuously, per-query sampling	Operations — feeds dashboards and alerts
Regression eval	Quality drift over time relative to baseline	Weekly or monthly	Engineering + risk — catches silent degradation
A/B eval	Quality of variant X vs variant Y on matched traffic	When testing a change	Product — gates rollouts

Each surface has its own metrics and its own audience. Conflating them is the most common architectural mistake — a team builds "an eval system" that mixes offline and online and ends up with one that does neither well. The mature pattern is four distinct pipelines that share a metrics vocabulary but run independently.

The metrics that actually predict production quality

The metric matters more than the model when you are buying enterprise AI. A model that performs at 92% on accuracy but produces opaque, uncitable answers is worse for regulated work than a model that performs at 87% on accuracy with full citations. The metric set we instrument for production:

For retrieval-augmented generation (RAG)

Metric	What it tells you	How we measure
Hit rate	Did the retrieval bring back the right chunk(s) for the question?	Per-query: was the expected source chunk in the top-K retrieved?
Mean reciprocal rank (MRR)	How high did the right chunk rank?	Per-query: 1 / position of the first expected chunk in the retrieved set (0 if absent), averaged across queries
Context precision	Of retrieved chunks, what fraction are actually relevant?	LLM-as-judge over each retrieved chunk against the query
Context recall	Of the chunks needed to fully answer, what fraction were retrieved?	LLM-as-judge against the golden answer's source set
Faithfulness	Does the answer follow from the retrieved context, or did the model hallucinate?	LLM-as-judge: extract claims from answer, check each is supported by context
Answer relevancy	Does the answer address the question that was asked?	Embedding similarity between question and answer summary
Citation accuracy	Do the citations in the answer point to chunks that actually support each claim?	Per-claim check against per-citation chunk
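The two rank-based metrics in the table need no judge model and can be sketched in a few lines. This is a minimal illustration, assuming chunks are identified by string IDs and that top-K defaults to 5 (both assumptions; the function names are illustrative, not the production helpers).

```python
def hit_rate(retrieved: list[str], expected: list[str], k: int = 5) -> float:
    """1.0 if any expected chunk ID appears in the top-k retrieved IDs, else 0.0."""
    top_k = retrieved[:k]
    return 1.0 if any(chunk_id in top_k for chunk_id in expected) else 0.0


def reciprocal_rank(retrieved: list[str], expected: list[str]) -> float:
    """1 / rank of the first expected chunk (ranks start at 1); 0.0 if none was retrieved."""
    expected_ids = set(expected)
    for position, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in expected_ids:
            return 1.0 / position
    return 0.0


def mean_over_queries(per_query_scores: list[float]) -> float:
    """Average per-query scores into the set-level number (hit rate or MRR)."""
    if not per_query_scores:
        return 0.0
    return sum(per_query_scores) / len(per_query_scores)


# Example: the expected chunk "c2" was retrieved in second position.
retrieved_ids = ["c7", "c2", "c9"]
print(hit_rate(retrieved_ids, ["c2"]))         # 1.0
print(reciprocal_rank(retrieved_ids, ["c2"]))  # 0.5
```

The "mean" in MRR only appears once the per-query reciprocal ranks are averaged over the golden set, which is what `mean_over_queries` does.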

For agent workloads

Metric	What it tells you	How we measure
Tool call accuracy	Did the agent call the right tool for the task?	Compare called tool sequence to expected sequence per scenario
Tool argument accuracy	Did the agent call the tool with correct arguments?	Per-call: structural check against expected argument shape
Goal completion	Did the agent finish the task?	Per-scenario: terminal state matches expected state
Turn count	How many model-tool turns did the agent take?	Per-scenario: histogram tracked against baseline
Steering recovery rate	Of induced errors, how many did the agent recover from?	Per-injected-error scenario: did the agent self-correct?
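As a rough sketch of the first two agent metrics, the comparison can be done position by position against the expected tool sequence. The `ToolCall` type, the position-wise matching rule, and the tool names in the example are all assumptions made for illustration; a real scenario format may allow reordering or optional calls.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    name: str
    arguments: dict = field(default_factory=dict)


def tool_call_accuracy(actual: list[ToolCall], expected: list[ToolCall]) -> float:
    """Fraction of expected positions where the agent called the expected tool.

    Calls made beyond the expected sequence length are ignored here.
    """
    if not expected:
        return 1.0 if not actual else 0.0
    matches = sum(1 for a, e in zip(actual, expected) if a.name == e.name)
    return matches / len(expected)


def argument_shape_matches(actual: ToolCall, expected: ToolCall) -> bool:
    """Structural check: same argument names and same value types; values are not compared."""
    if actual.arguments.keys() != expected.arguments.keys():
        return False
    return all(
        type(actual.arguments[key]) is type(expected.arguments[key])
        for key in expected.arguments
    )


# Hypothetical refund scenario: the agent got the first call right, the second wrong.
expected_calls = [
    ToolCall("lookup_order", {"order_id": "A1"}),
    ToolCall("issue_refund", {"order_id": "A1", "amount": 10.0}),
]
actual_calls = [
    ToolCall("lookup_order", {"order_id": "B7"}),
    ToolCall("send_email", {"to": "customer"}),
]
print(tool_call_accuracy(actual_calls, expected_calls))  # 0.5
```

Note that the shape check passes for the first call even though the order ID differs. Checking argument values, where the scenario pins them, would be a separate and stricter test.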

Cross-cutting metrics that always matter

Latency distribution: P50, P95, P99 per query class. Latency is a quality signal because, for many use cases, a correct answer that arrives too late is as unusable as a wrong one.
Cost per query: input + output tokens × model rate, aggregated per workload and per query class. Catches the runaway-cost regression before the bill does.
Refusal rate: fraction of queries where the model declines to answer or returns "I don't know." Worth tracking because refusal can creep in silently as Guardrails policies tighten.
Confidence calibration: when the model expresses confidence, how well does that confidence predict correctness? Critical when downstream decisions act on the stated confidence.
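Three of these cross-cutting numbers reduce to short calculations. The sketch below is illustrative only: it uses a nearest-rank percentile, caller-supplied token rates (the rates in the example are placeholders, not real prices), and expected calibration error as one possible way to quantify calibration, which is a choice made here rather than something the metric list prescribes.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_per_query(input_tokens: int, output_tokens: int,
                   usd_per_1k_input: float, usd_per_1k_output: float) -> float:
    """Token counts multiplied by per-1,000-token rates supplied by the caller."""
    return (input_tokens / 1000) * usd_per_1k_input + (output_tokens / 1000) * usd_per_1k_output


def expected_calibration_error(confidences: list[float], correct: list[bool],
                               bins: int = 10) -> float:
    """Size-weighted mean gap between stated confidence and observed accuracy, per confidence bin."""
    total = len(confidences)
    if total == 0:
        return 0.0
    buckets: list[list[tuple[float, bool]]] = [[] for _ in range(bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * bins), bins - 1)
        buckets[index].append((conf, ok))
    error = 0.0
    for bucket in buckets:
        if not bucket:
            continue
        mean_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        error += (len(bucket) / total) * abs(mean_conf - accuracy)
    return error


latencies_ms = list(range(1, 101))   # synthetic latencies: 1..100 ms
print(percentile(latencies_ms, 95))  # 95
```

A model that says "90% confident" on four answers and gets three right has a calibration gap of roughly 0.15 under this measure; a well-calibrated model trends toward zero.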

The metric set that actually works is opinionated and short. A team that instruments fifteen RAG metrics learns less than a team that instruments six and acts on them. The metrics above are the short list per workload type that we have found genuinely predict production quality across deployments.

The golden set — building it, maintaining it, retiring it

The harness is only as good as the golden set it runs against. Three rules that hold up:

1. Source from real users, not engineering imagination. The most common golden set failure mode is a test set that the engineering team wrote — full of plausible-but-not-actual queries. The model passes the engineering test set and fails on production traffic because the distributions don't match. The right move is to sample live traffic (anonymised), have the SME team or a small panel of users label expected outputs, and use that as the golden set. Refresh quarterly by pulling new traffic samples.

2. Cover the edge cases on purpose. A golden set that mirrors production traffic naturally under-represents edge cases — the rare queries where the model fails most spectacularly. Augment the golden set with deliberately difficult cases: ambiguous queries, queries that should produce a refusal, queries that test specific Guardrails policies, queries from low-volume topic clusters. The eval harness reports per-segment, so the team sees both "median production quality" and "edge case quality" as distinct numbers.

3. Version it, expire it, treat it like code. The golden set is in version control, gets PR review, has changelog entries when items are added or removed. Items get marked stale when the underlying ground-truth changes (a policy update means the old golden answer is no longer correct). A golden set that nobody touches is one that no longer reflects production reality.
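Rules 2 and 3 both show up in how results are reported. A minimal sketch, assuming each result row is a plain dict carrying a `segment` label and each golden item may carry a `stale` flag (both field conventions are assumptions for illustration):

```python
from collections import defaultdict
from statistics import mean


def active_items(items: list[dict]) -> list[dict]:
    """Drop golden items marked stale so they no longer count toward any metric."""
    return [item for item in items if not item.get("stale", False)]


def per_segment_means(rows: list[dict], metric: str) -> dict[str, float]:
    """Mean of one metric per golden-set segment."""
    by_segment: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        by_segment[row["segment"]].append(row[metric])
    return {segment: mean(values) for segment, values in by_segment.items()}


rows = [
    {"segment": "production-sample", "faithfulness": 1.0},
    {"segment": "production-sample", "faithfulness": 0.5},
    {"segment": "edge-cases", "faithfulness": 0.0},
]
print(per_segment_means(rows, "faithfulness"))
# {'production-sample': 0.75, 'edge-cases': 0.0}
```

Reporting the two segments separately keeps a weak edge-case score from being averaged away by the much larger production sample.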

The right size for a production golden set is 300-500 items for most workloads. Below 100, statistical noise overwhelms the signal. Above 1,000, the eval-run cost becomes a barrier to running it frequently. The 300-500 band is the sweet spot for daily CI-grade evaluation.
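The size argument can be made concrete with a back-of-envelope calculation. The sketch below treats a metric as a simple pass rate over independent items and uses the normal approximation for a 95% interval; both are simplifying assumptions, so read the numbers as orders of magnitude.

```python
import math


def ci_half_width(pass_rate: float, n_items: int, z: float = 1.96) -> float:
    """Half-width of the normal-approximation 95% interval for a pass rate over n_items."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n_items)


for n in (100, 400, 1000):
    print(n, round(ci_half_width(0.90, n), 3))
# 100 0.059
# 400 0.029
# 1000 0.019
```

At 100 items a 90% pass rate carries roughly ±6 points of noise, so a 3-point regression is indistinguishable from chance. At 400 items the band narrows to about ±3 points, and going to 1,000 items buys only one more point of precision for 2.5 times the eval cost.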

The harness — what it actually looks like in production

The architecture of a working LLM eval harness:

# eval/harness.py — the minimal shape

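from dataclasses import dataclass
from typing import Callable
import time  # needed for time.monotonic() in run_eval
import boto3

# The metric helpers called in run_eval (hit_rate, mean_reciprocal_rank,
# llm_judge_faithfulness, embedding_similarity, citation_accuracy_check)
# are assumed to be defined elsewhere in the eval package.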

@dataclass
class GoldenItem:
    id: str
    query: str
    expected_answer: str
    expected_chunks: list[str]
    segment: str  # for per-segment reporting
    difficulty: str  # easy / standard / hard / adversarial

@dataclass
class EvalResult:
    item_id: str
    actual_answer: str
    retrieved_chunks: list[str]
    metrics: dict[str, float]
    latency_ms: int
    cost_usd: float

def run_eval(
    golden_set: list[GoldenItem],
    pipeline: Callable[[str], dict],  # the system under test
    judge_model_id: str = "anthropic.claude-sonnet-4-6-20251022",
) -> list[EvalResult]:
    """Run the pipeline against each golden item, score with LLM-as-judge."""
    results = []
    for item in golden_set:
        start = time.monotonic()
        pipeline_output = pipeline(item.query)
        latency_ms = int((time.monotonic() - start) * 1000)

        metrics = {
            "hit_rate": hit_rate(pipeline_output["chunks"], item.expected_chunks),
            "mrr": mean_reciprocal_rank(pipeline_output["chunks"], item.expected_chunks),
            "faithfulness": llm_judge_faithfulness(
                question=item.query,
                answer=pipeline_output["answer"],
                context=pipeline_output["chunks"],
                judge_model=judge_model_id,
            ),
            "relevancy": embedding_similarity(item.query, pipeline_output["answer"]),
            "citation_accuracy": citation_accuracy_check(
                answer=pipeline_output["answer"],
                citations=pipeline_output["citations"],
                chunks=pipeline_output["chunks"],
            ),
        }
        results.append(EvalResult(
            item_id=item.id,
            actual_answer=pipeline_output["answer"],
            retrieved_chunks=pipeline_output["chunks"],
            metrics=metrics,
            latency_ms=latency_ms,
            cost_usd=pipeline_output["cost_usd"],
        ))
    return results

The shape is intentionally minimal. The complexity lives in the individual metric implementations and the LLM-as-judge prompts, not in the harness skeleton. The harness produces a structured result per item; the dashboard layer aggregates and trends those results.

The harness runs in three contexts:

In CI on every PR — gates merge if any metric regresses beyond a configured threshold against the baseline branch. Typical run: 300 items, 4-6 minutes wall clock, $1-3 in evaluation cost.
In production, sampling live traffic at a 1-5% sample rate. The same metrics are computed on shadow-evaluated outputs: the judge never sits in the user-facing request path and instead scores the sampled outputs offline.
In a regulatory artefact, monthly, against the full golden set, producing a signed PDF report with the metric trends and any drift findings. This is the artefact a CISO hands to an examiner.
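The CI gate in the first context amounts to comparing aggregate metrics from the candidate branch against the baseline. A minimal sketch, assuming all gated metrics are "higher is better" and that thresholds are expressed as the maximum allowed absolute drop (the threshold values below are illustrative, not recommended settings):

```python
def find_regressions(baseline: dict[str, float], candidate: dict[str, float],
                     max_drop: dict[str, float]) -> dict[str, float]:
    """Return each gated metric whose score dropped by more than its allowed threshold."""
    regressions = {}
    for metric, allowed in max_drop.items():
        drop = baseline[metric] - candidate[metric]
        if drop > allowed:
            regressions[metric] = drop
    return regressions


baseline_scores = {"faithfulness": 0.92, "hit_rate": 0.88}
candidate_scores = {"faithfulness": 0.85, "hit_rate": 0.87}
thresholds = {"faithfulness": 0.02, "hit_rate": 0.02}

failed = find_regressions(baseline_scores, candidate_scores, thresholds)
print(sorted(failed))  # ['faithfulness']
# In a CI job, a non-empty result would translate into a non-zero exit code.
```

Metrics where lower is better, such as latency or cost, would need the sign of the comparison reversed.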

LLM-as-judge — done properly

LLM-as-judge is the trick that makes RAG and agent evaluation possible at scale. The naïve version — "ask GPT/Claude to score this answer 1-10" — is unreliable. The version that works:

1. Structured rubric. The judge prompt specifies exactly what to score on, with anchored examples for each score band. Not "is this answer good?" but "for each claim in the answer, is it supported by the provided context? Score each claim independently."

2. Decomposition. Score components separately and combine programmatically. Faithfulness is computed per-claim, then aggregated. Citation accuracy is per-citation. The judge doesn't return a holistic score; it returns structured per-component data the harness aggregates.

3. Calibration against a human-labelled subset. A sample of golden items (typically 50-100) gets human SME labels for each metric. The judge model's outputs are compared against the human labels, and the calibration delta is tracked. If the judge drifts from human alignment, the rubric needs tuning or the judge model needs replacing.

4. Use a stronger model for judging than for the pipeline. Claude Sonnet for production reasoning; Claude Opus for the judge. The cost is acceptable because the judge runs on the golden set (300-500 items) and on a small sample of production traffic, not on every production query. The quality difference matters because the judge defines the quality bar.

5. Cost-aware sampling. For online eval, judge a 1-5% sample of production traffic rather than every query. The sampling is stratified by query class so rare query types don't drop out of the measurement.
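The calibration step in point 3 needs a number to track. One reasonable choice, which is an assumption of this sketch rather than a prescribed method, is raw agreement plus Cohen's kappa between judge labels and human SME labels on the same items; kappa corrects for the agreement two raters would reach by chance.

```python
from collections import Counter


def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label lists must be non-empty and the same length")
    same = sum(1 for j, h in zip(judge_labels, human_labels) if j == h)
    return same / len(judge_labels)


def cohens_kappa(judge_labels: list[str], human_labels: list[str]) -> float:
    """Agreement corrected for chance: (observed - expected) / (1 - expected)."""
    n = len(judge_labels)
    observed = agreement_rate(judge_labels, human_labels)
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(
        (judge_counts[label] / n) * (human_counts[label] / n)
        for label in set(judge_counts) | set(human_counts)
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


judge = ["supported", "supported", "unsupported", "supported"]
human = ["supported", "unsupported", "unsupported", "supported"]
print(agreement_rate(judge, human))  # 0.75
```

Tracking this value per eval run on the human-labelled subset gives the "calibration delta" a concrete definition: the change in agreement or kappa relative to the last accepted calibration.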

A working judge prompt for faithfulness:

You are evaluating whether an AI assistant's answer is faithful to its provided context.

Question: {question}
Context:
{retrieved_chunks}
Answer: {answer}

Extract every factual claim from the answer (claims, not opinions or hedges).
For each claim:
1. State the claim verbatim.
2. Identify the chunk(s) in context that support it (or "none").
3. Score: "supported" / "partially supported" / "unsupported" / "contradicted".

Return JSON:
{
  "claims": [
    {"claim": "...", "supporting_chunk_ids": [...], "score": "..."},
    ...
  ]
}

The harness aggregates: the faithfulness score per item is the number of claims marked "supported" divided by the total number of claims. Averaged over the golden set, this becomes the per-eval-run faithfulness metric the dashboard tracks.
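The aggregation step is small enough to show in full. This sketch parses the JSON shape from the prompt above and counts only claims scored exactly "supported". Treating an answer with no factual claims as fully faithful is an assumption of this sketch; a team may prefer to exclude such items instead.

```python
import json


def faithfulness_score(judge_response: str) -> float:
    """Fraction of extracted claims the judge marked exactly 'supported'."""
    claims = json.loads(judge_response)["claims"]
    if not claims:
        return 1.0  # assumption: no factual claims means nothing unfaithful
    supported = sum(1 for claim in claims if claim["score"] == "supported")
    return supported / len(claims)


example_response = json.dumps({
    "claims": [
        {"claim": "A", "supporting_chunk_ids": ["c1"], "score": "supported"},
        {"claim": "B", "supporting_chunk_ids": ["c2"], "score": "supported"},
        {"claim": "C", "supporting_chunk_ids": ["c2"], "score": "partially supported"},
        {"claim": "D", "supporting_chunk_ids": [], "score": "contradicted"},
    ]
})
print(faithfulness_score(example_response))  # 0.5
```

Under this rule a "partially supported" claim counts against the score, which is the strict reading of the aggregation described above.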

Continuous evaluation in production — the loop that catches drift

The hardest production failure mode is silent quality drift. Nobody on the team deliberately changed the model, the prompts, or the corpus, but quality drops anyway. Causes include retrieval changes (a KB refresh changed what gets retrieved, sometimes via RAG poisoning), upstream model updates (Bedrock catalogue moves), changed user query distributions (a new product line creates new question patterns), and Guardrails policy tuning (a tighter policy increases refusal rates).

The pattern that catches drift:

Daily golden-set rerun in production environment, results compared against trailing 30-day baseline. Alarms on any metric regressing more than 2 standard deviations.
Shadow evaluation on 1-5% of live traffic, structured the same way as the golden set eval. Catches drift on query patterns the golden set doesn't cover.
Per-segment trending so regressions show up in specific user segments before they affect the headline metric.
Causal attribution — when a regression alarm fires, the harness reports which changes (deploys, KB ingestions, model version updates) correlate with the regression window. The on-call engineer gets a starting point, not just an alert.
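The alarm rule and the attribution lookup can be sketched as two small functions. This is a simplified illustration: it alarms only on drops, uses the sample standard deviation of the trailing baseline, and models change events as hypothetical (timestamp, label) pairs.

```python
from datetime import datetime
from statistics import mean, stdev


def drift_alarm(baseline_scores: list[float], today: float, sigmas: float = 2.0) -> bool:
    """True if today's score is more than `sigmas` standard deviations below the baseline mean."""
    if len(baseline_scores) < 2:
        return False  # not enough history to estimate spread
    mu = mean(baseline_scores)
    sd = stdev(baseline_scores)
    if sd == 0:
        return today < mu
    return (mu - today) / sd > sigmas


def changes_in_window(events: list[tuple[datetime, str]],
                      start: datetime, end: datetime) -> list[str]:
    """Labels of deploys, KB ingestions, or model updates recorded inside the regression window."""
    return [label for when, label in events if start <= when <= end]


trailing = [0.90, 0.92, 0.91, 0.89, 0.93]  # shortened stand-in for a 30-day baseline
print(drift_alarm(trailing, today=0.85))    # True
print(drift_alarm(trailing, today=0.90))    # False
```

Correlation in time is only a starting point for the on-call engineer; it narrows the list of suspects without proving which change caused the drop.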

The loop runs in the customer's account, on the customer's data, with results landing in the customer's S3 audit bucket. It is part of the air-gapped deployment from the prior article, not a separate cloud service.

The friction points — what bites in real deployments

Five frictions cmdev engineers have hit and engineered past:

1. Getting good golden-set labels at scale

The single biggest implementation blocker is producing high-quality expected outputs for 300-500 items. SMEs are busy; answers written by engineers reflect how engineers think rather than how SMEs think; outsourcing produces uneven quality.

The pattern that works: a two-pass labelling protocol. Pass 1: an LLM (Claude Sonnet, with the company's documentation as context) generates draft expected answers for each query. Pass 2: an SME reviews and either approves or corrects. SME time per item drops from ~10 minutes (writing from scratch) to ~2 minutes (reviewing the draft), and the corrections become training data for the next round of draft generation. A 400-item golden set becomes a 2-3 day SME project rather than a 2-3 week one.

2. LLM-as-judge cost surprises

A 400-item eval with claim-decomposition and per-citation checks can issue 10-20 judge calls per item. At Claude Opus rates, a single eval run costs $20-50. Running daily and on every PR adds up.

The cost optimisations: use Claude Haiku as a first-pass classifier to skip detailed judging on obviously-correct cases (high-similarity-to-golden-answer items); cache judge results when the pipeline output is identical to a previously-judged output (common during dev cycles); use batch inference (per Bedrock Series Part 7) for non-blocking eval runs. The combination drops eval cost by ~70% without measurable quality regression on the metric trend.
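Of the three optimisations, caching is the easiest to illustrate. The sketch below keys the cache on a hash of everything the judge sees plus the judge version, so a judge change invalidates old entries. The in-memory dict and the judge callable's signature are assumptions; a real deployment would likely persist the cache between runs.

```python
import hashlib
import json
from typing import Callable


class JudgeCache:
    """Skips the judge call when an identical input was already scored by the same judge version."""

    def __init__(self, judge: Callable[[str, str, list[str]], dict]):
        self._judge = judge
        self._store: dict[str, dict] = {}
        self.hits = 0

    @staticmethod
    def _key(question: str, answer: str, chunks: list[str], judge_version: str) -> str:
        payload = json.dumps([question, answer, chunks, judge_version], ensure_ascii=False)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def score(self, question: str, answer: str, chunks: list[str], judge_version: str) -> dict:
        key = self._key(question, answer, chunks, judge_version)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = self._judge(question, answer, chunks)
        self._store[key] = result
        return result


def fake_judge(question: str, answer: str, chunks: list[str]) -> dict:
    """Stand-in for a real judge call, used only to demonstrate the cache."""
    return {"faithfulness": 1.0}


cache = JudgeCache(fake_judge)
cache.score("q", "a", ["c1"], "judge-v1")
cache.score("q", "a", ["c1"], "judge-v1")  # served from cache
print(cache.hits)  # 1
```

During development cycles, where most pipeline outputs are unchanged between runs, the hit rate of a cache like this is what drives the saving.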

3. The "judge model drifts when Bedrock updates it" problem

The judge model is a moving target — Bedrock's catalogue updates, model versions get deprecated, and a judge model that was calibrated to humans six months ago may not be calibrated to the same standard today. We hit this in a production deployment where faithfulness scores trended upward over a quarter — not because the pipeline improved, but because the judge became more permissive after a model version change.

The fix: pin the judge model version, just like the pipeline model. When the pinned judge approaches deprecation, run a parallel calibration against the successor on the labelled subset, document the calibration delta, and switch only when the calibration is documented. Track the judge version in every eval-run metadata record so historical trends are interpretable.
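A sketch of what "track the judge version in every eval-run metadata record" could look like, together with a simple switch check. The field names, the placeholder model identifiers, and the 0.02 tolerance are all hypothetical choices for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EvalRunMetadata:
    run_id: str
    pipeline_model_id: str   # pinned version of the system under test
    judge_model_id: str      # pinned version of the judge
    golden_set_version: str  # e.g. a git tag or commit hash


def safe_to_switch(current_agreement: float, successor_agreement: float,
                   max_delta: float = 0.02) -> bool:
    """True if the successor's human-agreement is within max_delta of the current judge's."""
    return abs(current_agreement - successor_agreement) <= max_delta


meta = EvalRunMetadata(
    run_id="run-0001",
    pipeline_model_id="pinned-pipeline-model-v1",  # placeholder, not a real model ID
    judge_model_id="pinned-judge-model-v1",        # placeholder, not a real model ID
    golden_set_version="golden-v12",
)
print(json.dumps(asdict(meta)))
print(safe_to_switch(0.91, 0.90))  # True
```

Storing the record alongside each run's results means a later jump in faithfulness can be checked against a judge change before anyone credits the pipeline.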

4. Evaluating multimodal outputs

When the model produces images, charts, or formatted documents, text-based eval metrics break down. There is no reliable text-only LLM-as-judge approach for "did the chart correctly visualise the data."

The patterns that work: deterministic checks on structured outputs (schema validation, range checks, expected-element-present checks); LLM-as-judge against extracted text from images for charts with labels; human-spot-check sampling for genuinely visual outputs. The honest answer is that multimodal eval is harder and we accept a higher human-in-the-loop burden for those workloads.
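The deterministic checks can be sketched for the case where the pipeline emits a structured chart specification before rendering. That intermediate spec, and its field names, are assumptions of this example; the idea is only that element-present and value checks run on structured data rather than on pixels.

```python
def check_chart_spec(spec: dict, source_values: list[float]) -> list[str]:
    """Return human-readable failures; an empty list means every check passed."""
    failures = []
    # Expected-element-present checks.
    for key in ("title", "x_label", "y_label", "series"):
        if key not in spec:
            failures.append(f"missing field: {key}")
    # The plotted series must match the source data it claims to visualise.
    series = spec.get("series", [])
    if len(series) != len(source_values):
        failures.append("series length differs from source data")
    elif any(abs(a - b) > 1e-9 for a, b in zip(series, source_values)):
        failures.append("series values differ from source data")
    return failures


good_spec = {"title": "Refunds", "x_label": "Month", "y_label": "Count", "series": [3, 5, 8]}
bad_spec = {"x_label": "Month", "y_label": "Count", "series": [3, 5, 9]}
print(check_chart_spec(good_spec, [3, 5, 8]))  # []
print(check_chart_spec(bad_spec, [3, 5, 8]))
# ['missing field: title', 'series values differ from source data']
```

Checks like these catch data errors but say nothing about whether the rendered chart is readable, which is where the human spot-check remains necessary.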

5. The drift alarm that nobody acts on

The hardest organisational problem is not building the harness; it is building the response loop. A drift alarm at 2 AM that nobody triages is worse than no alarm because it normalises ignoring the signal. We have shipped harnesses that were built brilliantly, after which the team never looked at the dashboard.

The pattern that closes this: drift alarms route to the same PagerDuty / on-call channel as service-degradation alarms; the runbook for each alarm class is documented; the post-incident review when a drift alarm fires includes the same rigour as a customer-visible incident. The harness has organisational gravity, or it is decoration.

What this taught us about enterprise scaling

Five things hold up across the eval deployments cmdev has shipped:

1. The eval harness is the deployable artefact. The model is fungible — Bedrock's catalogue moves, customer preferences shift between Claude and Llama, retrieval architectures evolve. The eval harness is the layer that says "regardless of what is underneath, here is the measured quality." It is the operator's most durable AI engineering investment.

2. The first month of eval data is more valuable than the first six months of model improvements. A team that ships a working eval harness in week 4 of a deployment will outperform a team that ships a "better model" in month 6, because the team with the harness can iterate measurably. The team without it iterates blind.

3. The golden set is a regulatory artefact, not just a dev tool. When a regulator asks "how do you know your AI is producing quality outputs?" the golden set + the harness + the trended metrics are the answer. Without them, the answer is hand-waving. With them, the answer is a 12-page report that an examiner signs off in one read.

4. LLM-as-judge needs the same engineering rigour as the pipeline. Calibrated rubrics, version-pinned judge models, cost discipline, drift detection. Teams that treat the judge as an afterthought get unreliable evaluation; teams that treat it as a first-class engineering surface get the measurement they need.

5. The cost of building an eval harness is ~10% of the cost of the AI workload it measures. The cost of not building it is the cost of a customer-visible incident. That trade is always favourable. The teams that don't make it are the ones that haven't worked out the economics yet.

Engaging with cmdev

CreativeMinds Development (cmdev) is the engineering studio behind this evaluation framework. We ship measurable, audit-defensible AI for regulated enterprises in Africa and the EU — banks under CBN CSAT, energy operators under NMDPRA and NIS2, fintechs under NDPA, healthcare networks under HIPAA-equivalent regional regimes. The eval harness is part of the production architecture we deploy, not an afterthought bolt-on.

Companion architecture series: Amazon Bedrock for Production AI, Air-Gapped Bedrock, AWS-for-banks

Mayowa Adewole is CTO and Principal AI Engineer at CreativeMinds Development. He leads cmdev's AI engineering practice for regulated enterprises across Africa and the EU, with deployments in production for banking, energy, and critical-infrastructure customers.