The email that almost went out
An agent is processing a backlog of customer support tickets. The seventh ticket of the morning is a refund request — straightforward, the kind the agent has handled hundreds of times. The agent reads the ticket, queries the order database, calculates the refund amount, drafts the email, and is one tool call away from sending it.
Except the customer ID in the ticket was malformed. The agent's retrieval pulled the wrong order. The refund amount is ten times the correct figure. The email is addressed to a customer who never asked for a refund. Every step the agent took was internally consistent, and every inference looked sound. One send_email invocation separates this draft from a real event, one the company would have to phone the customer to walk back.
What stops that email from going out is not a better prompt and not a smarter model. It is an approval gate — a layer between the agent's intent and the world, designed so that the actions that matter pause for a check, while the actions that do not keep moving.
This is the piece of the architecture most teams skip. The result is one of two failure modes, and both are bad.
The two failure modes that bracket the design space
The first failure mode is no approval gates at all. The agent runs free. Every tool call it decides to make executes immediately. For a while, this looks like productivity — the agent ships work, the velocity numbers look good, the demos go well. Then the first incident happens. An email goes to the wrong recipient, a record is overwritten in production, a payment is issued for the wrong amount, a customer's data is deleted because the agent matched a stale identifier. The incident review reveals that the agent's reasoning chain was confident, internally consistent, and wrong. The team adds a layer of "human review" after the fact, but the damage is already done and trust in the agent collapses.
The second failure mode is human approval on every action. The team, chastened, requires a person to sign off on every tool call the agent makes. For a week, this feels safe. By week two, the reviewers are rubber-stamping requests faster than they can read them. By week three, they are reviewing a thousand actions a day across multiple agents, none of which they have meaningful context on. The agent's velocity has collapsed, the reviewers are exhausted, and the approval queue has become exactly the thing the agent was supposed to eliminate. Worst of all, the approval has become theater — the reviewers approve so reflexively that the next bad action goes through anyway, with a human's name attached to it.
The mature pattern sits between these. Not every action is dangerous. Not every action is safe. The job of the approval layer is to classify, route, and surface — automating the safe path, validating the medium path, and reserving humans for the actions where their judgment actually changes outcomes.
The risk classification layer
Every tool in the agent's manifest is tagged with a risk level at design time, not at runtime. This is a deliberate choice. Risk is a property of the action, not the prompt, and a model deciding its own risk classification is exactly the kind of self-graded homework that does not survive an adversarial input.
The classification we use has three tiers.
LOW-RISK actions are read-only, idempotent, or strictly internal. Querying a database, searching a document store, fetching a record, looking up a calendar, reading a thread of past messages. These execute automatically. There is no policy check, no human approval, no queue. The agent calls the tool, gets the result, and continues. If the agent does a thousand of these in a workflow, that is fine — none of them touch the world in a way that cannot be undone or that costs anything beyond compute.
MEDIUM-RISK actions are writes to internal systems where the blast radius is contained and the action is reversible or auditable. Updating an internal record, creating a draft email, scheduling an internal calendar event, writing a note to a CRM, adjusting a workflow status. These do not execute automatically. They pass through a policy engine, which evaluates the action against deterministic rules. If the rules pass, the action executes without a human in the loop. If the rules fail, the action is either denied or escalated to a human depending on the failure type.
HIGH-RISK actions are external communications, irreversible operations, financial actions, and anything that touches a customer or a production system in a way that cannot be undone with a follow-up call. Sending an external email, issuing a refund above a threshold, deleting a record, modifying a production database, posting to a public channel, triggering a payment. These always require explicit human approval. There is no path around this — the policy engine cannot approve them, and no amount of clean track record by the agent changes the rule.
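The three tiers can be read as a routing table. The sketch below is illustrative only: the `Risk` enum, the `route` function, and the verdict strings are hypothetical names, and the policy check is passed in as a callable so the example stays self-contained.

```python
from enum import Enum
from typing import Callable


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def route(risk: Risk, policy_verdict: Callable[[], str]) -> str:
    """Return the path an action takes. This never executes the tool itself.

    policy_verdict is a hypothetical callable that returns "pass", "deny",
    or "escalate". It is only consulted for medium-risk actions.
    """
    if risk is Risk.LOW:
        return "execute"  # no policy check, no human, no queue
    if risk is Risk.MEDIUM:
        verdict = policy_verdict()
        return {
            "pass": "execute",
            "deny": "deny",
            "escalate": "await_human_approval",
        }[verdict]
    # High risk: the policy engine cannot approve, so it is not even asked.
    return "await_human_approval"
```

Note that the high-risk branch ignores the policy verdict entirely, which mirrors the rule that no track record or rule outcome can approve a high-risk action on its own.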
The classification is encoded in the tool definition itself, not in a separate config file that drifts. When the Salesforce MCP server exposes update_record, the tool descriptor carries risk: medium. When the Gmail server exposes send_email, the descriptor carries risk: high for external recipients and risk: medium for internal ones, with the distinction made at the tool layer by checking the recipient's domain against the tenant's domain. The agent never sees these tags — they are infrastructure, applied to the agent's outbound tool calls before execution.
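One way to picture a risk tag that lives in the tool definition is sketched below. The `ToolDescriptor` shape, the `MANIFEST` dictionary, and the `to` parameter name are assumptions made for illustration, not the descriptor format of any particular MCP server.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping


@dataclass(frozen=True)
class ToolDescriptor:
    name: str
    # Resolves the risk tier from the call parameters and the tenant's domain.
    risk_for: Callable[[Mapping[str, Any], str], str]


def email_risk(params: Mapping[str, Any], tenant_domain: str) -> str:
    """Medium only if every recipient is on the tenant's own domain."""
    recipients = [str(addr) for addr in params.get("to", [])]
    internal = bool(recipients) and all(
        "@" in addr and addr.rsplit("@", 1)[1].lower() == tenant_domain.lower()
        for addr in recipients
    )
    return "medium" if internal else "high"


# Hypothetical manifest: the tag travels with the tool, not in a side config.
MANIFEST = {
    "update_record": ToolDescriptor("update_record", lambda params, tenant: "medium"),
    "send_email": ToolDescriptor("send_email", email_risk),
}


def risk_of(tool: str, params: Mapping[str, Any], tenant_domain: str) -> str:
    return MANIFEST[tool].risk_for(params, tenant_domain)
```

In this sketch an empty or malformed recipient list falls through to high, on the assumption that an ambiguous classification should fail toward more scrutiny.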
The policy engine is rules, not another LLM
The policy engine sits in the medium-risk path. Its job is to take a structured action request and decide, in deterministic terms, whether the action is allowed.
We do not implement the policy engine as an LLM call. This is a discipline. The whole point of the policy layer is that it is debuggable, replayable, and predictable. An LLM-based policy check inherits every problem the agent has — prompt injection, hallucination, non-determinism — and applies it to the very layer that is supposed to constrain those problems. The policy engine is rules-based code. It is boring on purpose.
The rules are layered.
Per-tool policies describe what the tool itself permits. The email tool can send up to 100 emails per day per tenant. The email tool never sends to a domain on the blocklist. The database write tool can modify records owned by the agent's tenant and no others. The refund tool can process refunds up to a configured ceiling. These are absolute rules — the agent cannot reason its way past them.
Per-tenant policies describe what this specific client's deployment permits. One tenant's policy says the agent can write to staging tables but never production. Another tenant's policy says the agent can send emails only between 8am and 6pm in the tenant's local timezone. A third tenant disables the deletion tool entirely. These policies are configured at deployment, not at runtime, and a tenant administrator can adjust them without touching the agent code.
Per-action-context policies are the most useful and the easiest to get wrong. They evaluate the specific action against contextual rules. The agent can refund up to $500 without approval; above that, a human is required. The agent can update customer records, but a change to a field tagged as sensitive (email, phone, address) requires the customer's last-modified date to be older than seven days. The agent can schedule meetings, but only on calendars where the calendar owner has opted in. These rules live in a structured policy file, version-controlled, reviewed like any other code, and tested.
When a medium-risk action is requested, the policy engine evaluates all three layers. If any layer denies, the action is denied with a structured reason. If no layer denies but any layer requires escalation, the action is routed to a human even when the other layers pass. If all layers pass, the action executes and the result is logged.
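A minimal sketch of that evaluation follows, assuming a deny outranks an escalation when both occur. The `Verdict` type, the three example rules, and their ceilings are hypothetical; a real deployment would load them from the version-controlled policy file.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Verdict:
    outcome: str  # "pass", "deny", or "escalate"
    reason: str


Rule = Callable[[dict], Verdict]


def evaluate(action: dict, layers: Sequence[tuple[str, Rule]]) -> dict:
    """Run every layer, then combine: deny first, then escalate, then pass."""
    trace = {name: rule(action) for name, rule in layers}
    outcomes = {verdict.outcome for verdict in trace.values()}
    if "deny" in outcomes:
        decision = "deny"
    elif "escalate" in outcomes:
        decision = "human_approval_required"
    else:
        decision = "execute"
    return {
        "decision": decision,
        "trace": {name: f"{v.outcome} ({v.reason})" for name, v in trace.items()},
    }


# Hypothetical rules for a refund tool; amounts are in minor units.
def tool_policy(action: dict) -> Verdict:
    ok = action["amount_cents"] <= 5_000_000
    return Verdict("pass" if ok else "deny", "tool ceiling 50,000.00")


def tenant_policy(action: dict) -> Verdict:
    return Verdict("pass", "tenant has refunds enabled")


def context_policy(action: dict) -> Verdict:
    over = action["amount_cents"] > 100_000
    return Verdict("escalate" if over else "pass", "approval threshold 1,000.00")


LAYERS = [
    ("tool_policy", tool_policy),
    ("tenant_policy", tenant_policy),
    ("context_policy", context_policy),
]
```

Every layer runs even after one has denied, so the trace always records what each layer saw. That costs a few extra rule evaluations and makes the policy log far easier to replay.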
This is the layer that turns "the agent might do something unexpected" into "the agent might attempt something unexpected, and we will see it in the policy log." The difference is the entire safety story.
The high-risk approval flow
For high-risk actions, the gate is a structured handoff to a human, not a free-form "do you approve this?" notification. The quality of the approval depends entirely on the quality of the context the human receives, and almost every approval system gets this wrong by surfacing raw model output instead of decision-ready context.
When the agent prepares a high-risk action, it does not execute the tool call. Instead, it constructs an approval request — a structured object that captures everything a human needs to make the decision in under a minute. The request includes the actor (which agent, on which workflow, for which tenant), the action (which tool, with which parameters), the impact (a deterministic summary computed from the parameters — who will receive the email, how much money will move, which record will change), the reasoning (the agent's own explanation, but trimmed and surfaced as a separate field rather than mixed in with the action itself), and the alternatives the agent considered but rejected.
A concrete example from a refund workflow, after policy evaluation has flagged the action as requiring approval:
{
  "request_id": "ar_01HW9Z8KQX3M2P4R6V7Y8Z9A0B",
  "tenant_id": "acme-fintech",
  "workflow_id": "wf_support_q2_backfill_8742",
  "agent": "support-agent-v3",
  "risk": "high",
  "action": {
    "tool": "issue_refund",
    "parameters": {
      "customer_id": "cust_4471",
      "order_id": "ord_9923871",
      "amount_cents": 287400,
      "currency": "NGN",
      "reason_code": "service_disruption"
    }
  },
  "impact": {
    "customer_email": "[email protected]",
    "amount_display": "NGN 2,874.00",
    "destination_account_last4": "8819",
    "irreversible": true,
    "downstream_actions": ["confirmation_email", "ledger_entry"]
  },
  "reasoning": {
    "summary": "Customer reported 3-hour outage on May 22 affecting paid tier. Service log confirms incident SLO breach. Policy entitles 25% of monthly fee.",
    "evidence": [
      "ticket_id: tk_55129 (customer complaint, May 22 14:03)",
      "incident_id: inc_q2_812 (3h17m downtime, paid-tier impact confirmed)",
      "policy_doc: refund_policy_v4 §3.2 (SLO breach refund schedule)"
    ],
    "alternatives_considered": [
      {
        "option": "service_credit_only",
        "rejected_because": "customer is month-to-month, credit expires before next billing cycle"
      },
      {
        "option": "escalate_to_human_without_recommendation",
        "rejected_because": "all evidence supports refund per policy; no ambiguity to escalate"
      }
    ]
  },
  "policy_trace": {
    "tool_policy": "pass (within tool ceiling: NGN 50,000)",
    "tenant_policy": "pass (tenant has refunds enabled)",
    "context_policy": "escalate (amount > NGN 1,000 threshold requires human)",
    "decision": "human_approval_required"
  },
  "expires_at": "2026-05-25T14:30:00Z"
}
The reviewer sees this object rendered into a clear UI: the impact at the top, the reasoning collapsed but expandable, the policy trace as a sidebar. They can approve, reject, or modify the request. Lowering the refund amount, for instance, is a modification, and the action goes through with the adjusted parameters. Every outcome is logged with the reviewer's identity, the timestamp, and a rationale that the reviewer types in. Those rationales feed the next revision of the policy rules.
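The impact block in the request above is meant to be computed from the parameters, not written by the model. A rough sketch of that computation is below. The `format_amount` and `impact_summary` helpers and the shape of the `account` lookup are hypothetical, and the formatting assumes a non-negative amount in a currency with two minor-unit digits.

```python
def format_amount(amount_cents: int, currency: str) -> str:
    """Render minor units as a display string using integer math only."""
    whole, minor = divmod(amount_cents, 100)
    return f"{currency} {whole:,}.{minor:02d}"


def impact_summary(params: dict, account: dict) -> dict:
    """Build the reviewer-facing impact block from trusted lookups.

    account is assumed to come from the system of record, keyed by the
    customer_id in params, never from text the agent generated.
    """
    return {
        "customer_email": account["email"],
        "amount_display": format_amount(params["amount_cents"], params["currency"]),
        "destination_account_last4": account["payout_account"][-4:],
        "irreversible": True,
    }


print(format_amount(287400, "NGN"))  # NGN 2,874.00
```

Because the summary is derived independently of the agent's reasoning, a mismatch between what the agent says it is doing and what the parameters will do is visible to the reviewer at a glance.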
The handoff has an expiry. If a human does not respond within a configured window, the action is automatically denied and the workflow is paused for follow-up. Approvals do not silently expire into yes.
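The expiry rule can be sketched as a small resolver, shown below. The `resolve` function and its status strings are hypothetical, and treating an approval that arrives after the deadline as expired is an assumption this sketch makes, not something the flow above specifies.

```python
from datetime import datetime, timezone
from typing import Optional


def resolve(request: dict, decision: Optional[str], now: datetime) -> str:
    """Return the status of an approval request at time `now`.

    decision is the reviewer's "approve" or "reject", or None if nobody
    has responded. `now` must be timezone-aware.
    """
    # Replace the trailing Z so this also parses on Python versions before 3.11.
    expires_at = datetime.fromisoformat(request["expires_at"].replace("Z", "+00:00"))
    if now >= expires_at:
        return "denied_expired"  # silence, or a late answer, never becomes a yes
    if decision is None:
        return "pending"
    return "approved" if decision == "approve" else "rejected"


request = {"expires_at": "2026-05-25T14:30:00Z"}
late = datetime(2026, 5, 25, 15, 0, tzinfo=timezone.utc)
print(resolve(request, None, late))  # denied_expired
```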
Trust evolves with track record
The thresholds in the policy engine are not static. The same agent, on the same workflow, with two months of clean operation, can have its medium-risk ceiling raised — more actions auto-execute, fewer escalate to humans. The same agent after an incident has its ceilings cut — more actions escalate, the audit team gets eyes on more of the work.
This is encoded explicitly rather than implicitly. The policy file carries thresholds parameterized by a trust score, which is itself a deterministic function of the agent's recent track record: number of actions executed, number that triggered policy escalations, number that produced incidents, time since last incident. When an incident occurs — a refund issued for the wrong amount, a record updated in error, a customer email sent that should not have been — the trust score drops and the thresholds tighten. When the agent has run for a defined period without incident, the trust score recovers and the thresholds widen.
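To make the idea concrete, here is one possible shape for such a function. The formula, the 60-day recovery window, the incident weight, and the ceiling range are all invented for illustration; the only property carried over from the text is that the score is a deterministic function of the four track-record inputs.

```python
def trust_score(
    actions: int, escalations: int, incidents: int, days_since_incident: int
) -> float:
    """Hypothetical score in [0, 1] computed only from the track record.

    For an agent with no incidents, days_since_incident is assumed to be
    the number of days it has been running.
    """
    if actions == 0:
        return 0.0  # no history, no trust
    escalation_rate = escalations / actions
    incident_rate = incidents / actions
    recovery = min(days_since_incident, 60) / 60  # full recovery after 60 clean days
    raw = recovery * (1.0 - escalation_rate) - 50.0 * incident_rate
    return max(0.0, min(1.0, raw))


def refund_ceiling_cents(score: float, floor: int = 10_000, ceiling: int = 50_000) -> int:
    """Map the score linearly onto an auto-approval ceiling in minor units."""
    return int(floor + score * (ceiling - floor))
```

With these made-up weights, an agent with 1,000 actions, 20 escalations, and 60 clean days scores about 0.98, while the same agent two days after a single incident drops to 0.0 and falls back to the floor ceiling.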
The point is not that the system gets more permissive over time. The point is that the system has an explicit, reviewable answer to "how do we know it is safe to let this agent do more on its own?" The answer is a number, computed from data, that maps onto policy ceilings. When a tenant administrator asks why the agent is suddenly escalating more requests, the answer is a chart, not a vibe.
The audit and replay layer
Every action — auto-executed low-risk calls, policy-approved medium-risk calls, human-approved high-risk calls — is logged with full context. The log entry carries the structured action request, the policy trace, the decision (auto, policy, human), the outcome of the tool call, any errors, and the reasoning trace the agent produced.
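As a rough sketch, a log entry with those fields could look like the following. The `AuditEntry` class, its field names, and the one-JSON-object-per-line format are assumptions for illustration, not a required schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass(frozen=True)
class AuditEntry:
    request: dict             # the structured action request
    policy_trace: dict        # what each policy layer returned
    decision: str             # "auto", "policy", or "human"
    outcome: str              # result of the tool call
    reasoning: str            # the agent's reasoning trace
    errors: list = field(default_factory=list)
    reviewer: Optional[str] = None  # set only for human decisions


def to_log_line(entry: AuditEntry) -> str:
    """Serialize to a single JSON line with stable key order for replay and diffing."""
    return json.dumps(asdict(entry), sort_keys=True)


entry = AuditEntry(
    request={"tool": "fetch_record", "parameters": {"id": "cust_4471"}},
    policy_trace={},
    decision="auto",
    outcome="ok",
    reasoning="Look up the customer before drafting a reply.",
)
line = to_log_line(entry)
```

Writing the same shape for all three decision types means a replay tool can walk one stream of entries without special cases for the low-risk path.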
This is not optional, and it is not just an audit requirement. It is the substrate that lets the team learn from the agent's behavior. Post-incident, the team can replay the decision chain step by step: which tool was called, what the policy engine saw, what the reasoning was, what the alternative actions were, what the outcome was. The replay produces the diagnostic question — was this an agent reasoning failure, a policy gap, a missing tool restriction, or a genuine edge case that the policies were never designed to cover?
Each answer points to a different fix. Reasoning failures are addressed with better grounding, narrower context, or model upgrades. Policy gaps are filled by adding rules. Missing tool restrictions are added at the tool layer. Genuine edge cases are routed to humans permanently until enough examples accumulate to justify a new rule. Without the audit trail, the team is guessing about all four.
Where this sits in the broader safety story
The approval gate is not a standalone control. It is one of several layers that, together, decide whether an agent running on real systems is a capability or a liability.
Retrieval grounding decides what the agent knows, and narrows the surface area where the reasoning can drift. Prompt injection defenses decide whether instructions hidden in retrieved content or user input can hijack the agent's decision-making — and the approval gate is the last line of defense when those upstream layers miss an attack, because even a successfully injected instruction has to pass the policy engine and, for high-risk actions, a human. The AWS security posture decides who can invoke the agent and what data it can touch in the first place. The OpenClaw architecture is the substrate that ties these layers together.
The approval gate's specific contribution is to the moment of action — the instant where the agent stops thinking and starts doing. Everything upstream is about making good decisions. The approval gate is about making sure that even when a bad decision slips through, it does not become a bad outcome.
The point is not to slow the agent down. The point is to make the agent fast and safe at the same time, by routing each action to exactly the level of scrutiny it deserves. Low-risk actions never wait. High-risk actions never go through without a human who saw the right context. Medium-risk actions run against rules that the team can read, change, and reason about. That is the human-in-the-loop pattern that scales — not a human on every action, and not a human on no actions, but a human exactly where their judgment is the cheapest way to prevent the worst outcomes.
