Case Study: An SRE AI Agent on Bedrock for CloudWatch Log Triage

Series · Amazon Bedrock for Production AI · Part 8 of 8 ← Part 7: Cost Optimization · Case Study: SRE AI Agent for CloudWatch Log Triage

What this piece is

Parts 1-7 documented the architectural surfaces of building production agents on Amazon Bedrock. This piece does the opposite move — takes one specific use case and walks the full architecture end-to-end. The premise: a 2 AM page about a microservice degradation, with a working SRE AI agent that triages the issue, identifies the failing component, and either takes a remediation action or escalates to the human on-call. The agent runs on Strands + AgentCore, uses Claude Sonnet for reasoning and Haiku for the routing layer, retrieves log context via Cohere embeddings into a Knowledge Base, executes through Lambda action tools, and gates every destructive action through event.interrupt().

Every architectural decision is traceable to a prior piece in the series. Every cost optimisation from Part 7 is applied. Every Guardrail from Part 6 is wired. The case study is the integration test.

The problem worth solving

An SRE team operating a moderately complex microservices workload — say, a 50-service e-commerce stack — fields somewhere between 5 and 30 incident pages per week. Most pages are not actual incidents: they're transient errors, retries that succeeded, latency spikes that resolved themselves. The on-call human triages: open CloudWatch, search the affected service's logs, look for the error pattern, decide what to do.

The triage takes 5-15 minutes of focused attention even for routine pages. At 2 AM, focused attention is in short supply. The cost is not the human time — it's the cognitive tax that erodes the on-call rotation's morale and the slower mean-time-to-resolution on the small subset of pages that are real incidents because attention was burned on triaging the false positives.

An AI agent that handles the triage tier — gathers the relevant logs, classifies the page as transient or real, takes a safe automatic action where one applies, and only pages the human for genuine incidents — is the right shape of automation for this surface. Not autonomous IT operations; tier-1 triage that reliably reduces the human surface to what actually requires judgement.

The reference architecture

The component breakdown:

Trigger: EventBridge rule subscribed to CloudWatch Alarms in ALARM state
Workflow: Step Functions Express workflow (per Part 5) — sub-5-minute typical execution
Router: Claude Haiku classifier as the first state — decides whether to investigate or page immediately
Agent: Strands SDK agent (per Part 3) hosted on AgentCore Runtime for the long-running tasks
Foundation model: Claude Sonnet 4.6 pinned ID for reasoning; Haiku for routing and classification subtasks
Action groups: four Lambda functions, each with a Powertools-generated OpenAPI schema (per Part 5)
Knowledge Base: company runbook corpus on Cohere Embed v3 + OpenSearch Serverless (per Part 2)
Guardrails: denied-topic policy preventing destructive production actions without event.interrupt() gate; PII filter; output grounding check
Observability: full stack per Part 6
Cost: cascade routing per Part 7; per-state cost tags

The agent definition

The Strands agent at the heart of the system:

from strands import Agent, tool
from strands.hooks import BeforeToolCallEvent, AfterToolCallEvent
from strands.memory import SlidingWindowConversationManager
import boto3

logs = boto3.client("logs")
ecs = boto3.client("ecs")
codedeploy = boto3.client("codedeploy")

# ---- Action tools ----------------------------------------------------------

@tool
def query_cloudwatch_logs(
    log_group: str,
    query: str,
    minutes_back: int = 60,
) -> dict:
    """Run a CloudWatch Logs Insights query against the named log group.

    Use this to retrieve log events matching a pattern, count errors,
    or identify the time-window of an anomaly. The query syntax is
    standard CloudWatch Logs Insights — `fields @timestamp, @message |
    filter @message like /ERROR/ | sort @timestamp desc | limit 100`.

    Args:
        log_group: full log group name (e.g. /aws/lambda/checkout-api)
        query: CloudWatch Logs Insights query string
        minutes_back: time window from now (default 60 min)
    """
    end_time = int(time.time())
    start_time = end_time - (minutes_back * 60)

    response = logs.start_query(
        logGroupName=log_group,
        startTime=start_time,
        endTime=end_time,
        queryString=query,
    )
    # poll for results — abbreviated here
    return wait_for_query_results(response["queryId"])


@tool
def get_recent_deployments(service_name: str, hours_back: int = 24) -> list[dict]:
    """Return recent deployments for the named service.

    Useful when investigating whether a recent deployment correlates
    with the observed errors — common root cause for sudden degradation.
    """
    response = codedeploy.list_deployments(
        applicationName=service_name,
        createTimeRange={
            "start": datetime.utcnow() - timedelta(hours=hours_back),
            "end": datetime.utcnow(),
        },
    )
    return [_format_deployment(d) for d in response["deployments"]]


@tool
def restart_ecs_service(cluster: str, service: str) -> dict:
    """Restart the named ECS service (forces a new deployment with no task definition change).

    DESTRUCTIVE ACTION: this will terminate running tasks. Requires
    human approval via event.interrupt() before execution.

    Use only when investigation indicates the service is in a degraded
    state recoverable by restart (memory leak, stuck connections, etc.)
    and not when the root cause is a recent deployment (use rollback
    instead).
    """
    return ecs.update_service(
        cluster=cluster,
        service=service,
        forceNewDeployment=True,
    )


@tool
def rollback_deployment(application: str, deployment_id: str) -> dict:
    """Roll back the named application to the previous successful deployment.

    DESTRUCTIVE ACTION: this will revert production code. Requires
    human approval via event.interrupt() before execution.

    Use when the observed errors correlate strongly with the most
    recent deployment.
    """
    return codedeploy.stop_deployment(deploymentId=deployment_id, autoRollbackEnabled=True)


@tool
def escalate_to_on_call(
    summary: str,
    severity: str,
    suggested_actions: list[str],
) -> dict:
    """Page the on-call engineer with a triage summary.

    Use when investigation indicates a real incident requiring human
    judgement, or when the agent's confidence in autonomous action is low.
    """
    return pagerduty_create_incident(
        summary=summary,
        severity=severity,
        details={"agent_analysis": suggested_actions},
    )


# ---- Hooks for audit -------------------------------------------------------

def audit_tool_call(event: BeforeToolCallEvent):
    """Every tool call lands in CloudTrail-equivalent custom audit log."""
    audit_log({
        "agent": event.agent_id,
        "session": event.session_id,
        "tool": event.tool_name,
        "args": event.tool_args,
        "timestamp": datetime.utcnow().isoformat(),
    })


# ---- Steering: gate destructive actions ------------------------------------

def gate_destructive_actions(event: BeforeToolCallEvent):
    """No destructive action runs without human approval."""
    DESTRUCTIVE = {"restart_ecs_service", "rollback_deployment"}

    if event.tool_name in DESTRUCTIVE:
        approval = event.interrupt({
            "type": "approval_required",
            "tool": event.tool_name,
            "args": event.tool_args,
            "agent_reasoning": event.context.get("reasoning_trace", ""),
            "incident_id": event.context.get("incident_id"),
            "channel": "pagerduty",
            "timeout_seconds": 300,
        })

        if not approval.get("approved"):
            return {"redirect": (
                f"On-call declined the {event.tool_name} action. "
                "Continue investigation and escalate with detailed analysis instead."
            )}


# ---- Agent definition ------------------------------------------------------

SRE_AGENT_INSTRUCTION = """
You are an SRE triage agent. You receive CloudWatch Alarm events and your
job is to determine whether the alarm represents a real incident requiring
remediation or a transient issue that has self-resolved.

Your workflow:
1. Query the affected service's logs over the relevant time window.
2. Check recent deployments — sudden errors after a deployment correlate
   strongly with bad code.
3. Consult the runbook knowledge base for the specific error pattern.
4. Decide on an action:
   - If errors are clearly transient and have stopped: report and close.
   - If errors correlate with a recent deployment: recommend rollback
     (requires approval).
   - If the service is in a stuck state recoverable by restart: recommend
     restart (requires approval).
   - If the cause is unclear or the impact is high: escalate to on-call.

For every action recommendation, cite the specific log events that
support your conclusion. Do not act on assumptions; act on evidence.
Be conservative: when in doubt, escalate.
"""

sre_agent = Agent(
    model="anthropic.claude-sonnet-4-6-20251022",
    instruction=SRE_AGENT_INSTRUCTION,
    tools=[
        query_cloudwatch_logs,
        get_recent_deployments,
        restart_ecs_service,
        rollback_deployment,
        escalate_to_on_call,
    ],
    hooks=[
        audit_tool_call,
        gate_destructive_actions,
    ],
    conversation_manager=SlidingWindowConversationManager(recent_turns=20),
    knowledge_base_id="KB-runbooks-prod-001",
    guardrail_id="GR-sre-agent-prod-001",
)

What the code shows operationally:

Pinned Claude Sonnet for the agent's reasoning. The model ID is a specific dated version per the rule from Part 1.
Tools with clear OpenAPI-style docstrings that the model reads to decide which tool to call. Docstring quality is prompt engineering.
Hooks for audit and safety. Every tool call is logged before execution; every destructive action gates through event.interrupt().
Steering rather than blocking for declined actions. The agent gets the redirect message and proceeds with investigation, not stuck waiting.
Knowledge Base and Guardrails referenced by ID, not inlined.

An incident scenario — walked through

A real shape of incident the agent handles in production. Names changed.

2:04 AM — CloudWatch alarm fires on the checkout-api service: error rate over 5% for 3 minutes.

2:04 AM — EventBridge delivers the alarm to the Step Functions workflow.

2:04 AM — State 1: Haiku classifier reads the alarm metadata, including the service name and the 5% error rate over 3 minutes. Output: "investigation required". Cost: ~$0.0008.

2:05 AM — State 2: SRE Agent invoked on AgentCore Runtime. The agent's first action: call query_cloudwatch_logs with /aws/ecs/checkout-api log group and the query fields @timestamp, @message | filter @message like /ERROR/ | sort @timestamp desc | limit 100, time window 30 minutes.

2:05 AM — query_cloudwatch_logs returns 73 ERROR events, all with the message ConnectionTimeoutException: PostgreSQL connection pool exhausted. The errors started at 2:01 AM.

2:05 AM — The agent calls get_recent_deployments for checkout-api, 24-hour window. Returns: deployment dep-9af3 at 2:00 AM, status Succeeded. The deployment correlates exactly with the error onset.

2:05 AM — The agent retrieves from the runbook Knowledge Base on the query "PostgreSQL connection pool exhausted after deployment." Top retrieved chunk: a runbook noting that recent migrations to the new connection-pool library require explicit pool.close() calls in the request handler; a previous incident with the same signature was caused by a missing close call.

2:06 AM — The agent's reasoning trace: "Errors began precisely at the time of deployment dep-9af3. The error signature matches a known runbook scenario for missing connection close calls. Recommendation: rollback dep-9af3. Confidence: high."

2:06 AM — The agent calls rollback_deployment(application="checkout-api", deployment_id="dep-9af3"). The hook fires; event.interrupt() opens a PagerDuty incident with the agent's reasoning and waits for approval.

2:07 AM — On-call engineer's phone buzzes. The PagerDuty alert shows: "SRE agent recommends rollback of dep-9af3 due to connection-pool exhaustion correlating exactly with deployment time. Confidence: high. Approve / Decline." The engineer reads the agent's reasoning, sees the log evidence cited, sees the runbook hit, and taps Approve.

2:07 AM — The rollback executes. CodeDeploy initiates the reversion.

2:09 AM — The rollback completes. CloudWatch alarm transitions to OK within 90 seconds of rollback completion.

2:10 AM — The agent writes a brief postmortem to the incident channel and closes the workflow execution. Total wall time: 6 minutes. Total on-call human time: 90 seconds (reading the alert and approving).

For comparison: the same incident triaged by a human from scratch typically takes 15-25 minutes — opening the logs, searching, identifying the deployment correlation, checking the runbook, deciding to roll back, executing. The agent compressed the cognitive work, not the safety. The human stayed in the loop for the destructive action.

What the agent does not do

Worth stating explicitly because the "autonomous AI ops" framing oversells what's actually defensible in 2026:

No production database mutations. The agent has no tools that write to production databases directly. Restart and rollback are recoverable; arbitrary writes are not.
No security-policy changes. IAM modifications, security-group updates, KMS-key changes — all out of scope. These require human deliberation, not agent autonomy.
No external API calls beyond the strict tool list. The agent cannot reach the public internet, cannot call third-party APIs, cannot do unbounded actions. The tool list is the entire action surface.
No multi-incident coordination. Each incident workflow is independent. Cross-incident pattern detection happens in a separate analytics pipeline, not in this agent.

The boundary is intentional. An agent that can do more, would do more — including the wrong thing more.

Cost analysis — what running this actually costs

Per-incident cost breakdown for the scenario above (illustrative; numbers will vary by region and current pricing):

Component	Detail	Cost
Haiku classifier	~400 input tokens, ~10 output tokens	~$0.0008
Sonnet reasoning (agent loop)	~4,000 input tokens (system prompt + tool descriptions + cached), ~1,200 output tokens, across 4 turns	~$0.06
Cohere Rerank on Knowledge Base	~25 documents scored	~$0.005
Cohere embeddings (query)	~30 tokens	~$0.00005
Step Functions Express	6 state transitions over 6-minute execution	~$0.0001
Lambda action tools	4 invocations, ~500ms each	~$0.0002
CloudWatch Logs Insights	1 query over 30-min window	~$0.005
Guardrails	Input + output filter on each model call	~$0.0015
Per-incident total		~$0.075

Seven and a half cents per triaged incident. For a workload that triages 200 incidents a day, that's $15/day in inference cost — roughly $5,500/year. A human on-call engineer's cognitive tax on the same workload is materially higher.

The cost optimisations from Part 7 applied here:

Cascade routing: 80%+ of alarms classify as "transient probable" by Haiku and resolve without reaching Sonnet. Average per-incident cost is well below the ~$0.075 calculated for the investigation case.
Prompt caching: the agent's system prompt (~3,500 tokens including tool descriptions) is cached, dropping the recurring input cost to ~10% of nominal.
Top-K discipline: the Knowledge Base retrieval returns top-3 chunks after re-ranking, not top-20.
Express Step Functions: cheaper than Standard for sub-5-minute executions.

What we'd revisit on the next iteration

Five things the deployment surfaces that a second iteration would address:

Multi-region failover for the agent itself. The agent runs in one region; a regional outage takes it down with the workload it's supposed to triage. The fix is the standard multi-region pattern from the AWS architecture series, applied to the agent.
Cross-incident memory. The agent forgets between incidents. Adding AgentCore Memory with cross-session recall lets the agent learn from prior incidents (this deployment caused issues last week too) without retraining.
Tighter integration with deployment systems. Calling rollback_deployment works; calling the team's actual CI/CD system with the deployment context (commit SHA, author, PR link) would make the on-call's approval decision faster.
Domain-specific fine-tuned classifier. The Haiku classifier handles most alarms but mis-routes ~5% — usually edge cases specific to the team's services. A small fine-tuned classifier (per Part 4) trained on the team's historical alarms would tighten that.
Steering handler for log-query safety. The agent occasionally writes inefficient CloudWatch Logs Insights queries that scan more data than needed. A steering handler that catches expensive queries and redirects the agent to more efficient patterns would cap the per-incident CloudWatch cost.

What this case study demonstrates

The series's central thesis was: production AI on Bedrock is a layered architecture, not a single model call. The case study makes the layering literal — every architectural surface from Parts 1-7 appears in the running system. The result is an agent that does useful work at production volume, under safety constraints, with economic discipline.

The same pattern composes to other use cases. The SRE triage agent's shape generalises to compliance-evidence agents, customer-support copilots, document-review agents, research synthesisers — change the tool set, the system prompt, the Knowledge Base, the routing thresholds. The architecture stays.

What this case study is not

It is not a demonstration that "AI can replace SREs." It is a demonstration that AI can absorb the tier-1 triage tax that erodes SRE attention. The humans on-call still own the destructive decisions, the post-incident reviews, the escalation paths, and the architectural ownership of the workload. The agent absorbs the cognitive cost of reading logs at 2 AM.

That trade — preserve human judgement on the calls that matter, automate the cognitive overhead on the ones that don't — is the right shape of AI deployment in 2026 for SRE work specifically and operational work generally. The architecture in this series is what makes it shippable.

Closing the series

This piece closes the eight-part Amazon Bedrock series. The series:

Part 1 — Foundations: Building AI Agents on Amazon Bedrock
Part 2 — RAG with Bedrock Knowledge Bases
Part 3 — Open-source Agent Frameworks on Bedrock
Part 4 — Model Customization on Amazon Bedrock
Part 5 — Multi-step AI Workflows with Step Functions and Bedrock
Part 6 — Security Guardrails and Observability for Bedrock
Part 7 — Cost Optimization on Bedrock
Part 8 — Case Study: An SRE AI Agent on Bedrock for CloudWatch Log Triage (this piece)

The substrate this AI series builds on:

The AWS-for-banks architecture series — the cloud foundation underneath
The Hardening-before-AWS series — the application and identity controls that production AI assumes are in place
The DSI banking threat picture and oil & gas threat picture — the analytical context that informs why this matters

Production AI on Bedrock is no longer experimental in 2026 — the architectural patterns documented across this series are what defensible enterprise deployments look like in practice. The agents that ship are the ones built with the discipline these patterns encode.