Open-Source Agent Frameworks on Bedrock: Strands, LangChain, LlamaIndex and the Managed-vs-Open-Source Decision

Series · Amazon Bedrock for Production AI · Part 3 of 8 ← Part 2: RAG with Bedrock Knowledge Bases · Open-source Agent Frameworks · Part 4: Model Customization on Bedrock →

The decision Bedrock Agents forces

When you reach for Bedrock Agents from Part 1, the managed loop solves a real problem: the orchestration is handled, the action-group plumbing is wired, the audit trail is automatic. The trade-off — and it is a real trade-off, not a tagline — is reduced control over the loop. You cannot intervene mid-turn to redirect the model. You cannot insert custom validation between a tool call's response and the next reasoning step. You cannot run an evaluation harness against the agent's reasoning trace because you do not own the reasoning trace. The agent is opaque by design.

For an estimated 60-70% of production agent use cases — well-defined task surface, modest customisation needs, AWS-native deployment — Bedrock Agents is the right answer. For the other 30-40%, the work has shape that needs the agent loop in your own code. This piece is about that other 30-40%.

The three credible paths into that space:

AgentCore Runtime + Strands SDK — AWS's own open-source SDK paired with the AgentCore managed runtime. Production-grade hooks, steering handlers, multi-agent patterns, observability built in. AWS-native deployment.
LangChain or LlamaIndex — the framework-agnostic options. Broadest ecosystem, oldest documentation, most cross-cloud portability, most plumbing the team has to operate.
Custom agent loop in Python or TypeScript — calling Bedrock's Converse API directly. Maximum control, maximum work. The right answer only when the use case is genuinely novel.

The defensible default for most production AWS-native work is AgentCore + Strands. The reasoning is the rest of this piece.

The framework comparison

The matrix that matters in practice:

Dimension	Bedrock Agents	AgentCore + Strands	LangChain / LlamaIndex
Control over the loop	None (managed)	Full (open-source)	Full (open-source)
Time to first working agent	Minutes	Hours	Hours to days
AWS-native observability	Built-in (CloudWatch / X-Ray)	First-class (AgentCore traces, OpenTelemetry)	Manual wiring
Multi-agent orchestration	Limited	Built-in: Agent-as-Tool, Swarm	Manual or via add-on libraries
Steering / runtime correction	Not available	Steering handlers built-in	Manual via callbacks
Human-in-the-loop	Limited	`event.interrupt()` primitive	Manual implementation
Cross-cloud / cross-vendor portability	None	Bedrock-coupled by default; portable with effort	Full portability across providers
Ecosystem breadth (integrations, plugins)	AWS-only	Growing, AWS-led	Largest in the space
Production maturity (2026)	Mature	Mature, AWS-backed	Mature, community-led
The right answer when…	Single-task, well-defined, fast-start	Production AWS-native with control	Cross-cloud, framework-portable, or existing investment

Why Strands earns the centre column

Strands is, narrowly, the AWS-backed open-source agent harness SDK. The marketing says "control end-to-end"; the engineering reality has four pieces that earn that claim and matter in production.

1. The model-tool-prompt loop is in your code

The minimal Strands agent is short:

from strands import Agent, tool
from pathlib import Path

@tool
def save_report(title: str, content: str) -> str:
    """Save a research report to disk."""
    path = f"reports/{title}.md"
    Path(path).write_text(content)
    return f"Saved {path}"

agent = Agent(
    model="anthropic.claude-sonnet-4-6-20251022",  # pinned Claude
    tools=[save_report],
)
agent("Research AI agent frameworks and save the report.")

That is the whole thing. The Agent class runs the loop; the @tool decorator registers a function as a tool the model can call. The loop is in Python you can inspect, debug, and modify. The model is pinned (per the architectural rule from Part 1). Claude is the production default because tool-use quality matters; the model parameter is one line to swap.

2. Hooks intercept lifecycle events

The hooks system is what most LangChain users build by hand. Strands ships it.

from strands import Agent, tool
from strands.hooks import BeforeToolCallEvent, AfterToolCallEvent

def audit_tool_call(event: BeforeToolCallEvent):
    # Validate, log, or redirect — before the tool actually runs
    log_to_cloudtrail({
        "agent": event.agent_id,
        "tool": event.tool_name,
        "args": event.tool_args,
        "session": event.session_id,
    })
    if event.tool_name == "delete_record" and not event.tool_args.get("approved_by"):
        raise PermissionError("delete_record requires approved_by")

agent = Agent(
    model="anthropic.claude-sonnet-4-6-20251022",
    tools=[save_report, delete_record],
    hooks=[audit_tool_call],
)

The hook fires before every tool call. You can validate, log, redirect, or block. The agent's audit trail in CloudTrail is built from these events rather than reverse-engineered from output logs. For regulated workloads — banking, healthcare, anything with NDPA / GDPR / NIS2 obligations — the hook-based audit is the difference between "we think we have an audit trail" and "we do."

3. Steering handlers — the genuine differentiator

A blocking guardrail says "no." A steering handler says "no, do it this way instead." This is what separates Strands from raw LangChain in production behaviour.

from strands.steering import steer

@steer("sql_query_safety")
def require_where_clause(event):
    """If the model emits a DELETE or UPDATE without WHERE, redirect it."""
    if event.tool_name == "execute_sql":
        sql = event.tool_args.get("query", "").upper()
        if ("DELETE" in sql or "UPDATE" in sql) and "WHERE" not in sql:
            return {
                "redirect": (
                    "That DELETE/UPDATE has no WHERE clause and would affect "
                    "every row. Add a WHERE clause that scopes the change to "
                    "the specific record(s) intended."
                )
            }

The agent reads the steering message back into its reasoning, corrects itself, and proceeds. A LangChain implementation either lets the bad call through, blocks it (which leaves the agent stuck without guidance on how to recover), or requires a custom retry-with-correction loop the developer writes from scratch.

Strands' published benchmark numbers on this: prompt-only agents recovered from ~82.5% of induced errors, hard-coded workflows ~80.8%, agents with steering handlers recovered from every one in the test set. The numbers are the vendor's; the mechanism is real and operationally distinct from what raw frameworks provide.

4. Multi-agent patterns built in

Strands ships two production-relevant multi-agent patterns out of the box.

Agent-as-Tool — a specialist agent registered as a tool that a generalist agent can call.

specialist = Agent(
    model="anthropic.claude-sonnet-4-6-20251022",
    tools=[query_database, generate_chart],
    instruction="You are a data analyst. Answer with charts where useful.",
)

@tool
def ask_data_analyst(question: str) -> str:
    """Delegate a data-analysis question to the specialist agent."""
    return specialist(question).output

orchestrator = Agent(
    model="anthropic.claude-sonnet-4-6-20251022",
    tools=[ask_data_analyst, send_email, ...],
    instruction="You are a research orchestrator.",
)

Swarm — multiple peer agents coordinating on a shared task. Useful for tasks where parallel exploration produces better outcomes than sequential reasoning.

The Agent-as-Tool pattern is the one most production deployments end up using. The specialist agent typically runs on a smaller model (Haiku or Sonnet); the orchestrator runs on Sonnet or Opus. The model-tier routing is per agent, and the cost discipline pays for itself immediately on real workloads.

Conversation managers — memory at the right granularity

Conversation managers are the bridge between the short-term-memory layer of Part 1's four-layer architecture and the actual production behaviour. Strands ships two:

SlidingWindowConversationManager — keeps the last N turns verbatim. Predictable, fast, low cost. Right for transactional agents (customer support, single-task assistants) where the conversation is short.
SummarizingConversationManager — keeps the last N turns verbatim plus a running summary of older turns. Higher per-turn cost (summary updates are an extra model call) but maintains coherence across longer conversations. Right for research agents, copilots, long-running session work.

from strands.memory import SummarizingConversationManager

agent = Agent(
    model="anthropic.claude-sonnet-4-6-20251022",
    tools=[...],
    conversation_manager=SummarizingConversationManager(
        recent_turns=10,
        summary_model="anthropic.claude-haiku-4-5-20251001",  # cheaper for summaries
    ),
)

Note the multi-model routing: the summary updates run on Haiku, not Sonnet, because summarisation is a Haiku-class task. This is the cascade pattern from Part 2 applied at the conversation-management layer. The savings on a long-running agent are material — every turn would otherwise pay Sonnet rates to maintain context.

Human-in-the-loop via event.interrupt()

The hardest agent-deployment problem is the destructive action gate: how does the agent stop and ask a human before doing something irreversible? Strands' event.interrupt() is the primitive.

from strands.hooks import BeforeToolCallEvent

def gate_destructive(event: BeforeToolCallEvent):
    if event.tool_name in ("delete_record", "send_payment", "deploy_to_prod"):
        approval = event.interrupt({
            "type": "approval_required",
            "tool": event.tool_name,
            "args": event.tool_args,
            "reason": "Destructive action — requires human approval before execution.",
        })
        if not approval.get("approved"):
            return {"redirect": "Human declined the action. Stop and report to the user."}

agent = Agent(
    model="anthropic.claude-sonnet-4-6-20251022",
    tools=[..., delete_record, send_payment, deploy_to_prod],
    hooks=[gate_destructive],
)

The interrupt produces a callable hold-point — the agent pauses, the operator receives a notification (Slack, PagerDuty, email, in-app modal — your choice), the operator approves or rejects, the agent resumes. For the case study in Part 8, this is the gate that turns an SRE AI agent from "interesting demo" into "ship in production." Currently shipping in Python; coming to the TypeScript SDK per the published roadmap.

Deployment targets — where Strands actually runs

The Strands deployment story is, in order of AWS-nativeness:

AgentCore Runtime — AWS's managed long-running agent runtime. Multi-hour task execution, session isolation, built-in memory, MCP gateway, browser tool, code interpreter. The right deployment target for production agents that need to do real work over time. Pricing is execution-time-based, not request-based.
AWS Lambda — for short-lived stateless agent invocations. The right target for request/response agents where a single turn is the whole job. 15-minute Lambda time limit applies; multi-turn agents that hit the limit need AgentCore.
AWS Fargate — for sustained agent processes that need a longer runtime than Lambda but don't need AgentCore's specific feature set. Good for scheduled agents, batch processors, agent-as-microservice deployments.
Amazon EKS — for teams already on Kubernetes who want the agent under the same control plane as the rest of their workloads. Standard kubectl, helm, argocd flow.
Docker anywhere — Strands ships as a portable Python or TypeScript package; the deployment is a container the host platform doesn't have to know about.
Terraform — official modules for all the above; deployment is infrastructure-as-code from day zero.

For most cmdev work, AgentCore for production long-running, Lambda for stateless request/response, Fargate for sustained scheduled work is the deployment triad that covers the cases. The choice is per-agent, not per-organisation.

LangChain and LlamaIndex — when they're still the right answer

The Strands centre column does not mean LangChain and LlamaIndex are off the table. The cases where they're genuinely the right choice:

Cross-cloud or cross-vendor portability is a hard requirement. A workload that has to run on AWS, Azure, and GCP with the same agent code — LangChain abstracts the model provider; Strands is Bedrock-coupled by default and portable only with effort.
The team has existing LangChain or LlamaIndex investment. Migration cost across an existing codebase is non-trivial. The pragmatic move is often to keep the framework and use Bedrock as the model provider behind it.
The use case is on a non-Bedrock model. Direct access to a specific OpenAI, xAI, or self-hosted model that Bedrock doesn't catalogue. LangChain has the broadest model-provider abstraction in the space.
Heavy ecosystem use. Specific LangChain plugins, document loaders, or retrievers that have no equivalent in Strands' (newer, smaller) ecosystem. LangChain's age is sometimes its advantage.

A working LangChain-on-Bedrock pattern:

from langchain_aws import ChatBedrockConverse
from langchain.agents import create_tool_calling_agent, AgentExecutor
from langchain_core.prompts import ChatPromptTemplate

llm = ChatBedrockConverse(
    model="anthropic.claude-sonnet-4-6-20251022",
    region_name="us-east-1",
)

prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a customer support agent..."),
    ("placeholder", "{chat_history}"),
    ("human", "{input}"),
    ("placeholder", "{agent_scratchpad}"),
])

agent = create_tool_calling_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools)
executor.invoke({"input": "Where's my order #12345?"})

That works, runs Claude on Bedrock under the LangChain orchestrator, and is the right move when LangChain is the existing investment. The trade-off is that the hooks, steering, and multi-agent patterns Strands ships out of the box need to be reimplemented as LangChain callbacks and chain compositions — more code, more maintenance, more bug surface.

The model-tier routing pattern, framework-agnostic

Regardless of which framework you pick, the multi-model routing rule applies: cheap models for narrow tasks, Claude for reasoning, route by query complexity. In Strands:

def classify_complexity(query: str) -> str:
    """Use Haiku to decide whether a query needs Sonnet or Haiku."""
    classifier = Agent(
        model="anthropic.claude-haiku-4-5-20251001",
        instruction="Classify the query complexity: 'simple' or 'complex'. Reply with one word.",
    )
    return classifier(query).output.strip().lower()

def route_agent(query: str):
    complexity = classify_complexity(query)
    model_id = {
        "simple": "anthropic.claude-haiku-4-5-20251001",
        "complex": "anthropic.claude-sonnet-4-6-20251022",
    }.get(complexity, "anthropic.claude-sonnet-4-6-20251022")

    agent = Agent(model=model_id, tools=[...])
    return agent(query)

Average per-query cost drops materially on workloads with a heavy long tail of simple queries. The classifier itself costs a Haiku call; the savings on routed-to-Haiku queries pay for it many times over. This is the cascade pattern that Part 7 goes deep on.

Operational realities — what production looks like

Five things that show up in production agent deployments regardless of framework:

Cold starts on Lambda hurt. A 2-3 second cold start on a Lambda-hosted agent is unacceptable for conversational latency. Provisioned concurrency or moving to AgentCore Runtime / Fargate are the answers.
Retries need to be intelligent. A failed tool call retried naively often fails the same way. Strands' hooks system lets you implement smart retries with backoff and context modification; LangChain requires custom callback work.
Token usage is the hidden cost. Hooks, steering, multi-agent calls, summarisation — each adds tokens. Instrument every model call with usage tracking from day zero; debugging cost spikes after the fact is expensive.
The eval harness is not optional. Whatever framework, you need a golden set of inputs with expected outputs (or behaviours) and a regression test that runs on every change. Without it, framework upgrades break production silently.
Production agents drift. The model changes (catalogue updates), the corpus changes (KB ingestion), the tools change (Lambda updates). Drift detection — comparing agent behaviour today against the harness baseline — is what catches the change-induced regression before users do.

What's next

Part 4 picks up the model layer: when prompt engineering and RAG hit their limit, when fine-tuning makes sense, and the combined-model pattern that pairs pinned Claude for the hard reasoning step with a custom-tuned smaller model for the narrow recurring task.

The full series:

Part 1 — Foundations: Building AI Agents on Amazon Bedrock
Part 2 — RAG with Bedrock Knowledge Bases
Part 3 — Open-source Agent Frameworks on Bedrock (this piece)
Part 4 — Model Customization on Amazon Bedrock
Part 5 — Multi-step AI Workflows with Step Functions and Bedrock
Part 6 — Security Guardrails and Observability for Bedrock
Part 7 — Cost Optimization on Bedrock (deepest multi-model routing)
Part 8 — Case Study: An SRE AI Agent on Bedrock for CloudWatch Log Triage

Reference: strandsagents.com for the canonical Strands documentation. The Hardening-before-AWS and AWS-for-banks series provide the security and identity substrate; this AI series builds the workload on top.