Multi-step AI Workflows with Step Functions and Bedrock

Series · Amazon Bedrock for Production AI · Part 5 of 8

Where a single agent stops being enough

A Bedrock Agent — from Part 1 — handles a single conversational turn with tool calls. That covers a large fraction of production AI work. It does not cover the cases where the work has a known structure that benefits from being expressed as structure rather than discovered by an agent each turn:

Document processing pipelines. An invoice arrives. Extract fields. Enrich vendor data from a master record. Classify (capex / opex / travel / cogs). Route to the right approver. Each step is well-defined; the variation is in the document, not the workflow.
Multi-document synthesis. Pull ten regulatory PDFs. Summarise each in parallel. Compare summaries. Produce a consolidated view. The parallel-then-merge shape doesn't fit a single agent loop.
Long-running research. Decompose a question. Spawn sub-tasks. Run web searches, database queries, model summarisations in parallel. Reassemble. Quality-check. Re-run failed branches.
Conditional pipelines. If the document is in English, route to model A. If German, model B. If the confidence is low, escalate to human review. Conditional branching is awkward inside an agent's reasoning trace; it's native to a state machine.

For these shapes, AWS Step Functions is the orchestration layer. The optimized Bedrock integration (announced in late 2023) lets a Task state invoke a Bedrock model directly, with no Lambda wrapper. Combined with the AWS SDK integrations, which cover more than 9,000 AWS API actions, the result is a workflow language where the AI calls are first-class citizens alongside every other AWS service.

Step Functions or Bedrock Agents — the decision

The framing isn't "which is better." Both are right for different workload shapes.

The matrix in one table:

Workload shape	Right answer
Single-turn conversational reasoning with tool use	Bedrock Agent (or AgentCore + Strands)
Long-running agent with multi-hour task, session memory, autonomous browsing	AgentCore Runtime + Strands
Known multi-step pipeline with conditional branches	Step Functions Standard + Bedrock task
High-frequency short pipelines (sub-5-min, event-driven)	Step Functions Express + Bedrock task
Single Bedrock invocation, no orchestration needed	Lambda → Bedrock Converse
Combination: agent inside a larger pipeline	Step Functions calling Bedrock Agent as one step

The last row is worth flagging — Step Functions and Bedrock Agents compose. A Step Functions workflow can invoke a Bedrock Agent as one task in a larger state machine. The Agent handles the conversational complexity of one turn; Step Functions handles the cross-step orchestration around it.
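The matrix can also be read as a short decision procedure. The sketch below is one hypothetical encoding of it: the `Workload` fields and the one-hour cut-off for "long-running" are illustrative assumptions, and only the five-minute Express boundary comes from the table itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical description of a workload, used only for this sketch."""
    conversational: bool        # next step depends on what the user says
    known_steps: bool           # pipeline shape is fixed in advance
    max_runtime_seconds: int    # worst-case end-to-end duration
    single_call: bool = False   # one model invocation, nothing to orchestrate


def pick_orchestrator(w: Workload) -> str:
    """Map a workload onto the rows of the matrix above."""
    if w.single_call:
        return "Lambda → Bedrock Converse"
    if w.conversational and w.known_steps:
        return "Step Functions calling Bedrock Agent as one step"
    if w.conversational:
        # Assumed cut-off: anything over an hour counts as long-running.
        if w.max_runtime_seconds > 3600:
            return "AgentCore Runtime + Strands"
        return "Bedrock Agent (or AgentCore + Strands)"
    # Express executions are capped at five minutes.
    if w.max_runtime_seconds < 300:
        return "Step Functions Express + Bedrock task"
    return "Step Functions Standard + Bedrock task"


print(pick_orchestrator(Workload(conversational=False, known_steps=True, max_runtime_seconds=90)))
```

The ordering of the checks matters more than the exact thresholds: the "single call" and "agent inside a pipeline" rows are tested first because they override the duration-based rows.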

The shape of a multi-step AI workflow

A reference document-processing workflow that recurs across cmdev engagements:

Five state-machine primitives that recur in AI workflows:

Task with Bedrock integration — direct call to bedrock:InvokeModel or bedrock:Converse without Lambda. The state-machine definition includes the model ID, the prompt template (with state-substituted variables), and the inference parameters.
Choice state — conditional branching on the model's output. Often the model returns a structured JSON; Step Functions branches on a JSONPath expression. Far cleaner than asking the agent to "decide" implicitly.
Parallel state — fan out to multiple branches simultaneously. Each branch can be its own Bedrock invocation, a SQL query, an API call. Re-converges when all branches complete.
Map state — apply the same workflow to each item in a list. Process ten line items by running the same enrichment pipeline ten times in parallel. Max concurrency, error tolerance, and per-item retries are all configurable.
Catch / Retry — error handling that's aware of Bedrock-specific failure modes (rate limits, model unavailable, context-too-long). Smart retries with exponential backoff and per-error-type routing.
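To make the Choice primitive concrete, here is a minimal sketch that builds a Choice state as a Python dict and evaluates it locally. The state names (`HumanReview`, `ProcessGerman`, `ProcessEnglish`) and the `$.extraction.confidence` / `$.extraction.language` fields are hypothetical, and `next_state` is only a local approximation of two comparison operators, not Step Functions' own evaluator.

```python
def confidence_choice(threshold: float = 0.8) -> dict:
    """Build an ASL Choice state that branches on structured model output."""
    return {
        "Type": "Choice",
        "Choices": [
            # Low confidence wins over language routing, so it is listed first.
            {"Variable": "$.extraction.confidence", "NumericLessThan": threshold, "Next": "HumanReview"},
            {"Variable": "$.extraction.language", "StringEquals": "de", "Next": "ProcessGerman"},
        ],
        "Default": "ProcessEnglish",
    }


def resolve(path: str, data: dict):
    """Resolve a simple dotted path such as '$.a.b' (no filters or indexes)."""
    value = data
    for key in path[2:].split("."):
        value = value[key]
    return value


def next_state(choice: dict, data: dict) -> str:
    """Return the first matching rule's target, mirroring top-to-bottom evaluation."""
    for rule in choice["Choices"]:
        value = resolve(rule["Variable"], data)
        if "NumericLessThan" in rule and value < rule["NumericLessThan"]:
            return rule["Next"]
        if "StringEquals" in rule and value == rule["StringEquals"]:
            return rule["Next"]
    return choice["Default"]


state = confidence_choice()
print(next_state(state, {"extraction": {"confidence": 0.55, "language": "de"}}))
```

A local evaluator like this is mainly useful in unit tests, where branching rules can be checked against sample model outputs before the state machine is deployed.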

The Bedrock task — direct integration

The optimized Bedrock integration for Step Functions (announced in late 2023) removed the need for Lambda wrappers around every model call. A Bedrock task in ASL (Amazon States Language):

{
  "ExtractInvoiceFields": {
    "Type": "Task",
    "Resource": "arn:aws:states:::bedrock:invokeModel",
    "Parameters": {
      "ModelId": "anthropic.claude-haiku-4-5-20251001",
      "Body": {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 2000,
        "messages": [
          {
            "role": "user",
            "content.$": "States.Format('Extract invoice fields from: {}', $.documentText)"
          }
        ]
      }
    },
    "ResultPath": "$.extraction",
    "Next": "ClassifyByType"
  }
}

Three things this gives you compared to "Lambda calling Bedrock":

No Lambda cold start on the model call. The Bedrock task is a synchronous state-machine action.
No Lambda concurrency limits. Bedrock task scales with the underlying model's throughput rather than the account's Lambda concurrency.
Cost attribution per step. Step Functions tags apply to the state machine as a whole, but each Bedrock task can invoke its model through a separately tagged application inference profile, so the cost dashboard from Part 7 attributes spend per workflow step, not per Lambda.

The trade-off: the state-machine definition holds the prompt template, which keeps prompts in version control but makes them harder to A/B-test compared to a Lambda-managed prompt. For most production workflows, the version-control benefit outweighs the testability cost.
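One way to soften that trade-off is to keep each prompt in its own file and generate the Task state from it at deploy time. The sketch below assumes that setup; `bedrock_task` and its parameters are hypothetical names, and the escaping covers only backslashes and single quotes, which have to be escaped inside a States.Format template string.

```python
import json


def bedrock_task(model_id: str, prompt_template: str, input_path: str,
                 result_path: str, next_state: str, max_tokens: int = 2000) -> dict:
    """Build an invokeModel Task state from a prompt template with one {} slot."""
    if prompt_template.count("{}") != 1 or "{" in prompt_template.replace("{}", ""):
        raise ValueError("template must contain exactly one {} placeholder and no other braces")
    # Backslashes first, then single quotes, so the escapes are not double-escaped.
    escaped = prompt_template.replace("\\", "\\\\").replace("'", "\\'")
    return {
        "Type": "Task",
        "Resource": "arn:aws:states:::bedrock:invokeModel",
        "Parameters": {
            "ModelId": model_id,
            "Body": {
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": max_tokens,
                "messages": [{
                    "role": "user",
                    "content.$": f"States.Format('{escaped}', {input_path})",
                }],
            },
        },
        "ResultPath": result_path,
        "Next": next_state,
    }


state = bedrock_task(
    model_id="anthropic.claude-haiku-4-5-20251001",
    prompt_template="Extract invoice fields from: {}",  # would normally be read from a file
    input_path="$.documentText",
    result_path="$.extraction",
    next_state="ClassifyByType",
)
print(json.dumps({"ExtractInvoiceFields": state}, indent=2))
```

With this arrangement an A/B test becomes two template files and two generated state machines (or two aliases), while the prompt text still lives in version control.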

The multi-model routing pattern, applied to workflows

The Claude-first / multi-model rule from earlier parts applies cleanly to Step Functions workflows. The state machine becomes the routing mechanism — each step picks the right model tier for its specific task.

A typical document-processing workflow:

State	Model	Why
Extract	Claude Haiku	Structured extraction from a known document type. Haiku is fast, cheap, and accurate enough for well-formed extraction.
Route by type	Choice state (no model)	Branching on the structured output of extraction. Free.
Enrich each line item	Map state with Claude Haiku per item	Per-item enrichment is high-volume and narrow — Haiku is the right tier. Parallelism keeps wall-clock low.
Score risk	Claude Sonnet	Risk scoring requires reasoning over patterns, anomalies, vendor history. Sonnet's reasoning depth justifies the cost.
Generate audit narrative	Claude Sonnet	Natural language synthesis for the audit trail.
Final human-readable summary	Claude Opus only if the document is high-value or contested	Opus is overkill for routine documents; right for legal or regulatory documents where the synthesis stakes are high.

A workflow that runs ten thousand invoices a day through this pipeline at full Sonnet rates costs a multiple of what the same workflow costs with the routing above; how large a multiple depends on the price gap between tiers and on how much of the token volume sits in the Haiku steps. The savings come from putting Haiku on the easy steps; the quality is preserved by putting Sonnet on the steps where reasoning depth matters.

The router pattern from Part 3 generalises here: a Haiku-tier state at the start of the workflow can decide which subsequent path the document takes (simple vs complex routing), and the state machine branches accordingly.
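The routing argument is ultimately arithmetic, so it helps to run the numbers for a specific pipeline. The sketch below does that under loudly stated assumptions: the per-million-token prices and the per-step token counts are illustrative placeholders, not figures from this series or from the Bedrock price list.

```python
# Assumed prices in USD per million tokens (input, output). Illustrative only:
# check the current Bedrock price list before relying on them.
PRICES = {"haiku": (1.00, 5.00), "sonnet": (3.00, 15.00)}

# Assumed per-document profile: (step, tier, input_tokens, output_tokens, calls).
ROUTED = [
    ("extract", "haiku", 3000, 500, 1),
    ("enrich_line_item", "haiku", 800, 200, 10),
    ("score_risk", "sonnet", 2000, 400, 1),
    ("audit_narrative", "sonnet", 1500, 600, 1),
]


def cost_per_document(steps, force_tier=None):
    """Sum token costs across steps; force_tier prices every step at one tier."""
    total = 0.0
    for _name, tier, tokens_in, tokens_out, calls in steps:
        price_in, price_out = PRICES[force_tier or tier]
        total += calls * (tokens_in * price_in + tokens_out * price_out)
    return total / 1_000_000


routed = cost_per_document(ROUTED)
all_sonnet = cost_per_document(ROUTED, force_tier="sonnet")
docs_per_day = 10_000
print(f"routed:     ${routed * docs_per_day:,.2f}/day")
print(f"all Sonnet: ${all_sonnet * docs_per_day:,.2f}/day")
print(f"ratio:      {all_sonnet / routed:.2f}x")
```

The ratio is driven by the per-tier price gap and by how much token volume sits in the Haiku steps, so it is worth rerunning with measured token counts and current prices.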

Standard vs Express — pick by duration and volume

Step Functions has two execution modes:

Standard Workflows: execution time up to one year, full execution history kept by Step Functions itself (retrievable for 90 days after an execution closes), priced per state transition (~$0.025 per 1,000 transitions). Right for long-running workflows, anything that needs durable retry semantics, anything with a regulatory audit requirement.
Express Workflows: execution time up to five minutes, priced by number of executions, duration, and memory (much cheaper at scale), execution history via CloudWatch Logs only. Right for high-volume short workflows (per-event processing, real-time enrichment).

The choice for AI workflows is mostly determined by the workload's time profile:

Real-time document classification on file upload → Express (~30-90 second runtime, high volume)
Multi-document research synthesis → Standard (multi-minute, lower volume, audit-relevant)
Asynchronous batch processing of a corpus → Standard (multi-hour, parallel, fully audited)
Per-message content moderation → Express (sub-second, very high volume)

For workflows that touch personal data, evidence-grade audit, or regulated workloads, Standard is the right answer regardless of duration: the per-execution event history, exported to durable storage before the 90-day retention window closes, is what NDPA / GDPR / NIS2 auditors actually want to see.
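The selection rule and the Standard pricing model are simple enough to sketch. The function names and the 20-transitions-per-execution figure below are assumptions for illustration; the price per 1,000 transitions and the five-minute Express cap are the figures quoted above.

```python
STANDARD_PRICE_PER_1000_TRANSITIONS = 0.025  # USD, as quoted above
EXPRESS_MAX_SECONDS = 5 * 60                 # Express executions cap at five minutes


def pick_workflow_type(max_runtime_seconds: float, needs_audit_trail: bool) -> str:
    """Return the state machine type, 'STANDARD' or 'EXPRESS'."""
    if needs_audit_trail or max_runtime_seconds > EXPRESS_MAX_SECONDS:
        return "STANDARD"
    return "EXPRESS"


def standard_daily_cost(executions_per_day: int, transitions_per_execution: int) -> float:
    """Orchestration cost only; model tokens are billed separately by Bedrock."""
    transitions = executions_per_day * transitions_per_execution
    return transitions / 1000 * STANDARD_PRICE_PER_1000_TRANSITIONS


# Assumed: 10,000 executions a day at roughly 20 state transitions each.
print(pick_workflow_type(max_runtime_seconds=90, needs_audit_trail=False))
print(f"${standard_daily_cost(10_000, 20):.2f} per day in state transitions")
```

Counting transitions per execution matters for Map-heavy workflows, where every item adds its own transitions to the total.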

Error handling that's aware of Bedrock-specific failures

Step Functions' Retry and Catch blocks need to be tuned for the failure modes Bedrock workloads actually produce:

{
  "Retry": [
    {
      "ErrorEquals": ["Bedrock.ThrottlingException"],
      "IntervalSeconds": 2,
      "BackoffRate": 2.0,
      "MaxAttempts": 4,
      "JitterStrategy": "FULL"
    },
    {
      "ErrorEquals": ["Bedrock.ModelTimeoutException"],
      "IntervalSeconds": 5,
      "BackoffRate": 1.5,
      "MaxAttempts": 2
    },
    {
      "ErrorEquals": ["Bedrock.ValidationException"],
      "MaxAttempts": 0
    }
  ],
  "Catch": [
    {
      "ErrorEquals": ["Bedrock.ServiceQuotaExceededException"],
      "ResultPath": "$.error",
      "Next": "RouteToFallbackModel"
    },
    {
      "ErrorEquals": ["States.ALL"],
      "ResultPath": "$.error",
      "Next": "DeadLetterQueue"
    }
  ]
}

Two operational rules:

Don't retry validation errors. A bad prompt template or malformed input won't fix itself; retries waste tokens.
Catch quota-exceeded into a fallback path. When the primary model (Sonnet) hits a quota, route to a fallback (Haiku, or a different region's Sonnet). The workflow completes; the audit trail records the fallback for cost and quality review.
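It is worth checking how long a retry policy can hold an execution before it falls through to Catch. The sketch below computes the wait ceilings implied by IntervalSeconds, BackoffRate, and MaxAttempts; the helper names are hypothetical. With JitterStrategy FULL, the documented behaviour is that each actual wait is drawn at random between zero and its ceiling, which `with_full_jitter` imitates locally.

```python
import random


def retry_ceilings(interval_seconds: float, backoff_rate: float, max_attempts: int) -> list:
    """Upper bound on the wait before each retry: interval * rate ** n."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


def with_full_jitter(ceilings: list, rng: random.Random) -> list:
    """Draw each wait uniformly between 0 and its ceiling."""
    return [rng.uniform(0, ceiling) for ceiling in ceilings]


throttling = retry_ceilings(interval_seconds=2, backoff_rate=2.0, max_attempts=4)
timeout = retry_ceilings(interval_seconds=5, backoff_rate=1.5, max_attempts=2)

print("throttling ceilings:", throttling, "worst case total:", sum(throttling))
print("timeout ceilings:   ", timeout, "worst case total:", sum(timeout))
print("one jittered draw:  ", with_full_jitter(throttling, random.Random(7)))
```

For the throttling policy above the ceilings are 2, 4, 8, and 16 seconds, so a persistently throttled call can spend up to 30 seconds waiting, plus the time of the attempts themselves. That budget has to fit inside the five-minute cap if the workflow runs in Express mode.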

Powertools for AWS Lambda — when Lambda still has a role

The Bedrock task removed Lambda as a wrapper for most Bedrock calls. Lambda still has a role:

Custom tool implementations for Bedrock Agents (Lambda is the action-group execution surface — covered in Part 1)
Pre-processing or post-processing that doesn't map to a Step Functions intrinsic — complex data shaping, cryptographic operations, external API calls with custom auth
OpenAPI schema generation for action groups — Powertools for AWS Lambda includes a Bedrock Agent helper that generates the OpenAPI schema from typed Python function signatures

A working pattern with Powertools:

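import time
from typing import Annotated

import boto3
from aws_lambda_powertools.event_handler import BedrockAgentResolver
from aws_lambda_powertools.event_handler.openapi.params import Query

app = BedrockAgentResolver()
logs = boto3.client("logs")

@app.get("/cloudwatch_logs", description="Query CloudWatch Logs Insights for a log group")
def query_cloudwatch(
    log_group: Annotated[str, Query(description="Log group name, e.g. /aws/lambda/api-handler")],
    query: Annotated[str, Query(description="CloudWatch Insights query string")],
    minutes_back: Annotated[int, Query(description="Time window in minutes")] = 60,
) -> dict:
    """Run a Logs Insights query and return its status and result rows."""
    end = int(time.time())
    started = logs.start_query(
        logGroupName=log_group,
        startTime=end - minutes_back * 60,
        endTime=end,
        queryString=query,
    )
    # Poll until the query leaves Scheduled/Running (bounded to about 30 seconds).
    for _ in range(30):
        response = logs.get_query_results(queryId=started["queryId"])
        if response["status"] not in ("Scheduled", "Running"):
            break
        time.sleep(1)
    return {"status": response["status"], "results": response["results"]}

def lambda_handler(event, context):
    return app.resolve(event, context)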

Powertools generates the OpenAPI schema the Bedrock Agent reads, handles input validation, emits structured logs and X-Ray traces, and provides typed responses. It eliminates roughly half the boilerplate of writing an action-group Lambda by hand.

A working multi-step pipeline — research synthesis

A real workflow that combines all of the above: a research-synthesis agent that takes a topic, gathers source documents, summarises each in parallel, identifies contradictions, and produces a consolidated briefing.

The state machine shape (ASL excerpts):

{
  "Comment": "Research synthesis workflow",
  "StartAt": "DecomposeTopic",
  "States": {
    "DecomposeTopic": {
      "Type": "Task",
      "Resource": "arn:aws:states:::bedrock:invokeModel",
      "Parameters": {
        "ModelId": "anthropic.claude-sonnet-4-6-20251022",
        "Body": {
          "anthropic_version": "bedrock-2023-05-31",
          "max_tokens": 1500,
          "messages": [{
            "role": "user",
            "content.$": "States.Format('Decompose this research question into 5-8 sub-queries. Reply with only a JSON array of strings. Question: {}', $.topic)"
          }]
        }
      },
      "ResultSelector": {
        "items.$": "States.StringToJson($.Body.content[0].text)"
      },
      "ResultPath": "$.subQueries",
      "Next": "FanOutResearch"
    },
    "FanOutResearch": {
      "Type": "Map",
      "ItemsPath": "$.subQueries.items",
      "ItemSelector": { "subQuery.$": "$$.Map.Item.Value" },
      "MaxConcurrency": 8,
      "ItemProcessor": {
        "StartAt": "QueryKnowledgeBase",
        "States": {
          "QueryKnowledgeBase": {
            "Type": "Task",
            "Resource": "arn:aws:states:::aws-sdk:bedrockagentruntime:retrieveAndGenerate",
            "Parameters": {
              "Input": { "Text.$": "$.subQuery" },
              "RetrieveAndGenerateConfiguration": {
                "Type": "KNOWLEDGE_BASE",
                "KnowledgeBaseConfiguration": {
                  "KnowledgeBaseId": "KB-research-corpus-001",
                  "ModelArn": "anthropic.claude-haiku-4-5-20251001"
                }
              }
            },
            "End": true
          }
        }
      },
      "ResultPath": "$.researchResults",
      "Next": "IdentifyContradictions"
    },
    "IdentifyContradictions": {
      "Type": "Task",
      "Resource": "arn:aws:states:::bedrock:invokeModel",
      "Parameters": {
        "ModelId": "anthropic.claude-sonnet-4-6-20251022",
        "Body": {
          "anthropic_version": "bedrock-2023-05-31",
          "max_tokens": 3000,
          "messages": [{
            "role": "user",
            "content.$": "States.Format('Identify contradictions across these summaries: {}', States.JsonToString($.researchResults))"
          }]
        }
      },
      "ResultPath": "$.contradictions",
      "Next": "SynthesiseBriefing"
    },
    "SynthesiseBriefing": {
      "Type": "Task",
      "Resource": "arn:aws:states:::bedrock:invokeModel",
      "Parameters": {
        "ModelId": "anthropic.claude-sonnet-4-6-20251022",
        "Body": {
          "anthropic_version": "bedrock-2023-05-31",
          "max_tokens": 4000,
          "messages": [{
            "role": "user",
            "content.$": "..."
          }]
        }
      },
      "ResultPath": "$.briefing",
      "Next": "DeliverBriefing"
    },
    "DeliverBriefing": {
      "Type": "Task",
      "Resource": "arn:aws:states:::s3:putObject",
      "Parameters": {
        "Bucket": "research-briefings-prod",
        "Key.$": "States.Format('briefings/{}.md', $.briefingId)",
        "Body.$": "$.briefing.Body.content[0].text"
      },
      "End": true
    }
  }
}

What this workflow shows:

Decomposition runs on Sonnet (reasoning step, Sonnet justified).
Parallel research uses Map state with MaxConcurrency: 8 — eight sub-queries run simultaneously against the Knowledge Base, each through Haiku (the per-query reasoning is light, parallelism handles the volume).
Contradiction analysis runs on Sonnet (genuinely reasoning over the parallel outputs).
Final synthesis runs on Sonnet (could escalate to Opus if the briefing is high-stakes; the model-ID is the only line that changes).
Delivery writes to S3 with direct Step Functions integration — no Lambda required.

End-to-end latency for a typical research run: 90-180 seconds. End-to-end cost: dominated by the synthesis step, with the Map parallel branches contributing modestly because Haiku is cheap.

Operational realities

Five things that bite in production Step Functions + Bedrock workflows:

State input/output size limits. Step Functions imposes a 256KB payload limit per state. Long retrieval-augmented outputs blow this fast. The fix is to store intermediate results in S3 and pass S3 keys through the workflow, not the contents themselves.
Map state concurrency tuning. MaxConcurrency too high overwhelms Bedrock quotas; too low slows the workflow. Tune per region and per model.
Prompt template hygiene. Prompts in ASL are harder to lint than prompts in Python. Treat them like SQL — put them in version-controlled files, render via States.Format, never inline complex prompts directly in JSON.
Workflow versioning. A state machine change can break in-flight executions. Use the state-machine versioning + aliases pattern; cut over with traffic shifting.
Cost visibility per workflow. Tag the state machine, and route each Bedrock task through its own tagged application inference profile. Without that, the bill shows a single "Bedrock: $X" line and the team has no idea which workflow caused it.
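The payload-limit fix in the first item can be sketched as a small helper that decides, per step, whether a result travels inline or as an S3 pointer. Everything named here is an assumption for illustration: `offload_if_large`, the 200 KB threshold (chosen to leave headroom for the rest of the state), and the injected `put_object` callable, which in a real Lambda would wrap a boto3 S3 `put_object` call.

```python
import json
from typing import Callable

STATE_PAYLOAD_LIMIT_BYTES = 256 * 1024  # Step Functions per-state payload limit


def offload_if_large(result: dict, key: str,
                     put_object: Callable[[str, bytes], None],
                     threshold_bytes: int = 200 * 1024) -> dict:
    """Return the result inline if small, otherwise store it and return a pointer."""
    body = json.dumps(result).encode("utf-8")
    if len(body) <= threshold_bytes:
        return {"inline": result}
    put_object(key, body)
    return {"s3Key": key, "sizeBytes": len(body)}


# A dict stands in for the bucket so the sketch runs without AWS credentials.
store = {}
small = offload_if_large({"summary": "ok"}, "runs/1/summary.json", store.__setitem__)
large = offload_if_large({"summary": "x" * 300_000}, "runs/1/retrieval.json", store.__setitem__)
print(small)
print(large["s3Key"], large["sizeBytes"], list(store))
```

Downstream states then need the mirror-image step: check for `s3Key` and fetch the object before building the next prompt.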

When not to reach for Step Functions

The dual to the matrix above: when staying with an Agent or a single Lambda is the right answer.

The workflow is genuinely conversational — multiple turns with the same user, where the next step depends on what the user says next. Stay with Bedrock Agent or AgentCore + Strands.
The workflow is a single model call with light pre/post-processing. Use Lambda → Bedrock Converse directly. Step Functions adds overhead the workload doesn't need.
The workflow is short, high-frequency, and latency-sensitive (sub-100ms). Lambda or direct API integration. Step Functions adds tens of milliseconds even in Express mode.
The workflow is exploratory — you don't yet know the shape. Build it as an agent first. When the shape stabilises and the per-step structure becomes clear, refactor into Step Functions if the gains justify it.

What's next

Part 6 picks up the security and observability layer that wraps every workflow, every agent, and every direct Bedrock invocation: Guardrails policy design, IAM patterns, VPC endpoints, CloudTrail audit, model invocation logging, X-Ray tracing across the multi-step orchestration documented above.

The full series:

Part 1 — Foundations: Building AI Agents on Amazon Bedrock
Part 2 — RAG with Bedrock Knowledge Bases
Part 3 — Open-source Agent Frameworks on Bedrock
Part 4 — Model Customization on Amazon Bedrock
Part 5 — Multi-step AI Workflows with Step Functions and Bedrock (this piece)
Part 6 — Security Guardrails and Observability for Bedrock
Part 7 — Cost Optimization on Bedrock (deepest multi-model routing)
Part 8 — Case Study: An SRE AI Agent on Bedrock for CloudWatch Log Triage

The Amazon Bedrock series. Step Functions composes with everything in the prior pieces — Agents become workflow steps, Knowledge Bases become state-machine retrievals, custom models become routing branches.