Chatbots predict tokens. Agents take actions.
A chatbot receives a prompt and generates a text response. It is a completion engine. No matter how sophisticated the model, the chatbot's output is always the same kind of thing: a string of text. Ask it to check your CRM for overdue invoices and it will write a paragraph about how one might check a CRM for overdue invoices. It will not actually check.
An AI agent is different in kind, not degree. An agent uses the model as a reasoning engine, but it also has hands. It can read documents from a file system, query a database, call an API, write a file, send an email, and chain those operations together into multi-step workflows. When an agent receives a request like "find all overdue invoices and send reminders to clients who haven't responded in 14 days," it decomposes that into subtasks: query the invoicing system for overdue records, filter by last communication date, draft personalized reminder emails, send them through the client's email system, and log each action. At every step, the agent evaluates intermediate results and adjusts. If the CRM returns an unexpected schema, the agent adapts. If the email service rate-limits, it queues and retries.
This is the system we build at OpenClaw. The model does the thinking. The tools do the work. The architecture that connects them determines whether the system is a toy or a production capability.
MCP: standardized tool access
The first problem in building production agents is integration. Every client has a different stack. One runs Salesforce, Gmail, and SharePoint. The next uses HubSpot, Outlook, and Google Drive. A third has a custom ERP with a REST API and documents in S3. Building custom integration code for each deployment does not scale. The math is simple: if each client requires four weeks of integration work, the business model breaks.
Model Context Protocol (MCP), developed by Anthropic, solves this by standardizing how AI models interact with external tools. MCP defines a protocol — a contract for how a model discovers available tools, understands their capabilities, and invokes them with structured inputs and outputs. We build MCP servers for each integration category:
- Email: Gmail, Outlook, custom SMTP
- CRM: Salesforce, HubSpot, Zoho
- Document stores: Google Drive, SharePoint, S3, local filesystems
- Databases: PostgreSQL, MongoDB, MySQL
- Messaging: Slack, WhatsApp Business API, Microsoft Teams
- Calendars: Google Calendar, Outlook Calendar
Each MCP server exposes a set of tools with typed schemas. The Gmail MCP server exposes tools like search_emails, send_email, get_thread, and create_draft. The Salesforce server exposes query_records, update_record, create_record, and get_report. The agent does not know or care about OAuth flows, API pagination, or rate-limit handling — the MCP server handles all of that.
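To make "typed schemas" concrete, here is a minimal sketch of a tool definition and an argument check. MCP tool definitions carry a name, a description, and a JSON Schema for inputs; the specific fields of this send_email schema and the validate_args helper are assumptions for illustration, not the contents of a real Gmail server.

```python
# Illustrative only: the fields below are assumed, not taken from a real server.
SEND_EMAIL_TOOL = {
    "name": "send_email",
    "description": "Send an email from the connected mailbox.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "to": {"type": "string"},
            "subject": {"type": "string"},
            "body": {"type": "string"},
        },
        "required": ["to", "subject", "body"],
    },
}

# Only the JSON Schema types this sketch needs.
_JSON_TYPES = {"string": str, "object": dict, "array": list}


def validate_args(tool: dict, args: dict) -> list[str]:
    """Return a list of problems; an empty list means the call is well-formed."""
    schema = tool["inputSchema"]
    problems = [
        f"missing required field: {name}"
        for name in schema.get("required", [])
        if name not in args
    ]
    for name, value in args.items():
        spec = schema["properties"].get(name)
        if spec is None:
            problems.append(f"unknown field: {name}")
        elif not isinstance(value, _JSON_TYPES[spec["type"]]):
            problems.append(f"wrong type for {name}: expected {spec['type']}")
    return problems
```

Because the schema travels with the tool, a malformed call can be rejected before it ever reaches the underlying API.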
When we deploy to a new client, the process is composition, not construction. We assess the client's stack, select the relevant MCP servers from our library, configure credentials and permissions, and wire them into the agent's tool manifest. A client running Salesforce, Gmail, and Google Drive gets three pre-built servers composed together. The deployment conversation shifts from "how do we integrate with your systems" to "which of your systems should the agent access."
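Composition can be sketched as a lookup against a server library. The SERVER_LIBRARY structure and the compose_manifest function are hypothetical; the Gmail and Salesforce tool names come from the text above, while the Google Drive tool names are invented for the example.

```python
# Hypothetical library: client system -> tools exposed by its pre-built server.
SERVER_LIBRARY = {
    "gmail": ["search_emails", "send_email", "get_thread", "create_draft"],
    "salesforce": ["query_records", "update_record", "create_record", "get_report"],
    "google_drive": ["search_files", "read_file"],  # invented tool names
}


def compose_manifest(client_stack: list[str]) -> tuple[dict[str, list[str]], list[str]]:
    """Return (manifest, missing).

    manifest maps each known system to its namespaced tool names;
    missing lists systems that have no server in the library yet.
    """
    manifest: dict[str, list[str]] = {}
    missing: list[str] = []
    for system in client_stack:
        if system in SERVER_LIBRARY:
            manifest[system] = [f"{system}.{tool}" for tool in SERVER_LIBRARY[system]]
        else:
            missing.append(system)
    return manifest, missing
```

The missing list is the only part of a deployment that implies new construction; everything in the manifest is reuse.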
This is the architectural decision that makes seven-day deployments realistic. We chose MCP over building a proprietary tool abstraction because MCP is an open protocol with growing ecosystem support. The trade-off: we depend on the protocol's evolution and must contribute upstream when we hit edge cases the spec does not cover. We accepted that trade-off because the alternative — maintaining a proprietary integration layer — is a staffing problem that compounds.
RAG: domain knowledge at retrieval time
An agent with tools can take actions, but it still needs domain knowledge. A legal operations agent must understand contract law terminology. A compliance agent needs to know the specifics of NDPA and GAID. An HR agent must reference company policy documents. This knowledge cannot be baked into the model's weights — it is too specific, too dynamic, and often confidential.
RAG (Retrieval-Augmented Generation) is how we inject domain knowledge at inference time. The pipeline has seven stages, and each one matters:
1. Document ingestion. We pull documents from wherever they live — SharePoint folders, S3 buckets, Google Drive, local file shares. We handle PDFs, Word documents, Excel spreadsheets, scanned images, and HTML. The ingestion pipeline is event-driven: when a document is added or updated in the source system, the pipeline triggers automatically.
2. Preprocessing. Scanned documents go through OCR (Amazon Textract). PDFs are parsed with layout-aware extraction that preserves table structures and heading hierarchies. Format conversion normalizes everything to structured text with metadata.
3. Chunking. This is where most RAG implementations fail. Fixed-size chunking (split every 512 tokens) destroys context. A paragraph about contract termination clauses gets split mid-sentence and merged with unrelated content. We use semantic chunking: the system identifies natural document boundaries — sections, paragraphs, list items, table rows — and creates chunks that preserve semantic coherence. We also maintain per-document-type chunking strategies. Legal contracts are chunked by clause. Policy documents are chunked by section. Financial reports are chunked by table and commentary block.
4. Embedding. Each chunk is embedded using Cohere Embed v3 via Bedrock. We chose Cohere over Titan for multilingual performance — our clients operate across English, French, and Arabic documents. The embedding captures semantic meaning, not just keyword presence.
5. Vector storage. Embeddings go into Pinecone for clients with large document corpora (100K+ chunks) and pgvector for smaller deployments where we want to minimize infrastructure. Each vector carries metadata: source document, page number, section heading, document type, last modified date, and access permissions.
6. Retrieval. When the agent needs knowledge, we run hybrid search: semantic similarity (cosine similarity between the query embedding and each chunk embedding) combined with keyword search (BM25) against the raw text. Hybrid search catches what either method alone misses: semantic search finds conceptually related content, while keyword search catches exact terms and identifiers that embeddings sometimes flatten. We apply metadata filters before search, so an agent querying about HR policy searches only HR documents, not the entire corpus.
7. Reranking. The initial retrieval returns the top 20 candidates. A reranking model (Cohere Rerank) rescores them against the original query with cross-attention, considering the full query-document interaction rather than just embedding similarity. The top 5 after reranking go into the generation prompt. This step consistently improves answer accuracy by 15-20% over retrieval alone.
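The chunking point in stage 3 is easiest to see in code. This is a minimal sketch of boundary-aware chunking that handles paragraph boundaries only; it does not reproduce the per-document-type strategies described above. The function name, the blank-line paragraph rule, and the use of word count as a stand-in for token count are all assumptions.

```python
import re


def chunk_by_paragraph(text: str, max_words: int = 120) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_words words.

    A paragraph is never split. A paragraph longer than the budget
    becomes its own oversized chunk instead of being cut mid-sentence.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would overflow it.
        if current and current_words + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(para)
        current_words += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A fixed-size splitter would cut wherever the counter runs out. Here the budget only decides which boundary to cut at, never whether to cut inside a paragraph.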
The generation step is Claude receiving the retrieved context alongside the user's question, with instructions to ground its response in the provided documents and cite sources. The agent does not hallucinate contract clauses because it is reading the actual clause, not generating one from parametric memory.
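The path from two search results to the five chunks that reach the prompt can be sketched as follows. The text does not say how the semantic and keyword rankings are merged, so this example assumes reciprocal rank fusion, a common choice. The rerank_score callable is a stand-in for the reranking model, and metadata filtering is assumed to have happened before the two rankings were produced.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; each list contributes 1 / (k + rank) per ID."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def retrieve(semantic: list[str], keyword: list[str], rerank_score,
             candidates: int = 20, final: int = 5) -> list[str]:
    """Fuse both rankings, keep `candidates`, rescore them, return the best `final`."""
    fused = reciprocal_rank_fusion([semantic, keyword])[:candidates]
    return sorted(fused, key=rerank_score, reverse=True)[:final]
```

A chunk that ranks well in both lists rises to the top of the fused list, and a chunk found by only one method still survives as a candidate for the reranker.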
Orchestration: multi-agent task decomposition
A single agent with tools and knowledge handles straightforward tasks well. Complex workflows require orchestration — multiple specialized agents coordinating across steps.
When the system receives a complex request, the orchestration layer decomposes it into a task graph. Consider: "Review all vendor contracts expiring in Q3, flag non-standard liability clauses, and draft renewal recommendations for the legal team." The orchestrator breaks this into:
1. Research Agent → query contract database for Q3 expirations
2. Analysis Agent → for each contract, extract liability clauses
3. Analysis Agent → compare clauses against standard template
4. Drafting Agent → write renewal recommendations for flagged contracts
5. Review Agent → check recommendations for completeness and accuracy
Each agent is specialized. The research agent is optimized for database queries and document retrieval. The analysis agent carries domain-specific prompts for legal clause comparison. The drafting agent produces structured output in the client's preferred format. The review agent applies quality checks.
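The decomposition above is a dependency graph, and a minimal sketch can represent it directly. The step names and the TASK_GRAPH layout are illustrative assumptions; graphlib is part of the Python standard library from version 3.9.

```python
from graphlib import TopologicalSorter

# Each step maps to (agent, steps it depends on). Step names are illustrative.
TASK_GRAPH = {
    "find_expiring": ("research", set()),
    "extract_clauses": ("analysis", {"find_expiring"}),
    "compare_to_template": ("analysis", {"extract_clauses"}),
    "draft_recommendations": ("drafting", {"compare_to_template"}),
    "review": ("review", {"draft_recommendations"}),
}


def execution_order(graph: dict) -> list[tuple[str, str]]:
    """Return (step, agent) pairs in an order that respects dependencies."""
    sorter = TopologicalSorter({step: deps for step, (_, deps) in graph.items()})
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return [(step, graph[step][0]) for step in sorter.static_order()]
```

This example is a straight chain, but the same structure also covers steps that can run in parallel because they share no dependency.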
State management is the hard part. Multi-step workflows span minutes or hours. The orchestrator maintains a persistent state object for each workflow: which steps have completed, what intermediate results were produced, where the workflow currently stands. If the system restarts or the workflow is interrupted (a tool call times out, a model returns an error), the orchestrator resumes from the last successful checkpoint rather than restarting from scratch.
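Checkpoint-and-resume can be sketched in a few lines. This is a minimal illustration under stated assumptions: state lives in a local JSON file, steps run strictly in sequence, and the function names are hypothetical. A production orchestrator would use a durable store.

```python
import json
import os


def _save(state: dict, path: str) -> None:
    # Write to a temp file, then rename, so a crash mid-write
    # cannot leave a corrupted checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def run_workflow(steps, state_path: str) -> dict:
    """Run (name, fn) steps in order, checkpointing after each success.

    Each fn receives the results so far. On a re-run with the same
    state_path, steps already completed are skipped.
    """
    state = {"completed": [], "results": {}}
    if os.path.exists(state_path):
        with open(state_path) as f:
            state = json.load(f)
    for name, fn in steps:
        if name in state["completed"]:
            continue  # finished in an earlier run
        # If fn raises, the file on disk still holds the last good checkpoint.
        state["results"][name] = fn(state["results"])
        state["completed"].append(name)
        _save(state, state_path)
    return state
```

Calling run_workflow again after a failure resumes at the first step that has not completed, with the earlier results still available to it.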
This is where prompt engineering meets software engineering. The orchestrator itself is model-driven — Claude decides how to decompose tasks and which agent handles each step. But the state machine, retry logic, timeout handling, and checkpoint persistence are traditional software. Neither approach alone is sufficient.
Model selection and routing
Not every subtask needs the most capable model. Running Claude Opus on a binary classification task ("Is this document a contract? Yes or no.") is wasteful. Running a small model on complex legal analysis produces poor results. The system routes tasks to models based on three factors:
Complexity. Multi-step reasoning, nuanced analysis, and long-form generation go to Claude (Sonnet or Opus via Bedrock, depending on the task's difficulty). Classification, entity extraction, and simple routing go to Mistral or Cohere Command. Embedding goes to Cohere Embed v3.
Latency. User-facing interactions where someone is waiting need fast responses. We route these to Sonnet or smaller models. Background processing (document analysis, batch report generation) can use slower, more capable models.
Cost. Every model invocation has a token cost. The routing layer enforces per-request token budgets. If a workflow is consuming more tokens than expected, the router downgrades the remaining subtasks to cheaper models and flags the anomaly for review.
The routing logic is a decision tree, not a model call. We do not use AI to decide which AI to use — that adds latency, cost, and a failure mode. A rules-based router with clear thresholds is faster, cheaper, and debuggable.
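A rules-based router of this kind can be sketched as a handful of ordered checks. The tier names ("small", "mid", "large"), the task kinds, and the 2,000-token threshold are placeholders for illustration, not our actual configuration.

```python
from dataclasses import dataclass


@dataclass
class Task:
    kind: str              # e.g. "classification", "extraction", "analysis"
    interactive: bool      # True when a user is waiting on the response
    budget_remaining: int  # tokens left in the workflow budget


# Task kinds that never need a large model (illustrative set).
SIMPLE_KINDS = {"classification", "extraction", "routing"}


def route(task: Task, low_budget_threshold: int = 2000) -> str:
    """Pick a model tier with plain rules; no model call is involved."""
    if task.kind in SIMPLE_KINDS:
        return "small"
    if task.budget_remaining < low_budget_threshold:
        # Budget nearly spent: downgrade. Flagging the anomaly is left to the caller.
        return "small"
    if task.interactive:
        return "mid"    # latency-sensitive: prefer the faster capable model
    return "large"      # background work can afford the most capable model
```

Every routing decision traces back to one branch, which is what makes a misroute straightforward to debug.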
Production concerns: observability, cost, and failure
Running AI agents in production — where they take real actions on real systems — demands a level of observability that prototype deployments never consider.
Logging. Every tool call, every model invocation, every decision branch is logged with structured metadata: timestamp, workflow ID, agent type, model used, input tokens, output tokens, tool name, tool result status, and latency. When an agent sends an incorrect email or flags the wrong contract, we reconstruct the entire decision chain from logs. This is not optional — it is a compliance requirement for regulated industries.
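One way to structure such a record is a JSON line per event. The sketch below uses the fields listed above; the function names and the latency_ms unit are assumptions.

```python
import json
import time


def log_record(workflow_id, agent_type, model, input_tokens, output_tokens,
               tool_name, tool_status, latency_ms, timestamp=None):
    """Serialize one event as a JSON line."""
    return json.dumps({
        "timestamp": time.time() if timestamp is None else timestamp,
        "workflow_id": workflow_id,
        "agent_type": agent_type,
        "model": model,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "tool_name": tool_name,
        "tool_status": tool_status,
        "latency_ms": latency_ms,
    })


def decision_chain(lines, workflow_id):
    """Return one workflow's events in time order, for after-the-fact review."""
    records = (json.loads(line) for line in lines)
    return sorted(
        (r for r in records if r["workflow_id"] == workflow_id),
        key=lambda r: r["timestamp"],
    )
```

Because every record carries the workflow ID, reconstructing a decision chain is a filter and a sort rather than a search through free-text logs.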
Cost control. Token budgets are enforced at three levels: per-request (a single user query cannot consume more than a configured token ceiling), per-workflow (a multi-step workflow has a total budget), and per-client (monthly spend caps). When a budget is approached, the system alerts operations. When a budget is exceeded, the system degrades gracefully — switching to cheaper models or refusing to continue rather than running up unbounded costs.
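The three-level check can be sketched as follows. The Budget class, the charge function, and the 80% alert ratio are assumptions for illustration; how the system degrades after an "exceeded" result is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    limit: int
    used: int = 0
    alert_ratio: float = 0.8  # assumed alert threshold

    def status_after(self, tokens: int) -> str:
        """Classify what spending `tokens` more would do to this budget."""
        projected = self.used + tokens
        if projected > self.limit:
            return "exceeded"
        if projected >= self.limit * self.alert_ratio:
            return "alert"
        return "ok"


def charge(tokens: int, budgets: dict[str, Budget]) -> dict[str, str]:
    """Check every level; record usage only if no level would be exceeded."""
    statuses = {level: b.status_after(tokens) for level, b in budgets.items()}
    if "exceeded" not in statuses.values():
        for b in budgets.values():
            b.used += tokens
    return statuses
```

Checking all levels before recording any usage keeps the counters consistent: a call that is refused at one level costs nothing at the others.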
Error recovery. Tool calls fail. APIs return 500s. Models hallucinate tool invocations with malformed parameters. The system handles each failure mode. Transient errors get retries with exponential backoff. Persistent failures trigger a fallback: an alternative tool where one exists, otherwise a retry queue, so that an action is never silently dropped (if the primary email API is down, the email is queued rather than lost). Model errors trigger re-prompting with explicit correction. If all recovery attempts fail, the workflow pauses and alerts a human operator.
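The retry path for transient errors can be sketched as below. TransientError is a hypothetical exception class standing in for whatever a given tool raises on a 500 or a timeout, and the delay values and jitter are illustrative.

```python
import random
import time


class TransientError(Exception):
    """Stand-in for errors worth retrying, such as an HTTP 500 or a timeout."""


def call_with_backoff(fn, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call fn, retrying on TransientError with exponential backoff plus jitter.

    Delays grow as base_delay * 2 ** (attempt - 1). After the final
    attempt the error propagates so the caller can pause the workflow.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads out retries from concurrent workflows.
            sleep(delay + random.uniform(0, delay / 2))
```

Passing sleep as a parameter keeps the function testable without real waiting.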
The "agent did something unexpected" scenario. This is the production concern that keeps us honest. An agent with access to a client's email system could, in theory, send emails the client did not intend. We mitigate this with action approval gates: high-impact actions (sending external emails, modifying database records, deleting documents) require explicit confirmation before execution. The confirmation can come from a human operator or from a policy engine that evaluates the action against predefined rules. Low-risk actions (reading documents, running search queries) execute without gates. The threshold between low-risk and high-risk is configurable per client.
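An approval gate can be sketched as a check that sits between the agent's decision and the tool call. The action names in HIGH_IMPACT, the execute function, and the example policy (allow email only to an assumed internal domain) are all hypothetical.

```python
# Illustrative set; in practice this is configured per client.
HIGH_IMPACT = {"send_email", "update_record", "delete_document"}


def execute(action: str, args: dict, run, approve) -> dict:
    """Run low-risk actions directly; ask `approve` first for high-impact ones.

    `run` performs the action; `approve` is a human prompt or a policy
    function that returns True or False.
    """
    if action in HIGH_IMPACT and not approve(action, args):
        return {"status": "blocked", "action": action}
    return {"status": "done", "action": action, "result": run(action, args)}


def internal_only_policy(action: str, args: dict) -> bool:
    """Example rule: auto-approve email only to an assumed internal domain."""
    if action == "send_email":
        return args.get("to", "").endswith("@example.com")
    return False  # every other high-impact action waits for a human
```

The gate decides before the tool runs, so a blocked action has no side effects to undo.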
Why MCP is the infrastructure moat
Every MCP server we build is reusable across every client who uses that system. Build the Salesforce MCP server once, deploy it to every Salesforce client. Build the SharePoint server once, use it everywhere SharePoint exists. The library grows with each deployment.
The economics compound. Our first client deployment required building five MCP servers from scratch. The second client shared three of those five. By the tenth client, most deployments require zero or one new server; the rest is composed from the existing library.
This is the moat. A competitor starting today must build each integration from scratch for their first client, and their second, and their third — unless they adopt the same architecture. The marginal cost of our next deployment drops with every deployment we complete. The marginal cost of a bespoke-integration approach stays flat.
The network effect extends beyond our own deployments. MCP is an open protocol. As the ecosystem grows — as other teams build and open-source MCP servers — our library grows without our engineering effort. We contribute servers upstream, others contribute servers we can use. This is a bet on the ecosystem, and so far, the bet is paying off.
The question for engineering leaders evaluating build-vs-buy is not whether your team can build an AI agent. Any competent team can wire a model to an API. The question is whether your team should spend months building integration infrastructure that exists as a composable, reusable layer — or whether that engineering time is better spent on the domain-specific logic that actually differentiates your product.
We chose to build the infrastructure layer because agent deployment is our product. For most organizations, the right answer is the opposite: use an infrastructure layer that already exists, and invest your engineering effort in the domain problems only you can solve.
