בניית ארכיטקטורות סוכני AI שמתרחבות
Most "multi-agent system" demos collapse the first time they hit a real workload. Agents talk in circles, the token bill triples, and you cannot tell which of the seven agents produced the wrong answer. I have shipped enough of these to have opinions about what actually holds up. This is how I think about agent architecture, grounded in systems I have built and run: a Trading AI analyst on LangGraph, OrchestKit (my open-source Claude Code framework), SkillForge (a 3-tier platform with 20+ agents), and a clinical voice-to-chart pipeline.
Start with one agent and a loop, not a framework
The default starting point should be a single agent in a loop: a prompt, a set of tools, and a while that keeps calling the model until it stops requesting tool calls. No graph, no supervisor, no message bus. For a surprising number of problems this is the correct final architecture, not just the prototype.
Jumping to multi-agent early has an honest cost that nobody puts on the slide. You pay for inter-agent message passing in tokens. You pay for it in latency, because every handoff is another round trip to the model. And you pay for it in debuggability, because a bug is now "somewhere in the conversation between four agents" instead of "in this function." If one agent with good tools can do the job, that is the answer. Ship it.
When you actually need more than one agent
There are three forcing functions that genuinely push me past a single agent, and none of them is "it feels more sophisticated."
The first is context-window pressure. In SkillForge, the job is to turn a URL into implementation-ready artifacts for an AI coding assistant. The crawl output, the extracted structure, the synthesis, and the validation cannot all share one context window without the model losing the thread. Splitting the work means each agent sees only what it needs.
The second is conflicting system prompts. The instructions that make a market-analysis agent good (be skeptical, weigh contradictory signals) are the opposite of what makes a risk-modeling agent good (be conservative, assume the worst case). In the Trading AI analyst, market analysis, signal generation, risk modeling, and portfolio management are separate agents precisely because their objective functions conflict. One prompt cannot serve all four without becoming mush.
The third is tool routing. Once an agent has thirty tools, its selection accuracy drops. Partitioning tools across specialist agents keeps each agent's action space small enough that it picks correctly.
Supervisor/worker over flat swarms
I route everything through an explicit graph. In the Trading AI analyst that graph is LangGraph: a supervisor node decides which specialist runs next, the specialists do their work, and results stream to clients over WebSockets. The agents never free-talk to each other. They read and write shared state, and the graph decides transitions.
The failure mode of autonomy, where you let agents converse and self-organize, is that you lose the two things production needs most: determinism and a place to put a breakpoint. Free-talking swarms loop, they re-litigate decisions already made, and their cost is unbounded because nothing owns the stop condition. An explicit graph gives you a finite state machine. You can draw it, replay it, and assert on it.
This is why my strongest opinion in this space is: most problems need a DAG, not autonomy. The clinical voice-to-chart pipeline is the clean example. It chains transcription (Groq or Gemini) into structured medical-finding extraction into chart-text generation. That is a linear pipeline, three nodes, no supervisor needed. Reaching for an autonomous swarm there would add nothing but failure surface.
Designing the agent boundary
The hardest design question is what becomes its own agent versus a tool call versus a sub-graph node. My rule of thumb:
- It is a tool if it is deterministic or a single external call: query a database, hit an API, run a calculation. Tools do not reason.
- It is a node if it is one reasoning step with a fixed role inside a known flow: the extraction step in the voice-to-chart pipeline is a node, not an agent.
- It is its own agent only when it needs its own system prompt, its own tool set, and its own multi-turn loop to finish its job.
OrchestKit is the worked example of carrying this to scale. Its agents are markdown files under plugins/ork/agents/, an inspectable, plain-text definition format. The count is live-sourced from the GitHub tree at build time through src/lib/orchestkit-stats.ts rather than hardcoded, so the site never drifts from the repo. The decomposition there is by genuine specialty: a backend-system-architect and a debug-investigator are different agents because they carry different expertise, different default tools, and different failure modes, not because splitting them looked tidy on a diagram.
OrchestKit is the open-source half of how I run this. The other half is private: a roster of eleven operations agents wired to my own platform's infrastructure, each scoped to one job and one blast radius. A read-only tier handles daily briefs, observability, and security posture; mutating agents own deploys and content; writer agents own comms and commerce. Read-only versus mutating is not cosmetic. It is the boundary that decides what an always-on agent can break, and it is the first thing I design when an agent gets the keys to production.
SkillForge taught me why tiering beats a flat roster once you pass a handful of agents. With 20+ agents in a flat swarm, coordination cost explodes and no one owns sequencing. The 3-tier structure fixes that: a planner tier decides what needs doing, a worker tier does it in parallel, and a synthesis tier reconciles the outputs into the final artifact. Tiers give you a place for fan-out and a place for fan-in. Flat swarms have neither.
State and memory between agents
Here is the deflationary truth: most "agent memory" is a database you already know how to run. I pass structured state through the graph as the source of truth for a single run. For anything that needs to persist or be shared, the storage is boring on purpose.
- Postgres with pgvector holds durable, queryable state and embeddings. Both the Trading AI analyst and SkillForge use it. Semantic recall is a vector query, not a new category of infrastructure.
- Redis holds hot, ephemeral coordination state: what is in flight, short-lived run context, pub/sub for streaming.
You do not need a bespoke "memory layer" product to start. You need a schema for the state object that moves through your graph, and a decision about which fields are per-run (state) versus durable (Postgres) versus hot-and-disposable (Redis). Get that boundary right and "agent memory" stops being mysterious.
Tool routing and the schema contract
Reliability comes from constraining the action space, not from prompt-tuning the model into compliance. An agent that can only emit one of five typed actions is dramatically more reliable than one you have begged to "please only respond in JSON."
Define each tool with a strict input schema and validate the model's call against it before executing. A malformed call is rejected and fed back, not run. This turns a class of silent failures into a loud, catchable one. Constrain first, prompt second. Every hour I have spent tightening a tool schema has paid back more than the equivalent hour spent rewording a system prompt.
Observability or it didn't happen
A multi-step agent run is invisible without tracing. I lean on Langfuse for exactly this across the agent systems I have run. Each LLM call, each tool invocation, and each node transition becomes a span you can inspect. When an answer is wrong, you open the trace and see which node produced the bad output instead of guessing.
The other half of keeping an agent graph healthy is testing it without burning money or flaking CI. The pattern I shipped on the journalistic accuracy scoring system is tiered evaluation:
- CI runs against mocked LLMs. Deterministic, fast, free. This validates graph wiring, state transitions, schema contracts, and routing logic, everything except model quality.
- A nightly run hits real cloud models. This is where you measure output quality, catch model drift, and validate prompts against reality.
This split is the difference between an agent system you can actually maintain and one whose test suite is too slow, flaky, and expensive to run, so it quietly stops running.
What I'd tell a team before their first multi-agent build
Start with one agent and good tools. Add a second agent only when context pressure, conflicting prompts, or tool-routing accuracy forces your hand, and name which one it is. When you add agents, route them through an explicit graph with a supervisor, not a free-talking swarm. Default to a DAG; reserve autonomy for the rare problem that truly needs it. Keep state in Postgres and Redis you already understand. Make tools strict and typed. Wire in tracing on day one, and tier your evals so CI stays fast and nightly catches reality.
The smallest architecture that survives production is almost always smaller than the one you were about to build.