Goals,
decomposed.
Safely.
The planning layer for autonomous agents. Turns goals into typed DAGs of tool calls. Every node carries a risk tier, a precondition, and a rollback. Replans on failure. Grounds in Recall.
ReAct loops are guess-and-pray.
Linear chains, exponential failure.
ReAct/CoT agents call one tool at a time, then re-prompt with the result. Each step is a fresh LLM gamble. A 12-step task at 92% per-step accuracy lands at 37% end-to-end. Nothing about the chain composes — every step inherits all prior uncertainty.
agent.step_1  → 0.92
agent.step_12 → 0.37   // 0.92¹² ≈ 0.37
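The compounding above is just multiplication. A quick check in TypeScript (this is the failure model from the text, not anything from the Plan SDK):

```typescript
// End-to-end success of a linear chain is the product of per-step success.
const perStep = 0.92;
const steps = 12;
const endToEnd = perStep ** steps;
console.log(endToEnd.toFixed(2)); // "0.37"
```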
No notion of risk.
The agent treats db.read and stripe.charge the same way. Irreversible actions trigger at the same confidence threshold as queries. There is no policy on which steps require review — the framework can't even express the question.
send_invoice(amount=99999) ✓ // no gate, no rollback, no log
Failures restart everything.
When step 9 of 12 fails — API timeout, schema drift, rate limit — most agents either crash or re-run the whole prompt. There's no plan structure to mutate. No concept of "redo just this subgraph with fresh context and the rest of the work intact."
retry → full prompt replay · cost ×12 · latency ×12 · 8 nodes of progress wasted
From goal
to typed DAG.
Plan synthesis is a five-stage compiler. A goal becomes a directed acyclic graph of typed nodes — each with a precondition, a risk tier, an expected tool, and a rollback. Recall provides the grounding context. The plan is not text. It is structure you can inspect, diff, and version.
Parse · GOAL → INTENT
Goal string → typed Intent record. Extracts the verb (do / find / decide), the object, the constraints, and the success predicate. Ambiguous goals are rejected before any tool call — the planner refuses to synthesize what it can't measure.
Ground · RECALL CONTEXT
Pulls relevant memories from Recall — user preferences, prior decisions, entity facts, last-N tool outcomes. Synthesis is grounded in what the agent already knows; planning blind, without that context, is the single biggest cause of bad plans.
Decompose · INTENT → DAG
Recursive goal-decomposition until every leaf maps to a single tool. Edges encode data dependency, not just order — siblings can run in parallel. Cycles are rejected. Depth is bounded; over-decomposition surfaces as a synthesis error.
Type · RISK + ROLLBACK
Each node is assigned a risk tier (READ / WRITE / IRREVERSIBLE / HUMAN-IN-LOOP) and a rollback action where one exists. The policy engine attaches gates — confidence thresholds, dollar limits, approval requirements. Tiering happens before binding, so policy can refuse a plan.
Bind · NODE → MCP TOOL
Each leaf is bound to a concrete MCP tool. Parameters are resolved from upstream node outputs and Recall context. Type-checked against the tool schema. Missing parameters surface as planning errors before execution starts — never as runtime null-arg crashes.
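The five stages compose like a small compiler pipeline. Here is a toy TypeScript skeleton of that shape, with stub stand-ins for every phase; names and return types are illustrative, not the actual @arc-labs/plan internals:

```typescript
// Toy skeleton of the five-stage synthesis compiler. Every stage is a stub.
type Intent = { verb: string; object: string };
type PlanNode = { id: string; tool: string | null; deps: string[]; tier?: string };

// 1. Parse: goal string → typed Intent; ambiguous goals are rejected.
function parse(goal: string): Intent {
  const [verb, ...rest] = goal.trim().split(/\s+/);
  if (!verb || rest.length === 0) throw new Error("ambiguous goal: rejected");
  return { verb, object: rest.join(" ") };
}

// 2. Ground: pull relevant context (stand-in for a Recall query).
function ground(intent: Intent): string[] {
  return [`memory about ${intent.object}`];
}

// 3. Decompose: intent → DAG whose leaves are tool-sized steps.
function decompose(intent: Intent): PlanNode[] {
  return [
    { id: "n1", tool: null, deps: [] },
    { id: "n2", tool: null, deps: ["n1"] },
  ];
}

// 4. Type: assign a risk tier (the real system also attaches rollbacks/gates).
function typeNodes(dag: PlanNode[]): void {
  for (const n of dag) n.tier = "READ";
}

// 5. Bind: map each leaf to a concrete tool; unresolved params would fail here.
function bind(dag: PlanNode[], ctx: string[]): void {
  const tools = ["recall.search", "calendar.free"];
  dag.forEach((n, i) => (n.tool = tools[i]));
}

function synthesize(goal: string): PlanNode[] {
  const intent = parse(goal);
  const ctx = ground(intent);
  const dag = decompose(intent);
  typeNodes(dag);
  bind(dag, ctx);
  return dag;
}

console.log(synthesize("schedule call with Casey").map(n => n.tool));
```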
Every node is eleven fields.
A Plan node is not a string. It's a typed record. Eleven fields describe it; every one of them is queryable, diffable, and persisted with the receipt.
REVERSIBILITY
Every WRITE declares its undo. If the tool can't define a rollback, the planner promotes the node to IRREV.
PARAM RESOLUTION
$n1.id is a reference, not a substitution. The executor passes the actual upstream value at runtime.
IDEMPOTENCY
Every node carries an idempotency key. The executor short-circuits on replay if the key has been seen.
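Putting the cards together, a hedged sketch of the node record in TypeScript. The page names most of these fields directly (id, tool, args, tier, deps, gate, precondition, rollback, idempotency key); `confidence` and `receiptId` are guesses added only to illustrate an eleven-field shape, not the published schema:

```typescript
// Illustrative node record; two field names are guesses (marked below).
interface PlanNode {
  id: string;
  tool: string;                   // bound MCP tool
  args: Record<string, unknown>;  // "$n1.id" stays a reference until runtime
  tier: "READ" | "WRITE" | "IRREV" | "HUMAN";
  deps: string[];                 // data dependencies, not just order
  gate?: { approval?: boolean; maxDollars?: number };
  precondition?: string;          // checked before execution
  rollback?: string;              // undo action; absent WRITE ⇒ promoted to IRREV
  idempotencyKey: string;         // executor short-circuits on replay
  confidence?: number;            // guess: synthesis-time estimate
  receiptId?: string;             // guess: link to the persisted receipt
}

const n3: PlanNode = {
  id: "n3", tool: "email.send", args: { body: "$n2" },
  tier: "IRREV", deps: ["n2"], gate: { approval: true },
  idempotencyKey: "plan_3f2a:n3",
};
console.log(n3.tier); // "IRREV"
```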
Not every step
is equal.
Plan classifies every node into one of four tiers, each with its own confidence requirement and policy.
Read
Queries, retrievals, computations. No external mutation. The cheapest tier — runs autonomously.
Write
Internal state mutations. The agent's own database, the user's draft folder. Reversible via undo.
Irreversible
Sends an email, deploys a build, charges a card. No rollback. Policy-gated and opt-in.
Human
Pauses the plan. Surfaces the node, its inputs, and predicted effect. Blocking.
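One way the four tiers might translate into gate checks, sketched with hypothetical confidence thresholds (the real gates live in policy configuration, not code):

```typescript
// Tier → policy mapping; the threshold numbers here are invented examples.
enum Tier { READ = "READ", WRITE = "WRITE", IRREV = "IRREV", HUMAN = "HUMAN" }

const policy: Record<Tier, { minConfidence: number; approval: boolean }> = {
  [Tier.READ]:  { minConfidence: 0.5,  approval: false }, // runs autonomously
  [Tier.WRITE]: { minConfidence: 0.8,  approval: false }, // reversible via undo
  [Tier.IRREV]: { minConfidence: 0.95, approval: true  }, // opt-in, gated
  [Tier.HUMAN]: { minConfidence: 1.0,  approval: true  }, // always blocks
};

function mayRun(tier: Tier, confidence: number, approved: boolean): boolean {
  const gate = policy[tier];
  return confidence >= gate.minConfidence && (!gate.approval || approved);
}

console.log(mayRun(Tier.READ, 0.7, false));   // true
console.log(mayRun(Tier.IRREV, 0.99, false)); // false: needs approval
```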
Plans are code,
not prose.
A Plan is a serializable object. Every node is bound to an MCP tool with concrete parameters. The agent doesn't "decide what to do next" — it executes a graph it can prove is well-formed.
// goal: "schedule a call with Casey next week"
Plan {
  id: "plan_3f2a",
  goal: "schedule call with Casey",
  nodes: [
    { id: "n1", tool: "recall.search",
      args: { q: "Casey" },
      tier: READ, deps: [] },
    { id: "n2", tool: "calendar.free",
      args: { who: "$n1.id" },
      tier: READ, deps: ["n1"] },
    { id: "n3", tool: "email.send",
      args: { body: "$n2" },
      tier: IRREV,
      gate: { approval: true },
      deps: ["n2"] }
  ]
}
n1 → Returned 3 entities. Best match: person/casey-park
n2 → 4 free 30-min slots. Tue 2pm, Wed 10am, Thu 3pm, Fri 11am.
n3 → Blocked on human gate. Will surface preview when n2 completes.
Parallel where it can,
serial where it must.
The DAG is the schedule. Nodes with disjoint dependency closures run in parallel automatically.
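A sketch of what "the DAG is the schedule" means in practice: repeatedly take every node whose dependencies are already satisfied and run that whole set as one parallel wave. The node shape is illustrative, not the Plan type:

```typescript
// Derive parallel execution waves from dependency edges.
type DagNode = { id: string; deps: string[] };

function waves(nodes: DagNode[]): string[][] {
  const done = new Set<string>();
  const out: string[][] = [];
  let remaining = [...nodes];
  while (remaining.length > 0) {
    // Every node whose deps are all satisfied can run now, in parallel.
    const ready = remaining.filter(n => n.deps.every(d => done.has(d)));
    if (ready.length === 0) throw new Error("cycle detected");
    out.push(ready.map(n => n.id));
    ready.forEach(n => done.add(n.id));
    remaining = remaining.filter(n => !ready.includes(n));
  }
  return out;
}

// n1 and n2 have disjoint dependency closures → same wave; n3 waits on both.
console.log(waves([
  { id: "n1", deps: [] },
  { id: "n2", deps: [] },
  { id: "n3", deps: ["n1", "n2"] },
])); // [["n1","n2"], ["n3"]]
```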
Failure is a signal,
not a stack trace.
When a node fails, the planner doesn't crash and doesn't replay the whole prompt. It mutates the affected subgraph in place and resumes. Upstream work is preserved. Downstream work is invalidated and re-typed.
SCOPE
Subgraph diff, not full replay. Only nodes whose dependency closure includes the failed node get re-typed.
COST
Average reduction in re-execution cost across the bench suite.
LATENCY
Time-to-recover from a single mid-plan failure.
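The SCOPE rule (re-type only nodes whose dependency closure includes the failure) can be sketched as a reachability walk over the dependency edges. Types here are illustrative, not the Plan API:

```typescript
// Compute the downstream closure of a failed node: that subgraph is
// invalidated and re-typed; every other node's work is preserved.
type DagNode = { id: string; deps: string[] };

function invalidated(nodes: DagNode[], failed: string): Set<string> {
  const hit = new Set([failed]);
  let grew = true;
  while (grew) {
    grew = false;
    for (const n of nodes) {
      if (!hit.has(n.id) && n.deps.some(d => hit.has(d))) {
        hit.add(n.id);
        grew = true;
      }
    }
  }
  return hit;
}

const plan: DagNode[] = [
  { id: "n1", deps: [] },
  { id: "n2", deps: ["n1"] },
  { id: "n3", deps: ["n2"] },
  { id: "n4", deps: [] }, // independent branch: survives the failure
];
console.log([...invalidated(plan, "n2")]); // ["n2", "n3"] — n1 and n4 keep their results
```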
The numbers we're chasing.
Plan v0.1 alpha targets below. Numbers are from our internal bench suite — 240 multi-tool agentic tasks.
END-TO-END TASK ACCURACY
% tasks completed without operator intervention
UNSAFE-ACTION RATE
% runs that fired an irreversible tool without gate
Same DNA
as Recall.
Plan ships under the same engineering principles as the rest of the Arc Labs cognitive stack. Rust core. Open core.
The synthesis compiler, the DAG executor, the policy engine, the rollback machine. Single binary. Zero-allocation in the hot path.
Plan is not useful without memory. Synthesis pulls from Recall for grounding; execution writes outcomes back.
Tools are MCP servers. Plan reads their schemas and validates parameter resolution at synthesis time.
Policies are configuration, not prompts. A YAML file declares which tiers gate, dollar limits, and approval requirements.
First-class clients in Node, Python, and Go. Inspect, edit, replay plans from any runtime.
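The policy-as-configuration idea might look something like this in practice. Every key name below is a guess for illustration, not the shipped schema:

```yaml
# Hypothetical policy.yaml; all keys are illustrative.
tiers:
  read:
    gate: false            # runs autonomously
  write:
    gate: false
    require_rollback: true # no undo ⇒ promoted to irreversible
  irreversible:
    gate: true
    approval: human
    max_dollars: 500       # example dollar limit
  human:
    gate: true
    blocking: true
```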
Plan vs.
the alternatives.
Most "agent frameworks" ship orchestration code, not a planner. Plan is the missing primitive — it's what should sit between LangGraph and your tools.
| Feature | Plan | LangGraph | AutoGen | ReAct |
|---|---|---|---|---|
| Plan structure (how steps relate) | Typed DAG | Hand-coded graph | Conversation loop | Linear chain |
| Risk tiers (node-level policy) | 4 tiers, declarative | None | None | None |
| Replan on failure (subgraph-level) | In-place mutation | Manual edge re-route | Re-prompt loop | Restart prompt |
| Memory grounding (context at synthesis) | Recall-native | BYO retriever | BYO retriever | None |
| Human gates (approval flow) | Built-in, typed | Custom node | Custom message | None |
Q3 2026.
Closed alpha.
Plan v0.1 ships to ~40 design partners building agent products in production. If you have a real workload — not a demo — we'd like to hear from you. Tell us your tools, your tiers, and the one task that has to ship.
import { Planner } from "@arc-labs/plan";
import { Recall } from "@arc-labs/recall";
const planner = new Planner({
  recall: new Recall(),
  tools: mcpFleet,
  policy: "./policy.yaml",
});

const plan = await planner.synthesize({
  goal: "schedule call with Casey next week",
});

// inspect it before you run it
console.log(plan.nodes);

await planner.execute(plan);