MARCO: the loop inside a harness, in code
Update — May 2026. Since this post went up, MARCO has been published as two npm packages:
`marco-harness` (the foundation described here) and `marco-agent` (a practical agent built on top). For the story behind both, plus the mental model the build crystallized, see A mental model for Claude Code (and every other modern agent).
Earlier this month I wrote that an AI agent is best understood as a loop inside a harness — not a list of parts (model, tools, memory, reasoning, human-in-the-loop).
That list names the parts. It hides the engine.
This week I shipped the companion code.
MARCO — Model-Agnostic Runtime for Controlled Orchestration. A small TypeScript agent harness whose whole point is to make the architectural split literal. Around 800 lines of core harness; another 600 or so in a worked example. Two runtime dependencies (@anthropic-ai/sdk and zod). Readable in an afternoon.
This post describes MARCO v0.1.0, released 2026-04-22. All code and file references below point to that tag. The repo will evolve; the post will not — so nothing here goes stale.
The original article walked the control flow in diagrams. This post walks the same control flow in code. Same beats, same order: start with the loop, watch the harness wrap it one moment at a time, then name the moments.
Start with the loop
The inner loop is one function. Around 120 lines. Each iteration does three things:
- Optionally transform the messages before the model call (we’ll get to how in a minute).
- Call the provider. Consume the stream. Capture the terminal message.
- Route on the model’s stop reason. If the model is done, return. If the model wanted a tool, run the tool and go around again.
Here it is:
while (iteration < maxIterations) {
// Phase 1 — optional harness overrides
const harnessOverrides = await runHook(hooks.beforeModelCall, { messages, iteration, runId })
if (harnessOverrides?.abort) return { status: 'aborted', ... }
messages = harnessOverrides?.messages ?? messages
config = harnessOverrides?.modelConfig ?? config
// Phase 2 — call the model, consume the stream, capture the terminal message
let assistantMessage
for await (const event of provider.stream(messages, toolSpecs, config)) {
if (event.type === 'message_end') assistantMessage = event.message
}
messages = [...messages, assistantMessage]
iteration += 1
// Phase 3 — route on stop reason
switch (assistantMessage.stopReason) {
case 'end_turn':
case 'max_tokens':
return { status: 'completed', finalMessage: assistantMessage, messages, iterations: iteration }
case 'error':
case 'safety':
return { status: 'errored', ... }
case 'tool_use': {
const toolResults = []
for (const call of assistantMessage.toolCalls) {
toolResults.push(await requestToolCall(call))
}
messages = [...messages, ...toolResults]
break // continue the while loop
}
}
}

Two things about that code earn attention.
The tool_use case uses break, not return. The loop keeps going — which is what makes it an agent loop and not a single-shot completion. Model asks for a tool. Tool runs. Result goes back in. The loop takes another turn.
The loop never executes a tool itself. It calls requestToolCall(call) — a function it received as a dependency. The loop doesn’t know where that function came from, what it does, or whether permission was granted. Tool execution is someone else’s job.
That’s it. That’s the engine. Every subsequent section of this post is about the someone else — the harness that supplies requestToolCall, handles everything the loop punted on, and turns a 120-line iterator into a usable system.
What the loop can’t do alone
Re-read that code. Now notice everything it’s missing.
It doesn’t own a tool registry. toolSpecs and requestToolCall arrive as inputs. Something else knows which tools exist.
It doesn’t pick a model. modelConfig comes in.
It doesn’t authenticate or authorize. A call to runInnerLoop means the run is already allowed. Whoever decided that isn’t in this file.
It doesn’t persist anything. Messages accumulate in memory. When the loop returns, the history disappears unless someone catches it on the way out.
It doesn’t deliver the result. The loop returns {status: 'completed', finalMessage}. What happens next — streamed to a terminal, posted to a webhook, scheduled for a follow-up — isn’t specified here.
It doesn’t know anything about the user. No identity, no tenant, no budget, no rate limit.
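Seen as a single type, those gaps become explicit. Here's a sketch of the loop's dependency surface, with illustrative names rather than the repo's exact types:

```ts
// Illustrative types only; the repo's real definitions are richer.
type Message = { role: 'system' | 'user' | 'assistant' | 'tool'; text?: string }
type ToolCall = { id: string; name: string; input: unknown }
type ToolResultMessage = { role: 'tool'; toolCallId: string; content: string; isError: boolean }

// Everything the loop "can't do alone" arrives as a dependency.
interface InnerLoopDeps {
  provider: {                                   // picked by the harness, not the loop
    stream(messages: Message[], toolSpecs: object[], config: object): AsyncIterable<{ type: string; message?: Message }>
  }
  toolSpecs: object[]                           // which tools exist: someone else's registry
  requestToolCall: (call: ToolCall) => Promise<ToolResultMessage>  // execution, gated elsewhere
  modelConfig: { model: string; maxTokens: number }                // model choice comes in
  maxIterations: number                         // budget set by the harness
}
```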
Every gap has an answer, and the harness has them — not as an abstract concept but as a set of things that happen at specific moments. Here’s the harness doing its job, moment by moment, across one real turn.
One turn, step by step
To see the harness doing its job, we need something concrete to point at. MARCO ships with one worked example: mini-claude-code, a deliberately narrow CLI coding agent with four tools — bash, read, write, edit. It streams tokens to the terminal as the model produces them. It prompts before running anything that mutates state. It persists each conversation to a JSONL file on disk, so you can resume a session across restarts. It’s the minimum version of a coding assistant that still exercises the full loop-and-harness architecture this post is about — and the examples/mini-claude-code/ directory in the repo is where every piece of code referenced from here on actually lives.
A user types read README.md and summarize it into mini-claude-code, running from a TypeScript project directory that happens to have its own CLAUDE.md at the root. Here’s what happens, in order. Each moment where the harness steps in gets a name — and a snapshot of what the state looks like right after, so you can watch the message list grow as the turn unfolds.
Before any hook fires
The harness has received the trigger. A runId is generated. The trigger is converted into a single seed message. The default modelConfig is in hand. No other work has happened yet — no system prompt, no session history, no CLAUDE.md. This is the raw starting point the first hook will see.
State before onRunStart:
{
"runId": "run_8f2a",
"iteration": 0,
"modelConfig": { "model": "claude-sonnet-4-6", "maxTokens": 4096 },
"trigger": { "kind": "user_message", "text": "read README.md and summarize it" },
"messages": [
{ "role": "user", "text": "read README.md and summarize it" }
]
}

Hydration: onRunStart
Something has to hydrate state. Load the prior conversation from disk so the agent has context. Prepend a system prompt. Pick up any project-level instructions from CLAUDE.md in the current directory. Check whether this user is allowed to run at all. Pick which model to call.
That’s a lot of work — but it all happens in a single moment, right before the loop begins. In MARCO that moment is a function called onRunStart. The harness fires it once; it receives the trigger and the seed messages; it returns the final starting state — or it rejects the run before it ever touches the model.
In mini-claude-code, onRunStart re-reads the JSONL session file at .marco/sessions/<id>.jsonl fresh each turn, prepends the system prompt, auto-loads ./CLAUDE.md as a second system message if one exists in the current directory, and appends the new user message to the session log on disk.
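Here's a sketch of that hook in the spirit of mini-claude-code. The `({ messages }) => ({ messages })` shape and the constants are assumptions for illustration, not the repo's exact API:

```ts
import { appendFileSync, existsSync, mkdirSync, readFileSync } from 'node:fs'

type Message = { role: string; text: string }

// Both constants are stand-ins; the real session id and prompt live in the example.
const SESSION = '.marco/sessions/run_8f2a.jsonl'
const SYSTEM_PROMPT = 'You are a helpful coding assistant. You have four tools: bash, read, write, edit. Be concise.'

async function onRunStart({ messages }: { messages: Message[] }) {
  // Re-read the session log fresh each turn; hydrating from a snapshot
  // captured at startup is exactly the stale-history bug described later.
  const raw = existsSync(SESSION) ? readFileSync(SESSION, 'utf8').trim() : ''
  const history: Message[] = raw ? raw.split('\n').map(line => JSON.parse(line)) : []

  const system: Message[] = [{ role: 'system', text: SYSTEM_PROMPT }]
  // Auto-load project instructions if the working directory has a CLAUDE.md.
  if (existsSync('./CLAUDE.md')) {
    system.push({ role: 'system', text: `Project context from CLAUDE.md:\n\n${readFileSync('./CLAUDE.md', 'utf8')}` })
  }

  // Append the new user message(s) so the next turn can hydrate from disk.
  mkdirSync('.marco/sessions', { recursive: true })
  for (const m of messages) appendFileSync(SESSION, JSON.stringify(m) + '\n')

  return { messages: [...system, ...history, ...messages] }
}
```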
State after onRunStart:
{
"runId": "run_8f2a",
"iteration": 0,
"modelConfig": { "model": "claude-sonnet-4-6", "maxTokens": 4096 },
"messages": [
{ "role": "system", "text": "You are a helpful coding assistant. You have four tools: bash, read, write, edit. Be concise. Prefer reading before writing..." },
{ "role": "system", "text": "Project context from CLAUDE.md:\n\n# My Project\n\nTypeScript library using Vitest. Run `npm test` to verify changes. Keep dependencies minimal — anything new needs a written justification." },
{ "role": "user", "text": "read README.md and summarize it" }
]
}

Notice what changed between before and after onRunStart: one user message became three messages, with the system prompt and the project's CLAUDE.md loaded into context. That's real work — and it's why the mini-claude-code name fits: picking up CLAUDE.md is something Claude Code does out of the box.
Before the first model call
The loop begins. Iteration 1 starts. But before the provider gets called, there’s another possible intervention point: what if the message history is too long to fit in the context window? What if you want to use a cheaper model for this specific iteration? What if the iteration budget just ran out?
That moment is beforeModelCall. It fires on every iteration — so you get N chances to shape the model call, not just one. In mini-claude-code it’s a no-op. In a production harness, context compaction lives here.
beforeModelCall returned: undefined — the minimal example doesn’t need this hook.
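For a concrete feel, here's a minimal sketch of the budget-guard variant mentioned above. The `{ abort, abortReason }` return shape mirrors what the loop code consumes; the rest of the signature is an assumption:

```ts
// Fires every iteration; returning undefined means "proceed unchanged".
async function beforeModelCall({ iteration }: { iteration: number }) {
  if (iteration >= 10) {
    return { abort: true, abortReason: 'iteration budget (10) exceeded' }
  }
  return undefined
}
```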
In a production harness, this is where context compaction lives. It might return something like this to compact a long history:
{
"messages": [
{ "role": "system", "text": "You are a helpful coding assistant..." },
{ "role": "system", "text": "Earlier in this conversation: user asked about X, agent did Y, result was Z. [summary of 12 older turns]" },
{ "role": "user", "text": "read README.md and summarize it" }
]
}
// Or to downgrade the model for a cheap iteration:
// { "messages": [...], "modelConfig": { "model": "claude-haiku-4-5" } }
// Or to abort when an iteration budget is exceeded:
// { "messages": [...], "abort": true, "abortReason": "iteration budget (10) exceeded" }The model runs
Provider streams events. Model emits a short preamble of text and then a tool call: read({ path: 'README.md' }). The loop captures the terminal message_end and sees stopReason: 'tool_use'.
State after the assistant message is appended:
{
"iteration": 1,
"messages": [
// ...system prompt, CLAUDE.md system message, and user message above (3 messages)...
{
"role": "assistant",
"text": "I'll read the README and summarize it.",
"toolCalls": [
{ "id": "call_1", "name": "read", "input": { "path": "README.md" } }
],
"stopReason": "tool_use",
"usage": { "inputTokens": 180, "outputTokens": 28 }
}
]
}

Before the tool actually runs
The loop wants to call requestToolCall. But before the tool handler actually fires, something should get to decide whether it’s allowed. Is read auto-approved? Does bash need a human [y/N] confirm? Should write show the full file content first? That decision is tool-specific — you want different policies for different operations.
That moment is beforeToolCall. It receives the tool call and returns a three-way decision: execute, deny, or short-circuit with a canned result. In mini-claude-code this hook implements the whole permission UX — and it’s different for each of the four tools. Here, the tool is read, so the hook returns { decision: 'execute' } without prompting.
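A sketch of that per-tool policy: the three-way decision follows the description above, while the shared readline and exact field names are illustrative:

```ts
import { createInterface } from 'node:readline/promises'

// One readline, created once and reused (see the REPL bug later in this post).
const rl = createInterface({ input: process.stdin, output: process.stdout })

const AUTO_APPROVED = new Set(['read'])

async function beforeToolCall({ call }: { call: { name: string; input: unknown } }) {
  // Read-only tools pass without prompting.
  if (AUTO_APPROVED.has(call.name)) return { decision: 'execute' as const }

  // Mutating tools get a human [y/N] confirm before the handler fires.
  const answer = await rl.question(`Run ${call.name}(${JSON.stringify(call.input)})? [y/N] `)
  return answer.trim().toLowerCase() === 'y'
    ? { decision: 'execute' as const }
    : { decision: 'deny' as const, reason: 'user declined' }
}
```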
beforeToolCall returned:
{ "decision": "execute" }The tool runs
read handler opens README.md and returns its contents as a string.
Tool handler output (truncated):
# MARCO
**M**odel-**A**gnostic **R**untime for **C**ontrolled **O**rchestration.
A small TypeScript AI agent harness. Companion code to [*How AI agents work: a control flow breakdown*]...

Before the result goes back into context
The tool returned something. But maybe it returned too much (a 10,000-line file you want to summarize first). Maybe it returned something sensitive (API keys in the output you want to redact). Maybe you want to log every tool result for observability.
That moment is afterToolResult. No-op in this example; this is where redaction or logging would live in a production system.
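A sketch of the redaction variant: the `{ result, isError }` return shape matches the example below, while the regex and hook signature are illustrative:

```ts
// Strip anything that looks like a secret before the result re-enters context.
async function afterToolResult({ result }: { result: string }) {
  const redacted = result.replace(/([A-Z0-9_]*(?:KEY|TOKEN|SECRET|PASSWORD|URL))=(\S+)/g, '$1=[REDACTED]')
  // Returning undefined means "pass the result through unchanged".
  return redacted === result ? undefined : { result: redacted, isError: false }
}
```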
afterToolResult returned: undefined — tool result passes through unchanged.
In a production harness, this is where redaction or output-shaping lives. It might return something like this to strip secrets:
{
"result": "OPENAI_API_KEY=[REDACTED]\nDATABASE_URL=[REDACTED]\n... rest of file ...",
"isError": false
}
// Or to summarize a huge output that would otherwise blow the context window:
// { "result": "[file content summarized: 12,400 lines of generated code]", "isError": false }State after the tool result is appended:
{
"iteration": 1,
"messages": [
// ...system, CLAUDE.md, user, assistant(tool_use) above (4 messages)...
{
"role": "tool",
"toolCallId": "call_1",
"content": "# MARCO\n\n**M**odel-**A**gnostic **R**untime for **C**ontrolled...",
"isError": false
}
]
}

Iteration 2 begins
The tool result is now in the message list. The loop increments iteration and goes around again. beforeModelCall fires a second time — same hook, same options, no-op here. The provider runs with the file contents in context. Model produces a summary. Stops with end_turn.
State after the second assistant message is appended:
{
"iteration": 2,
"messages": [
// ...system, CLAUDE.md, user, assistant(tool_use), tool result above (5 messages)...
{
"role": "assistant",
"text": "MARCO is a small TypeScript AI agent harness that makes the loop-inside-a-harness architecture literal in an API...",
"toolCalls": [],
"stopReason": "end_turn",
"usage": { "inputTokens": 540, "outputTokens": 65 }
}
]
}

After the loop returns
The loop returns { status: 'completed', finalMessage }. The run is over — but something still has to happen. Persist the assistant message to disk so the next turn can hydrate from it. Write any error to the user’s terminal. Record the cost. Schedule a follow-up run if one was queued.
That moment is onRunEnd. Fires once — whether the run completed, errored, or was aborted — with the final message, status, error, and full message log. In mini-claude-code it appends the assistant message to the JSONL file and, if the run errored, writes a [error] line to stdout.
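A sketch of that hook for mini-claude-code's case. Field names follow the snapshots in this post rather than the repo verbatim:

```ts
import { appendFileSync } from 'node:fs'

async function onRunEnd({ status, finalMessage, error }: {
  status: 'completed' | 'errored' | 'aborted'
  finalMessage?: { role: string; text: string }
  error?: Error
}) {
  // Persist the assistant's reply so the next turn's onRunStart can hydrate it.
  if (finalMessage) appendFileSync('.marco/sessions/run_8f2a.jsonl', JSON.stringify(finalMessage) + '\n')
  // Surface failures to the terminal instead of swallowing them.
  if (status === 'errored') process.stdout.write(`[error] ${error?.message ?? 'unknown error'}\n`)
}
```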
Final result returned by Harness.run():
{
"status": "completed",
"iterations": 2,
"finalMessage": {
"role": "assistant",
"text": "MARCO is a small TypeScript AI agent harness...",
"stopReason": "end_turn"
},
"messages": [ /* full log of the turn — 6 messages total: system, CLAUDE.md, user, assistant(tool_use), tool result, assistant(end_turn) */ ]
}

Five moments, five hooks
That's every hook MARCO exposes: five named moments where the harness does work the loop can't. Here they are as a flow — the amber nodes are the harness stepping in; the blue nodes are the loop doing its own work:
And as a reference:
| Hook | When it fires | Role |
|---|---|---|
| `onRunStart` | Once, before the loop | Hydrate state, authenticate, pick the model, reject if blocked |
| `beforeModelCall` | Every iteration | Transform messages, compact context, swap model, abort |
| `beforeToolCall` | Every tool call | Gate execution — approve, deny, or short-circuit |
| `afterToolResult` | Every tool result | Transform the result before it re-enters context |
| `onRunEnd` | Once, after the loop | Persist, deliver, schedule, observe |
Four map directly to nodes in the original article’s outer-loop diagram. beforeModelCall is MARCO’s addition: iteration-level concerns (compaction, per-call model routing) the original article didn’t name.
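Wired together, a harness configuration might look something like this. The `Harness` constructor shape is a sketch (the pieces are `declare`'d stand-ins referring to the sketches above), not the published API:

```ts
// Stand-ins so the sketch type-checks; real implementations live in the repo.
declare const Harness: new (opts: object) => { run(trigger: object): Promise<object> }
declare const anthropicProvider: object                                // wraps @anthropic-ai/sdk behind the provider interface
declare const bash: object, read: object, write: object, edit: object  // tool specs + handlers
declare const onRunStart: Function, beforeModelCall: Function,
  beforeToolCall: Function, afterToolResult: Function, onRunEnd: Function

const harness = new Harness({
  provider: anthropicProvider,
  tools: [bash, read, write, edit],
  modelConfig: { model: 'claude-sonnet-4-6', maxTokens: 4096 },
  maxIterations: 10,
  hooks: {
    onRunStart,       // once, before the loop: hydrate, authenticate, pick model
    beforeModelCall,  // every iteration: compact, swap model, abort
    beforeToolCall,   // every tool call: approve, deny, or short-circuit
    afterToolResult,  // every tool result: redact, truncate, log
    onRunEnd,         // once, after the loop: persist, deliver, observe
  },
})

const result = await harness.run({ kind: 'user_message', text: 'read README.md and summarize it' })
```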
Non-goals, and what each one protects
Every agent framework ships with a feature list. MARCO ships with a refusal list — but each refusal earns its place by protecting one specific aspect of the loop/harness split:
- Not durable. The harness controls run lifecycle, but cross-crash state persistence requires an external store that owns the durable-execution contract. Giving MARCO that responsibility would blur the loop/harness split this whole post is about. Compose with Inngest or Temporal when you need it.
- Not multi-agent. Handoffs are tool calls in MARCO’s model — one agent’s tool is another agent’s entry point. A dedicated multi-agent orchestrator owns which agent is the current agent, which is a different responsibility than the one the harness codifies.
- No memory backend. Memory isn’t a primitive in the original article’s model — it’s state managed through tools. Shipping a backend would privilege one shape of memory (vector store? graph? episodic log?) over the others and make MARCO opinionated in a dimension the harness doesn’t need to be.
- No eval framework. Evals evaluate the agent’s outputs; the harness runs the agent. Two artifacts, two design pressures. Keeping them separate lets each stay sharp.
- No production observability built in. The five hooks are the observability surface. OTel, Datadog, Honeycomb — those are adapters you write into `afterToolResult` and `onRunEnd`. Codifying one would pick winners in a space where the right answer is context-dependent.
- No RAG stack. Retrieval is a tool concern. A `search_docs` tool handler can do vector lookup, grep, a full RAG pipeline, or a fourth thing nobody's thought of yet. The loop doesn't need to know.
Each exclusion preserves one specific boundary the harness relies on to stay thin. The refusal list isn’t taste — it’s what the architecture is made of.
What MARCO bought me when things broke
The architecture earns its keep on the bad days, not the good ones. Three bugs surfaced while dogfooding mini-claude-code — and in each case, where MARCO drew its lines turned a vague failure into a single suspect.
- First real API call crashed; my provider tests had passed. The harness ran, the loop ran, and then the Anthropic SDK's streaming events came back in a shape my fixtures hadn't predicted. Because the entire SDK contract lives behind one interface — `ModelProvider.stream` in `src/provider.ts` — there was exactly one file the bug could be in. Twenty minutes from stack trace to a four-line fix in the event accumulator. The narrow interface wasn't a style choice; it was the diagnostic.
- Confirming a `bash` call killed the outer REPL. `mini-claude-code` would silently exit after a successful `[y/N]` instead of prompting for the next message. Permission UX lives entirely inside one named hook — `beforeToolCall` — so the suspect list was one function. The hook was creating a fresh readline per confirm and closing it, which cascaded through shared stdin and took the REPL down with it. The fix was a single readline created once in `bin.ts` and handed to the hook via closure. The harness, the loop, and every other hook stayed untouched — exactly what bounded blast radius is supposed to feel like.
- The agent kept forgetting earlier turns. Conversations persisted to JSONL fine, but every new turn was hydrating from a stale snapshot captured at startup. `onRunStart` is the only place state hydrates in MARCO — so when history was missing, one function was wrong. Fix: re-read the JSONL fresh inside `onRunStart` each turn. The bug only existed because v0.1.0 deliberately leaves session management to the consumer (memory is tools, not a primitive) — but the named hook gave the consumer one obvious right place to put it.
The pattern: each bug had one suspect because of where MARCO drew its lines. That’s the architectural payoff — not “here’s a feature list,” but “when something breaks, you know which file owns it.” Each bug gets a standalone follow-up post.
A rubric, not a pitch
Read the code if you want to understand the architecture.
But the real takeaway isn’t “read MARCO.” It’s a test you can apply to any agent system from now on. Next time you read an agent library’s README, count three things:
- How many named extension points exist. Two means the library is hiding the harness behind a monolithic `Agent` class — you won't be able to intervene without patching internals. Fifteen means the abstractions are drowning the loop. The right number is small enough to memorize, large enough to intervene.
- Where human approval lives. If the answer is "configure the Agent with an `approval_callback`," the library isn't distinguishing between permission gates (harness policy, enforced regardless of what the model asks) and clarification tools (model-initiated, requiring a specific tool registration). A library that collapses those two is one that hasn't thought about the distinction.
- What the library says it is not. Every serious framework should have a refusal list. If it doesn't, the scope is unbounded — and you're committing in advance to whatever the maintainers decide to add next.
That rubric will tell you more in thirty seconds than any feature table. Count the hooks. Find the approval. Read the refusals. Then decide.
Run it locally:
git clone --branch v0.1.0 git@github.com:pyrotank41/MARCO.git
cd MARCO && npm install
ANTHROPIC_API_KEY=sk-... npm run example

- Repo (this post's pinned version): github.com/pyrotank41/MARCO @ v0.1.0
- Original article: How AI agents work: a control flow breakdown