Expressiveness vs Tractability in Tool Design

In the previous breadcrumb, I argued that a task is agent-shaped when the path from start to end is locally verifiable, locally coherent, and convergent. I also briefly mentioned meta-tasks: the idea that an agent might succeed at a task not by approximating the solution directly, but by recognizing the problem type and constructing a solver for it. I said that was worth its own post. This is a first step in that direction.

The question I want to get at is this: when you’re building a harness for an agent, how should you design the tools you give it? It’s one of the core engineering decisions in harness design, and I think it turns on a tradeoff that hasn’t been articulated crisply enough.

The tradeoff

On one end, you can give the agent a narrow, structured set of tools. The action space is small, every action is well-defined, and the task is shaped in a way that’s easy to reason about. I’ll call this end of the axis tractability: how constrained and search-friendly the interface is.

This is where most current tool infrastructure lives. MCP¹ is built around well-typed, schema-defined tools with structured inputs and outputs.

On the other end, you can give the agent expressive, general-purpose capabilities. Read files, write code, execute programs. The action space balloons, but the agent gets the freedom to construct its own approach. I’ll call this end expressiveness: how much the agent can directly do, build, or delegate.

These pull against each other. More expressive tool surfaces give the agent more freedom, but the search space grows and gets noisier. More tractable surfaces constrain what the agent can do, but make good action selection easier, at least in principle.

One of the most dramatic inflection points in real and perceived agentic capability came from an engineering decision along this axis: “bash is all you need.”²

But that result came from coding, a domain where most tasks are already agent-shaped and the work itself is expressive by nature. Programmers construct their own approaches as a matter of course, and giving a coding agent shell access gives it a tool surface that matches the shape of the task. Most enterprise workflows look nothing like that. They exist because the work has already been decomposed into well-defined steps with clear decision points; the organization has already done the work of making the task tractable. In those settings, an expressive tool surface may be actively counterproductive. It invites the agent to re-solve a problem the organization has already solved, while making the system harder to audit, constrain, and deploy. The governance story may matter more than the capability story.

The recent SKILL.md spec³ is an interesting case because it doesn’t sit cleanly on this axis. A skill is a markdown file loaded into context that tells the agent how to do something, usually by leaning on expressive tools it already has. The action surface stays expressive; the instructions shape how the agent uses it. That is different from either end of the tradeoff I’m describing: it doesn’t narrow the tools (what the agent can do), it narrows the policy over an already expressive surface (what the agent actually does with it). I think there’s more to say about that in a future post.

Every time you choose what tools to expose to an agent, you’re making a design decision somewhere along this tradeoff.

Harness design surface

When you build a harness for an agent, you’re making decisions across at least three dimensions:

Context: what information the agent can see. System prompts, retrieved documents, conversation history, task descriptions.
Tools: what the agent can do. The functions it can call, their signatures, their side effects.
Control loop: how the agent’s execution is orchestrated. Turn limits, retry logic, checkpoints, escalation policies, multi-agent handoffs.

These interact, but they’re separable enough to study independently. In the experiment below, I hold context and control loop constant. The system prompt gives every agent the same task description. The control loop is the same turn-based loop with the same timeout. The model is the same across every run (gpt-5.2). The only thing that varies is the tool surface.

That isolates the question I want: along the axis of expressiveness vs tractability, how does the shape of the tool interface affect agent performance on a fixed task?
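To make the held-constant setup concrete, here is a minimal sketch of a harness in which context and control loop are fixed and only the tool surface is swapped. It is an illustration, not the code behind the experiment: `Harness`, `run`, the `max_turns` cap, the toy tool bodies, and the stub policy standing in for the model are all hypothetical.

```python
import time
from dataclasses import dataclass, replace
from typing import Callable, Dict, Optional, Tuple

Tool = Callable[..., str]
Action = Optional[Tuple[str, tuple]]  # (tool name, args), or None when finished


@dataclass(frozen=True)
class Harness:
    system_prompt: str       # context: same task description for every condition
    timeout_s: float         # control loop: same wall-clock budget
    max_turns: int           # control loop: assumed turn cap, not from the post
    tools: Dict[str, Tool]   # tool surface: the only field that varies


def run(harness: Harness, policy: Callable[[str, list, str], Action]) -> str:
    """Turn-based loop: ask the policy for a tool call, run it, feed back the result."""
    deadline = time.monotonic() + harness.timeout_s
    observation = ""
    for _ in range(harness.max_turns):
        if time.monotonic() >= deadline:
            return "timeout"
        action = policy(harness.system_prompt, sorted(harness.tools), observation)
        if action is None:
            return "done"
        name, args = action
        observation = harness.tools[name](*args)
    return "turn_limit"


# Two conditions that differ only in their tool surface (toy tool bodies).
primitives = Harness(
    system_prompt="Navigate the maze from start to goal.",
    timeout_s=60.0,
    max_turns=200,
    tools={"look_around": lambda: "open: south", "move": lambda d: "moved " + d},
)
expressive = replace(primitives, tools={"run_bash": lambda cmd: "ran " + cmd})


def stub_policy(prompt: str, tool_names: list, observation: str) -> Action:
    """Stand-in for the model: make one tool call, then stop."""
    if observation:
        return None
    name = tool_names[0]
    return (name, ("echo hi",)) if name == "run_bash" else (name, ())
```

Because `replace` copies every field except `tools`, any difference between `run(primitives, ...)` and `run(expressive, ...)` can only come from the tool surface, which is the isolation the experiment is after.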

The setup

The task is maze pathfinding. A start, a goal, a 2D grid with walls, a known optimal algorithm (A*). What varies across conditions is the agent’s interface to the problem.
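For reference, here is a minimal sketch of A* on a 2D grid with walls. The grid encoding (rows of characters with `#` for a wall), the four-directional movement, and the Manhattan-distance heuristic are assumptions for illustration; the actual maze format and baseline implementation may differ.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # (row, column)


def astar(grid: List[str], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Return a shortest path of cells from start to goal, or None if unreachable."""

    def h(cell: Cell) -> int:
        # Manhattan distance never overestimates on a unit-cost 4-connected grid.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start)]          # entries are (f = g + h, g, cell)
    came_from: Dict[Cell, Optional[Cell]] = {start: None}
    best_g: Dict[Cell, int] = {start: 0}

    while open_heap:
        _, g, cur = heapq.heappop(open_heap)
        if g > best_g[cur]:
            continue                             # stale entry, a cheaper route was found
        if cur == goal:
            path = []
            node: Optional[Cell] = cur
            while node is not None:              # walk parent links back to the start
                path.append(node)
                node = came_from[node]
            return path[::-1]
        r, c = cur
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
            if not inside or grid[nr][nc] == "#":
                continue
            ng = g + 1
            if ng < best_g.get((nr, nc), float("inf")):
                best_g[(nr, nc)] = ng
                came_from[(nr, nc)] = cur
                heapq.heappush(open_heap, (ng + h((nr, nc)), ng, (nr, nc)))
    return None


DEMO = [
    "..#.",
    ".##.",
    "....",
]
demo_path = astar(DEMO, (0, 0), (2, 3))
```

On the demo grid the only route runs down the left column and along the bottom row, so `len(demo_path) - 1` is 5 moves.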

The shape of this experiment is inspired by Mario Zechner’s MCP-vs-CLI benchmarks⁴, though his setup is considerably more rigorous than the preliminary one I’m running here.

Four conditions, positioned along the expressiveness/tractability axis:

Structured primitives: look_around() and move(direction). Low expressiveness, high tractability. The agent explores locally and has to accumulate state over time.
Structured planning aids: adds get_current_position() and get_local_map(radius). A bit more expressive. Bookkeeping help without handing over a solver.
Algorithm-shaped interface: exposes neighbors, distance estimates, adjacent moves. The tool surface resembles graph search without naming A*. The hope was that the shape would nudge the model toward search-like behavior.
Open-ended expressive access: read_file(), write_file(), run_bash(). The agent can inspect the maze data, write code, execute it. No maze-specific planning tools at all.

Plus a deterministic A* solver as a reference. All agent conditions ran on gpt-5.2 via Portkey with a 60-second timeout.
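For a sense of what the most constrained of these surfaces looks like, here is a minimal sketch of the two structured primitives, look_around() and move(direction). Only the two tool names come from the experiment; the `MazeTools` class, the direction names, the grid encoding, and the return values are assumptions.

```python
from typing import Dict, List, Tuple


class MazeTools:
    """Hypothetical backing for look_around() and move(direction)."""

    DIRECTIONS = {"north": (-1, 0), "south": (1, 0), "east": (0, 1), "west": (0, -1)}

    def __init__(self, grid: List[str], start: Tuple[int, int], goal: Tuple[int, int]):
        self.grid = grid
        self.pos = start
        self.goal = goal
        self.moves = 0

    def _is_open(self, r: int, c: int) -> bool:
        inside = 0 <= r < len(self.grid) and 0 <= c < len(self.grid[r])
        return inside and self.grid[r][c] != "#"

    def look_around(self) -> Dict[str, str]:
        """Report only the four adjacent cells; the agent never sees the whole grid."""
        r, c = self.pos
        return {
            name: "open" if self._is_open(r + dr, c + dc) else "wall"
            for name, (dr, dc) in self.DIRECTIONS.items()
        }

    def move(self, direction: str) -> str:
        """Take one step if possible and report what happened."""
        if direction not in self.DIRECTIONS:
            return "invalid direction"
        dr, dc = self.DIRECTIONS[direction]
        r, c = self.pos[0] + dr, self.pos[1] + dc
        if not self._is_open(r, c):
            return "blocked"
        self.pos = (r, c)
        self.moves += 1
        return "goal reached" if self.pos == self.goal else "moved"


tools = MazeTools([".#", ".."], start=(0, 0), goal=(1, 1))
first_view = tools.look_around()
```

Every call returns a small, well-typed answer, which is what makes the surface tractable. It also means any map of the maze has to be rebuilt by the agent from a sequence of local views.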

Results

Medium maze (`branching-maze`)

Condition	Result	Time (s)	Moves	Tokens
A* baseline	solved	0.00	28	-
Primitives	timeout	60.02	43	68,195
Planning aids	timeout	60.01	53	106,424
Algorithm-shaped	timeout	60.01	20	77,654
Expressive	solved	15.87	28	7,172

Large maze (`branching-maze-large`)

Condition	Result	Time (s)	Moves	Tokens
A* baseline	solved	0.00	520	-
Primitives	timeout	60.02	56	112,536
Planning aids	timeout	60.02	53	110,010
Algorithm-shaped	timeout	60.01	25	125,421
Expressive	solved	11.27	520	26,833

What the runs looked like

The structured primitives condition did explore: the agent made 43 moves on the medium maze before timing out. It was clearly trying, but couldn’t convert local exploration into a successful global path inside the time budget.

[Figure: A* baseline on the medium branching maze]

[Figure: Structured primitives on the medium branching maze]

The planning aids condition didn’t do better despite having more information. Extra bookkeeping support pushed token usage up without producing a completion.

The algorithm-shaped interface produced more coherent search-like behavior than the primitives, but still couldn’t reconstruct full frontier-based search reliably enough to finish.

[Figure: Algorithm-shaped interface on the medium branching maze]

The expressive condition did something different. Rather than trying to solve the maze step by step through tool calls, the agent read the maze file, wrote a solver script, ran it, and reported the solution. It externalized the search into deterministic code.
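The kind of script this describes might look roughly like the sketch below. It is not the agent’s actual output, which isn’t reproduced here. The maze file format (a text grid with `S`, `G`, and `#`), the function names, and the choice of breadth-first search are all assumptions.

```python
import sys
from collections import deque
from pathlib import Path


def parse_maze(text):
    """Split the grid into rows and locate the S (start) and G (goal) cells."""
    rows = text.splitlines()
    start = goal = None
    for r, row in enumerate(rows):
        for c, ch in enumerate(row):
            if ch == "S":
                start = (r, c)
            elif ch == "G":
                goal = (r, c)
    return rows, start, goal


def solve(rows, start, goal):
    """Breadth-first search; returns a move string over U/D/L/R, or None."""
    steps = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}
    prev = {start: None}              # cell -> (parent cell, move taken to get here)
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            break
        for name, (dr, dc) in steps.items():
            r, c = cur[0] + dr, cur[1] + dc
            inside = 0 <= r < len(rows) and 0 <= c < len(rows[r])
            if inside and rows[r][c] != "#" and (r, c) not in prev:
                prev[(r, c)] = (cur, name)
                queue.append((r, c))
    if goal not in prev:
        return None
    moves = []
    cur = goal
    while prev[cur] is not None:      # walk parent links back to the start
        cur, name = prev[cur]
        moves.append(name)
    return "".join(reversed(moves))


# Only runs when a maze path is passed on the command line.
if __name__ == "__main__" and len(sys.argv) == 2:
    maze_rows, maze_start, maze_goal = parse_maze(Path(sys.argv[1]).read_text())
    print(solve(maze_rows, maze_start, maze_goal))
```

The point is less the specific algorithm than where the search happens: once a script like this exists, the whole exploration runs inside one deterministic program rather than across dozens of model turns.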

[Figure: Open-ended expressive access on the medium branching maze]

The same pattern held on the large maze. The structured conditions timed out. The expressive condition solved it by writing code.

[Figure: A* baseline on the large branching maze]

[Figure: Structured primitives on the large branching maze]

[Figure: Open-ended expressive access on the large branching maze]

The meta-task reading

Pathfinding was one of the examples I used in the previous breadcrumb for a task that isn’t agent-shaped. Local progress does not imply global progress, and without a heuristic there is no guarantee of convergence. An agent stepping through the maze one tool call at a time is doing approximate search on a problem that punishes approximate search. That is what the three structured conditions were asked to do, and all three timed out.
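A toy example makes the “local progress does not imply global progress” point concrete. The sketch below is a deliberately simple model, not a description of how the agent actually behaved: a walker that only ever steps to a neighbor strictly closer to the goal (by Manhattan distance) walks into a dead end, even though a path exists whose first step moves away from the goal. The grid and all names are made up for illustration.

```python
from collections import deque

GRID = [
    ".....",
    ".###.",
    "...#.",
]
START, GOAL = (2, 0), (2, 4)


def neighbors(grid, cell):
    """Yield the open cells adjacent to cell (up, down, left, right)."""
    r, c = cell
    for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[nr]) and grid[nr][nc] != "#":
            yield (nr, nc)


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def greedy_walk(grid, start, goal):
    """Step to the neighbor closest to the goal; stop when no neighbor is closer."""
    path = [start]
    while path[-1] != goal:
        cur = path[-1]
        best = min(neighbors(grid, cur), key=lambda n: manhattan(n, goal), default=None)
        if best is None or manhattan(best, goal) >= manhattan(cur, goal):
            return False, path        # stuck: every available move looks worse
        path.append(best)
    return True, path


def bfs_distance(grid, start, goal):
    """Length of a shortest path found by exhaustive search, or None."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            return dist[cur]
        for n in neighbors(grid, cur):
            if n not in dist:
                dist[n] = dist[cur] + 1
                queue.append(n)
    return None


reached, walked = greedy_walk(GRID, START, GOAL)
shortest = bfs_distance(GRID, START, GOAL)
```

Here the greedy walker takes two steps that each reduce its distance to the goal and then stalls against the wall, while the exhaustive search finds an 8-move route that begins by stepping away from the goal.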

The expressive agent handled a different task. It recognized the problem as pathfinding, wrote a solver, ran it, and read the answer. That meta-task is agent-shaped. Each step is locally verifiable, the sequence is coherent, and each step moves toward a working solver.

If this reading is right, the expressive condition did not make the maze easier. It made a different problem reachable, one the model already knows how to solve because it has seen many pathfinding implementations.

That brings us back to the tradeoff I started with. Tractability is not only about keeping the search space small. It also determines which meta-problems the interface makes available. In the structured conditions, the agent could act inside the task, but it could not step outside it and construct a solver.

Limitations

One of the biggest gaps here is that I only ran two mazes, and both are branching mazes of different sizes. That means they share the structure that makes the meta-task viable: a well-known problem type, a standard algorithmic solution, and a representation an agent can dump to a file and operate on.

Some other limitations:

No real sandboxing for the expressive condition. run_bash is too permissive, and the agent could potentially inspect files outside the run workspace that may bias its approach one way or the other.
Single run per condition. LLM outputs are stochastic, so there is run-to-run variance that a single run per condition can’t capture.
Timeout sensitivity. The failure mode for three of the four conditions is “didn’t finish in 60s.” Different timeout budgets could change the apparent ranking.
The expressive condition changes more than one variable at once. It increases expressiveness, but it also changes the task into a meta-problem of whether the agent notices it should write a solver.
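On the sandboxing point, one partial mitigation is to confine file tools to the run workspace by resolving every requested path and rejecting anything that lands outside it. The sketch below is an assumption about how that could be done for read_file, not a description of the current harness; `resolve_in_workspace` and `is_allowed` are hypothetical helpers. It does nothing for run_bash, which would need OS-level isolation such as a container or a restricted user.

```python
from pathlib import Path


def resolve_in_workspace(workspace: Path, requested: str) -> Path:
    """Resolve requested against workspace and refuse paths that escape it."""
    root = workspace.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    target = (root / requested).resolve()
    if target != root and root not in target.parents:
        raise PermissionError(f"{requested!r} is outside the run workspace")
    return target


def is_allowed(workspace: Path, requested: str) -> bool:
    """Convenience wrapper: True if the path stays inside the workspace."""
    try:
        resolve_in_workspace(workspace, requested)
        return True
    except PermissionError:
        return False


def read_file(workspace: Path, requested: str) -> str:
    """Workspace-confined variant of the read_file tool."""
    return resolve_in_workspace(workspace, requested).read_text()
```

With a workspace of `Path("run_workspace")`, a request for `maze.txt` is allowed, while `../other_run/maze.txt` or an absolute path elsewhere on disk is refused before any file is opened.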

This is a work in progress. I plan to keep building on this same experiment over time, adding mazes, varying conditions, and probing the meta-task hypothesis more directly. I’ll post updates here as the picture fills in.

¹ See the Model Context Protocol documentation.
² Mario Zechner has made this case more forcefully than most. See “What If You Don’t Need MCP?” and his conversation on Syntax.
³ See Anthropic, “Equipping Agents for the Real World with Agent Skills”.
⁴ See Mario Zechner, “MCP vs CLI”.