TuberPress

How does Claude Code *actually* work?

By Paul Allen · 9 min read

Based on video by Theo - t3.gg

Key Takeaways

  • A harness is the set of tools and environment that allows AI models to interact with your computer, executing commands and making file changes beyond just generating text
  • All AI coding tools work through a cycle: the model generates tool calls, the harness executes them, results are added to chat history, and the model continues with updated context
  • Harness quality drives real performance gains: in benchmarks, the same model's accuracy swung from 77% to 93% depending solely on which harness it ran in
  • Building a basic harness requires only about 200 lines of code with three core tools: read files, list files, and edit files
  • The quality of a harness depends heavily on carefully crafted system prompts and tool descriptions that guide model behavior
  • Context management through tool calls is more effective than trying to stuff entire codebases into the model's context window

Understanding AI Harnesses: The Foundation of Modern Coding Tools

Theo explores one of the most crucial yet misunderstood concepts in AI-powered development: harnesses. While terms like "agentic coding" often feel vague and meaningless, harnesses represent a concrete, technical implementation that dramatically impacts the quality of AI-generated code.

Recent benchmarks by Matt Mayer revealed striking performance differences when the same AI models operate within different harnesses. Claude Opus, for example, jumped from 77% accuracy in Claude Code to 93% in Cursor, with the harness being the only variable that changed.

What Exactly Is a Harness?

At its core, a harness is the set of tools and environment in which an AI agent operates. It's the bridge between an AI model's text generation capabilities and real-world computer interactions.

AI models are essentially sophisticated autocomplete systems: they predict the most likely next sequence of characters based on input text. They cannot inherently interact with files, run commands, or make system changes. The harness provides this capability through a structured system called tool calling.

The Tool Calling Mechanism

Tool calling works through a specific syntax system. When a model needs to perform an action, it generates structured text using predetermined tags. For example, to run a bash command, the model might output:

<bash_call>ls -la</bash_call>

Once the model generates this tool call, it stops responding. The harness then:

  1. Parses the tool call syntax
  2. Executes the requested action (with appropriate permissions)
  3. Captures the output
  4. Appends the results to the chat history
  5. Makes a new request to the model to continue

This creates a cycle where the model's "brain" gets paused and restarted after every tool execution, building context incrementally through the chat history.
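The cycle above can be sketched in a few lines of Python. This is a hypothetical minimal harness, not Claude Code's actual implementation: `call_model` stands in for a real API client, and the `<bash_call>` tag format follows the example earlier in the article.

```python
import re
import subprocess

def run_harness(call_model, user_message, max_turns=10):
    """Minimal harness loop: send the history to the model, execute any
    <bash_call> it emits, append the output, and ask the model to continue."""
    history = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        reply = call_model(history)          # model returns plain text
        history.append({"role": "assistant", "content": reply})
        match = re.search(r"<bash_call>(.*?)</bash_call>", reply, re.DOTALL)
        if match is None:                    # no tool call: the model is done
            return reply, history
        result = subprocess.run(
            match.group(1), shell=True, capture_output=True, text=True
        )
        # Feed the command output back so the next model call has it in context
        history.append({
            "role": "user",
            "content": f"<bash_output>{result.stdout}</bash_output>",
        })
    return reply, history
```

Note that the model itself never touches the file system: it only emits text, and the loop does the executing, which is exactly why the harness (not the model) is where permissions and safety checks live.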

The Context Management Revolution

Moving Beyond Large Context Windows

Early AI coding approaches focused on cramming entire codebases into the model's context window. Tools like repo-mix compressed codebases into single XML files, creating what Theo describes as "the worst needle in a haystack problem imaginable."

This approach failed for several reasons:

  • Large context windows make models less accurate (accuracy drops to 50% when context exceeds 50,000-100,000 tokens)
  • It's expensive and computationally intensive
  • Models perform better with targeted, relevant information

Dynamic Context Building

Modern harnesses use a smarter approach. Instead of front-loading all information, they let models build context dynamically through tool calls. When asked "What is this app?", a model will:

  1. Search for relevant files in the directory
  2. Read key files like package.json to understand the project structure
  3. Explore source directories based on initial findings
  4. Build comprehensive understanding through multiple targeted queries

This approach is more efficient and accurate because models can focus on relevant information rather than processing massive amounts of potentially irrelevant code.
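The targeted-query pattern can be illustrated with a hypothetical sketch (the function name and the tiny fake project are assumptions for demonstration): instead of ingesting the whole repository, the harness lists the top level and reads only the files that look informative, such as package.json.

```python
import json
import tempfile
from pathlib import Path

def describe_project(root):
    """Build context through targeted reads rather than ingesting the repo:
    list the top level, then read only the files that look informative."""
    root = Path(root)
    context = {"files": sorted(p.name for p in root.iterdir())}  # step 1: list
    pkg = root / "package.json"
    if pkg.exists():                                             # step 2: read key file
        meta = json.loads(pkg.read_text())
        context["name"] = meta.get("name")
        context["deps"] = sorted(meta.get("dependencies", {}))
    return context

# Tiny fake project to run the sketch against
root = Path(tempfile.mkdtemp())
(root / "package.json").write_text(
    json.dumps({"name": "demo-app", "dependencies": {"react": "^18.0.0"}})
)
(root / "src").mkdir()
print(describe_project(root))
```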

Building a Harness: The Technical Implementation

Core Components

Theo demonstrates that a functional harness requires surprisingly little code: around 200 lines of Python for a basic implementation. The essential components include:

Three Fundamental Tools:

  1. Read File Tool: Allows the model to view file contents
  2. List Files Tool: Enables directory navigation and file discovery
  3. Edit File Tool: Permits code modifications and file creation

System Integration:

  • Tool registry to define available functions
  • System prompt that explains available tools to the model
  • Execution loop that handles tool calls and responses
  • Permission system for potentially destructive operations
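Putting the three tools and the registry together might look like the following sketch. This is a hedged illustration of the general pattern, not Theo's actual 200-line implementation; the function and registry names are assumptions.

```python
from pathlib import Path

def read_file(path):
    """Return the full contents of a file."""
    return Path(path).read_text()

def list_files(path="."):
    """List the entries in a directory, one per line."""
    return "\n".join(sorted(p.name for p in Path(path).iterdir()))

def edit_file(path, content):
    """Overwrite (or create) a file with new content."""
    Path(path).write_text(content)
    return f"wrote {len(content)} bytes to {path}"

# Tool registry: name -> (callable, description shown to the model)
TOOLS = {
    "read_file": (read_file, "Return the full contents of a file."),
    "list_files": (list_files, "List the entries in a directory."),
    "edit_file": (edit_file, "Overwrite a file with new content."),
}

def build_system_prompt():
    """The system prompt enumerates the tools so the model knows what it may call."""
    lines = ["You can call these tools:"]
    for name, (_, desc) in TOOLS.items():
        lines.append(f"- {name}: {desc}")
    return "\n".join(lines)
```

In a real harness, the execution loop would look up the model's requested tool name in `TOOLS`, and a permission check would gate `edit_file` before it runs.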

The Power of Bash

Interestingly, Theo shows that even a single bash tool can replace all other specialized tools. Modern AI models are so well-trained on command-line interactions that they can effectively use bash commands to read files, navigate directories, and make changes, reducing the harness to its absolute minimum.
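A bash-only harness reduces the tool surface to a single function, along the lines of this sketch (the timeout and error format are assumptions; a production harness would also add a permission prompt before execution):

```python
import subprocess

def bash_tool(command, timeout=30):
    """A single bash tool can subsume read/list/edit: the model issues
    `cat`, `ls`, or heredoc writes instead of calling specialized tools."""
    result = subprocess.run(
        ["bash", "-c", command],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout if result.returncode == 0 else f"error: {result.stderr}"
```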

Why Some Harnesses Outperform Others

The Art of Prompt Engineering

The dramatic performance differences between harnesses (like Cursor vs. Claude Code) come down to meticulous optimization of:

  • System prompts: The initial instructions that set behavioral expectations
  • Tool descriptions: How each tool is explained to the model
  • Output formatting: The structure of tool responses
  • Model-specific tuning: Customized prompts for different AI models

Theo demonstrates this by modifying tool descriptions. Simply marking a tool as "deprecated" or suggesting an alternative can dramatically change model behavior. Different models (Claude, GPT, Gemini) respond differently to identical prompts, requiring harness creators to optimize for each model separately.
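The deprecation trick can be shown with a small hypothetical helper (the wording and function name are assumptions, but the mechanism is the one Theo demonstrates: only the description text changes, not the tool itself):

```python
def describe_tool(name, desc, deprecated=False, alternative=None):
    """Render the tool description the model sees; a 'DEPRECATED' marker
    steers the model toward the alternative without removing the tool."""
    if not deprecated:
        return f"{name}: {desc}"
    hint = f" Use {alternative} instead." if alternative else ""
    return f"{name}: DEPRECATED. {desc}{hint}"

print(describe_tool("list_files", "List directory entries.",
                    deprecated=True, alternative="bash"))
```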

The Cursor Advantage

Cursor's superior performance stems from dedicated teams whose job is to continuously optimize these prompts. When new models release, they systematically test thousands of minor prompt variations to maximize performance. This level of optimization explains why using models through Cursor often feels dramatically better than using them directly.

The Broader Ecosystem

Tools vs. Harnesses

Theo clarifies an important distinction using his own product, T3 Code. T3 Code is not a harness; it's a UI layer that connects to existing harnesses like Claude Code and Codeium. When users select a model in T3 Code, they're actually using the corresponding company's harness through a more polished interface.

Industry Standardization

The concept of tool calling has become standardized across major AI providers. OpenAI, Anthropic, Google, and others now support dedicated tool calling APIs, making it easier for developers to build harnesses without dealing with complex parsing of model outputs.
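A standardized tool definition typically looks like the JSON-Schema-style structure below. This follows the shape of Anthropic's tool-use API (tools carry a `name`, `description`, and `input_schema`); other providers use slightly different field names, e.g. OpenAI nests the schema under a `parameters` key.

```python
import json

# A tool definition in the JSON-Schema style used by provider tool-calling APIs.
# With these APIs the model returns structured tool-call objects, so the
# harness no longer has to parse tags like <bash_call> out of raw text.
read_file_tool = {
    "name": "read_file",
    "description": "Return the full contents of a file at the given path.",
    "input_schema": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "Path to the file to read."}
        },
        "required": ["path"],
    },
}

print(json.dumps(read_file_tool, indent=2))
```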

The Future of AI Development Tools

The harness concept reveals that AI coding tools aren't magic; they're sophisticated but comprehensible systems built on well-understood principles. This democratization of knowledge means that:

  • Developers can build custom harnesses for specific use cases
  • The barrier to entry for AI tool creation is much lower than it appears
  • Competition will likely focus on optimization and user experience rather than fundamental technology barriers

As AI models continue improving, the quality of harnesses becomes increasingly important. The difference between a mediocre and exceptional AI coding experience often comes down to how well the harness guides model behavior through carefully crafted prompts and tool designs.

Our Analysis

While Theo's explanation captures the technical mechanics of harnesses, several critical limitations and market dynamics deserve deeper examination. Enterprise adoption remains constrained by security concerns: many organizations block tool-calling capabilities entirely, forcing teams to rely on less effective copy-paste workflows that negate the advantages of a harness.

The benchmark methodology cited also reveals significant blind spots. Matt Mayer's tests focused primarily on algorithmic coding challenges, but real-world development involves complex debugging, legacy system integration, and architectural decisions where harness performance varies dramatically. GitHub Copilot Workspace, launched in 2024, demonstrates superior performance on refactoring tasks through its specialized harness design, while Claude Code excels at greenfield development, a distinction the benchmarks miss.

Cost considerations present another overlooked challenge. Dynamic context building through multiple tool calls can generate 3-5x more API costs compared to traditional approaches. For teams processing large codebases daily, this translates to hundreds of dollars in monthly overhead that smaller companies cannot sustain.

The security implications of file system access also remain underexplored. Unlike sandboxed environments, harnesses operate with direct system permissions, creating potential attack vectors through prompt injection. Recent incidents at companies like Anthropic have highlighted how malicious code suggestions can exploit harness permissions to exfiltrate sensitive data.

Looking at competitive positioning, tools like Cursor's Composer and Replit's Agent are rapidly advancing harness capabilities beyond basic file operations. Cursor's 2025 release includes database integration and deployment tools, while Replit focuses on multi-service orchestration, capabilities that make simple three-tool harnesses seem increasingly limited.

The historical parallel to IDE evolution is telling: just as Visual Studio Code displaced simpler text editors through extensibility, the next generation of AI coding tools will likely be defined by harness sophistication rather than underlying model capabilities.

Frequently Asked Questions

Q: Why can't I just give the AI model access to my entire codebase at once?

While it might seem logical to provide complete context upfront, this approach actually makes models less accurate. Research shows that when context exceeds 50,000-100,000 tokens, model accuracy can drop to 50% of its original performance. Large context creates a "needle in a haystack" problem where models struggle to identify relevant information. Dynamic context building through tool calls is more efficient and accurate.

Q: How does tool calling actually work under the hood?

Tool calling works through structured text generation. The model generates specially formatted text (like XML tags or JSON) that the harness recognizes as commands. The model then stops responding, the harness executes the command, captures the output, and adds it to the chat history before requesting the model to continue. This creates a cycle of model reasoning, tool execution, and context building.

Q: What makes Cursor's harness better than others like Claude Code?

Cursor invests heavily in prompt optimization, with dedicated teams that continuously test and refine system prompts for each AI model. They customize tool descriptions, output formats, and behavioral guidelines to maximize performance. This level of optimization explains why the same AI model often performs significantly better through Cursor than through the model provider's own interface.

Q: Can I build my own harness, and how complex would it be?

Building a basic harness requires surprisingly little code: around 200 lines of Python for core functionality. You need three main components: tools for reading files, listing directories, and editing files, plus a system to handle the tool calling cycle. However, the real complexity lies in optimizing prompts and tool descriptions for reliable model behavior across different scenarios.

Products Mentioned

  • T3 Code: Theo's AI coding interface that provides a UI layer over existing harnesses like Claude Code and Codeium
  • Claude Code: Anthropic's AI coding harness that provides tools for file manipulation and code generation
  • Cursor: Popular AI coding editor with highly optimized harness implementation
  • Codeium: AI coding assistant with its own harness for code interaction
  • Macroscope: AI code reviewer and team insights platform that provides development analytics and Slack integration
  • repo-mix: Legacy tool (now largely obsolete) that compressed entire codebases into single XML files for AI processing
