The Problem with Tool Calling

It turns out we've all been using tool-calling protocols wrong. Most AI agents today expose tools directly to LLMs using special tokens and JSON schemas. But there's a better way: let the LLM write code that calls an API.

This article explores CodeMode UTCP - an implementation of this approach for the Universal Tool Calling Protocol that uses the Yaegi Go interpreter to safely execute LLM-generated code.


What's Wrong with Traditional Tool Calling?

Under the hood, when an LLM uses a "tool," it generates special tokens that don't represent actual text. The LLM is trained to output these tokens in a specific format:

I will check the weather in Austin, TX.
<|tool_call|> 
{ 
  "name": "get_weather", 
  "arguments": { "location": "Austin, TX" } 
} 
<|end_tool_call|>

The harness intercepts these special tokens, parses the JSON, calls the tool, and feeds the result back:

<|tool_result|> 
{ 
  "temperature": 93, 
  "conditions": "sunny" 
} 
<|end_tool_result|>

The Core Issue

These special tokens are things LLMs have never seen in the wild; models have to be specifically trained on synthetic data to produce them. As a result:

  • Limited capability - LLMs struggle with complex tools or too many options
  • Simplified APIs - Tool designers must "dumb down" their interfaces
  • Poor reliability - The LLM may choose the wrong tool or use it incorrectly

Meanwhile, LLMs are excellent at writing code. They've seen millions of real-world code examples from open source projects. Writing code and calling tools are almost the same thing - so why can LLMs do one much better than the other?

Making an LLM perform tasks with tool calling is like putting Shakespeare through a month-long class in Mandarin and then asking him to write a play in it. It's just not going to be his best work.


The Code Mode Solution

Instead of asking the LLM to use special tokens, we:

  1. Convert tools into a code API - Generate type-safe function signatures from tool schemas
  2. Ask the LLM to write code - Let it use its natural code-writing abilities
  3. Execute in a sandbox - Run the generated code safely with access only to the tool API

The results are striking. LLMs can now:

  • Handle complex, multi-step workflows naturally
  • Use full-featured APIs without simplification
  • Implement proper error handling and retry logic
  • Chain tool calls with conditional logic

CodeMode UTCP Architecture

Our Go implementation consists of two main components:

1. The Orchestrator (orchestrator.go)

An LLM-driven pipeline that:

  • Determines if tools are needed for a query
  • Selects relevant tools from available UTCP tools
  • Generates Go code using only selected tools
  • Validates generated code against strict rules

2. The Execution Engine (codemode.go)

A sandboxed runtime that:

  • Uses Yaegi Go interpreter for safe execution
  • Injects helper functions for UTCP tool access
  • Normalizes LLM output into valid Go programs
  • Enforces timeouts and captures output

The Four-Step Pipeline

When a user makes a request, CodeMode UTCP executes four distinct steps:

Step 1: Decide if Tools Are Needed

First, we ask the LLM whether the query requires external tools at all:

func (cm *CodeModeUTCP) decideIfToolsNeeded(
    ctx context.Context,
    query string,
    tools string,
) (bool, error) {
    prompt := fmt.Sprintf(`
Decide if the following user query requires using ANY UTCP tools.

USER QUERY: %q
AVAILABLE UTCP TOOLS: %s

Respond ONLY in JSON: { "needs": true } or { "needs": false }
`, query, tools)
    
    raw, err := cm.model.Generate(ctx, prompt)
    // ... parse JSON response
}

This prevents unnecessary tool calls for simple queries like "What is 2+2?"
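The elided JSON parsing is standard encoding/json work. A minimal sketch of it (the helper name and struct here are illustrative, not lifted from the repository):

// parseNeeds decodes the model's {"needs": true|false} reply.
// Illustrative sketch; the repository's parsing may differ, e.g. it may
// also strip markdown code fences from the raw response first.
func parseNeeds(raw string) (bool, error) {
    var resp struct {
        Needs bool `json:"needs"`
    }
    if err := json.Unmarshal([]byte(strings.TrimSpace(raw)), &resp); err != nil {
        return false, fmt.Errorf("parse decide response: %w", err)
    }
    return resp.Needs, nil
}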

Step 2: Select Relevant Tools

Next, we identify which specific tools are needed:

func (cm *CodeModeUTCP) selectTools(
    ctx context.Context,
    query string,
    tools string,
) ([]string, error) {
    prompt := fmt.Sprintf(`
Select ALL UTCP tools that match the user's intent.

USER QUERY: %q
AVAILABLE UTCP TOOLS: %s

Respond ONLY in JSON:
{
  "tools": ["provider.tool", ...]
}

Rules:
- Use ONLY names listed above
- NO modifications, NO guessing
- If multiple tools apply, include all
`, query, tools)
    
    raw, err := cm.model.Generate(ctx, prompt)
    // Returns: ["math.add", "math.multiply"]
}

This narrows the context to only relevant tools, improving code generation quality.
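Even with the prompt rules above, models occasionally invent tool names, so it pays to filter the parsed list against the tools that were actually advertised. A small sketch of that post-processing (the helper is illustrative, not taken from the repository):

// filterSelectedTools keeps only names that exist in the advertised tool
// set, silently dropping anything the model invented. Illustrative sketch.
func filterSelectedTools(selected []string, available map[string]bool) []string {
    valid := make([]string, 0, len(selected))
    for _, name := range selected {
        if available[name] {
            valid = append(valid, name)
        }
    }
    return valid
}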

Step 3: Generate Go Code

Now comes the magic - the LLM writes actual Go code:

// Example generated code for: "Get sum of 5 and 7, then multiply by 3"

r1, err := codemode.CallTool("math.add", map[string]any{
    "a": 5,
    "b": 7,
})
if err != nil { return err }

var sum any
if m, ok := r1.(map[string]any); ok {
    sum = m["result"]
}

r2, err := codemode.CallTool("math.multiply", map[string]any{
    "a": sum,
    "b": 3,
})
if err != nil { return err }

__out = map[string]any{
    "sum": sum,
    "product": r2,
}

Key constraints enforced (see the validation sketch after this list):

  • Use only selected tool names (no inventing tools)
  • Use exact input/output schema keys from tool specs
  • No package/import declarations (added automatically)
  • Assign final result to __out variable
  • Use only provided helper functions
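A validation pass over these constraints can be as simple as a few textual checks before anything reaches the interpreter. A simplified sketch (the actual rules in orchestrator.go are stricter, and the helper name here is illustrative):

// validateGeneratedCode applies cheap textual checks before execution.
// Simplified sketch; the real validator enforces stricter rules.
func validateGeneratedCode(code string, selectedTools []string) error {
    if !strings.Contains(code, "__out") {
        return fmt.Errorf("generated code never assigns __out")
    }

    // Every CallTool target must be one of the selected tools.
    re := regexp.MustCompile(`codemode\.CallTool\(\s*"([^"]+)"`)
    for _, m := range re.FindAllStringSubmatch(code, -1) {
        allowed := false
        for _, name := range selectedTools {
            if m[1] == name {
                allowed = true
                break
            }
        }
        if !allowed {
            return fmt.Errorf("generated code calls unselected tool %q", m[1])
        }
    }
    return nil
}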

Step 4: Execute in Sandbox

Finally, we execute the code safely:

func (c *CodeModeUTCP) Execute(
    ctx context.Context,
    args CodeModeArgs,
) (CodeModeResult, error) {
    // Enforce timeout
    ctx, cancel := context.WithTimeout(ctx, 
        time.Duration(args.Timeout)*time.Millisecond)
    defer cancel()
    
    i, stdout, stderr := newInterpreter()
    
    // Inject UTCP helpers
    injectHelpers(i, c.client)
    
    // Wrap and prepare code
    wrapped := c.prepareWrappedProgram(args.Code)
    
    // Run in goroutine with panic recovery
    done := make(chan evalResult, 1)
    go func() {
        defer func() {
            if r := recover(); r != nil {
                done <- evalResult{err: fmt.Errorf("panic: %v", r)}
            }
        }()
        
        v, err := i.Eval(wrapped)
        done <- evalResult{val: v, err: err}
    }()
    
    // Wait for completion or timeout
    select {
    case <-ctx.Done():
        return CodeModeResult{}, fmt.Errorf("timeout")
    case res := <-done:
        result := CodeModeResult{
            Stdout: stdout.String(),
            Stderr: stderr.String(),
        }
        // Guard against a zero reflect.Value (e.g. after a recovered panic).
        if res.val.IsValid() {
            result.Value = res.val.Interface()
        }
        return result, res.err
    }
}
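The newInterpreter helper referenced above wires Yaegi's output streams into in-memory buffers so stdout and stderr can be captured and returned. Roughly (a sketch; the repository's version may set additional options):

// newInterpreter builds a Yaegi interpreter whose stdout/stderr are
// captured in buffers. Sketch of the idea; the real helper may differ.
func newInterpreter() (*interp.Interpreter, *bytes.Buffer, *bytes.Buffer) {
    var stdout, stderr bytes.Buffer
    i := interp.New(interp.Options{
        Stdout: &stdout,
        Stderr: &stderr,
    })
    return i, &stdout, &stderr
}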

Making LLM Output Executable

LLMs don't always generate perfect Go code. We apply several normalization steps:

1. Package/Import Stripping

LLMs often include package main or import statements. We strip these since the wrapper adds them automatically.
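A line-based filter is usually enough here. For example (a sketch, not the exact implementation):

// stripPackageAndImports drops package clauses and single-line imports so
// the snippet can be embedded in the wrapper program. Sketch only; grouped
// import ( ... ) blocks need slightly more handling.
func stripPackageAndImports(code string) string {
    var kept []string
    for _, line := range strings.Split(code, "\n") {
        trimmed := strings.TrimSpace(line)
        if strings.HasPrefix(trimmed, "package ") || strings.HasPrefix(trimmed, "import ") {
            continue
        }
        kept = append(kept, line)
    }
    return strings.Join(kept, "\n")
}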

2. Walrus Operator Conversion

func convertOutWalrus(code string) string {
    // Converts: __out := ... 
    // To:       __out = ...
    re := regexp.MustCompile(`__out\s*:=`)
    return re.ReplaceAllString(code, "__out = ")
}

The __out variable is pre-declared, so := would cause a redeclaration error.

3. JSON to Go Literal Conversion

LLMs sometimes output JSON objects instead of Go map literals:

func toGoLiteral(v any) string {
    switch val := v.(type) {
    case map[string]any:
        parts := make([]string, 0, len(val))
        for k, v2 := range val {
            parts = append(parts,
                fmt.Sprintf("%q: %s", k, toGoLiteral(v2)))
        }
        return fmt.Sprintf("map[string]any{%s}",
            strings.Join(parts, ", "))
    case []any:
        items := make([]string, len(val))
        for i := range val {
            items[i] = toGoLiteral(val[i])
        }
        return fmt.Sprintf("[]any{%s}", strings.Join(items, ", "))
    case string:
        return fmt.Sprintf("%q", val)
    case nil:
        return "nil"
    default:
        // Numbers, booleans, and other scalars render as valid Go literals via %v.
        return fmt.Sprintf("%v", val)
    }
}
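For instance, an argument object decoded from a JSON reply becomes a literal that can be spliced straight into the generated program:

args := map[string]any{
    "query": "Python tutorials",
    "limit": 3,
}
fmt.Println(toGoLiteral(args))
// Output (map key order may vary):
// map[string]any{"query": "Python tutorials", "limit": 3}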

Injecting the Tool API

The sandboxed environment needs access to UTCP tools. We inject helper functions using Yaegi's reflection-based exports:

func injectHelpers(i *interp.Interpreter, client utcp.UtcpClientInterface) error {
    i.Use(stdlib.Symbols) // Load Go standard library
    
    exports := interp.Exports{
        "codemode_helpers/codemode_helpers": map[string]reflect.Value{
            "CallTool": reflect.ValueOf(func(name string, args map[string]any) (any, error) {
                return client.CallTool(context.Background(), name, args)
            }),
            
            "CallToolStream": reflect.ValueOf(func(name string, args map[string]any) (*codeModeStream, error) {
                stream, err := client.CallToolStream(context.Background(), name, args)
                if err != nil {
                    return nil, err
                }
                return &codeModeStream{next: stream.Next}, nil
            }),
            
            "SearchTools": reflect.ValueOf(func(query string, limit int) ([]tools.Tool, error) {
                return client.SearchTools(query, limit)
            }),
            
            "Sprintf": reflect.ValueOf(fmt.Sprintf),
            "Errorf": reflect.ValueOf(fmt.Errorf),
        },
    }
    
    return i.Use(exports)
}

These functions become available in generated code as codemode.CallTool(), codemode.Sprintf(), etc.
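That codemode prefix is the wrapper's doing: prepareWrappedProgram presumably imports the injected helper package under that alias and pre-declares __out around the spliced snippet. The wrapped program has roughly this shape (an illustrative sketch only; the import path follows from the export key above, and details such as how __out is handed back may differ from the actual template):

package main

import (
    "fmt"

    codemode "codemode_helpers" // the alias gives generated code its codemode.* prefix
)

// Package-level slot this sketch uses to hand the result back; the real
// wrapper may plumb __out out differently.
var result any

func run() error {
    var __out any // pre-declared, which is why `__out :=` gets rewritten to `__out =`

    // --- normalized, validated LLM-generated code is spliced in here ---
    r1, err := codemode.CallTool("math.add", map[string]any{"a": 5, "b": 7})
    if err != nil { return err }
    __out = r1
    // --- end of spliced code ---

    result = __out
    return nil
}

func main() {
    if err := run(); err != nil {
        fmt.Println("error:", err)
        return
    }
    fmt.Println(result)
}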


Security: Sandboxing LLM-Generated Code

Running code written by an LLM requires careful isolation:

1. Yaegi Interpreter

  • No filesystem access - Unless explicitly provided via helpers
  • No network access - Isolated from the Internet
  • No process spawning - Cannot execute external commands
  • Isolated namespace - Runs in same process but separate context

2. Timeout Enforcement

Default 30-second timeout prevents infinite loops and runaway execution.

3. Panic Recovery

Interpreter panics are caught and returned as structured errors, preventing crashes.

4. Validation Rules

Basic syntax validation ensures code contains required elements before execution.


Real-World Example

Let's walk through a complete workflow:

User Query: "Search for Python tutorials and summarize the top 3 results"

Step 1: Decide Tools Needed

Result: { "needs": true }

Step 2: Select Tools

Result: { "tools": ["search.web", "text.summarize"] }

Step 3: Generate Code

// Search for tutorials
searchResult, err := codemode.CallTool("search.web", map[string]any{
    "query": "Python tutorials",
    "limit": 3,
})
if err != nil { return err }

var results []any
if m, ok := searchResult.(map[string]any); ok {
    if r, ok := m["results"].([]any); ok {
        results = r
    }
}

// Extract top 3 and summarize
var summaries []any
for i := 0; i < 3 && i < len(results); i++ {
    item := results[i]
    
    var text string
    if m, ok := item.(map[string]any); ok {
        if t, ok := m["content"].(string); ok {
            text = t
        }
    }
    
    summary, err := codemode.CallTool("text.summarize", map[string]any{
        "text": text,
        "max_length": 100,
    })
    if err != nil { continue }
    
    summaries = append(summaries, summary)
}

__out = map[string]any{
    "query": "Python tutorials",
    "summaries": summaries,
}

Step 4: Execute

The code runs in the Yaegi sandbox, calls the actual UTCP tools, and returns structured results—all in one execution.


Why This Works Better

CodeMode UTCP demonstrates that code is a better interface for tool orchestration than special tokens. By leveraging LLMs' natural code-writing abilities, we achieve:

  • Lower latency - One execution vs. multiple round-trips
  • Better composability - Express complex workflows naturally
  • Robust error handling - Standard Go error patterns
  • Full API access - No need to simplify tool interfaces
  • Higher reliability - LLMs excel at writing code
  • Debuggability - Inspect generated code and execution logs

The key insight: LLMs have seen millions of code examples but only synthetic tool-calling data. We should play to their strengths.


Performance Characteristics

Latency Breakdown

  • LLM calls: 3 sequential calls (decide, select, generate) ≈ 2-5 seconds
  • Code execution: Typically <100ms for simple workflows
  • Tool calls: Depends on tool implementation

Optimization Opportunities

  • Parallel LLM calls - Decide + Select could run concurrently (see the sketch after this list)
  • Caching - Cache tool specs to reduce prompt size
  • Streaming - Stream code generation for faster perceived latency
  • Compiled mode - Pre-compile common patterns
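For instance, the decide and select prompts don't depend on each other's output, so they could be issued concurrently and the selection simply discarded when no tools are needed. A rough sketch of that idea (not part of the current implementation):

// decideAndSelectConcurrently issues the decide and select prompts in
// parallel instead of back to back. Purely illustrative; this is not part
// of the current implementation.
func (cm *CodeModeUTCP) decideAndSelectConcurrently(
    ctx context.Context,
    query string,
    tools string,
) (bool, []string, error) {
    var (
        wg         sync.WaitGroup
        needs      bool
        selected   []string
        errA, errB error
    )

    wg.Add(2)
    go func() {
        defer wg.Done()
        needs, errA = cm.decideIfToolsNeeded(ctx, query, tools)
    }()
    go func() {
        defer wg.Done()
        selected, errB = cm.selectTools(ctx, query, tools)
    }()
    wg.Wait()

    if errA != nil {
        return false, nil, errA
    }
    if !needs {
        // No tools needed; discard whatever the selection call returned.
        return false, nil, nil
    }
    return true, selected, errB
}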

Limitations and Trade-offs

Current Limitations

  • Single-threaded execution - Yaegi runs code sequentially
  • No filesystem access - Unless explicitly provided via helpers
  • LLM quality dependency - Bad code generation = runtime errors
  • Debugging complexity - Stack traces from interpreted code can be cryptic

Design Trade-offs

  • Safety vs. Flexibility - Sandboxing limits what code can do
  • Simplicity vs. Power - Go DSL is more constrained than Python
  • Latency vs. Reliability - Multiple LLM calls increase latency but improve correctness

Try It Yourself

CodeMode UTCP is part of the go-utcp repository:

git clone https://github.com/universal-tool-calling-protocol/go-utcp
cd go-utcp/src/plugins/codemode
go test -v

Check out the README for more examples and API documentation.


Related Work

CodeMode UTCP builds on the broader idea, explored across the tool-calling ecosystem, that LLMs orchestrate tools more reliably by writing code against an API than by emitting special tool-call tokens.

Special thanks to the UTCP community for feedback and contributions.