The Problem with Tool Calling
It turns out we've all been using tool-calling protocols wrong. Most AI agents today expose tools directly to LLMs using special tokens and JSON schemas. But there's a better way: let the LLM write code that calls an API.
This article explores CodeMode UTCP - an implementation of this approach for the Universal Tool Calling Protocol that uses the Yaegi Go interpreter to safely execute LLM-generated code.
What's Wrong with Traditional Tool Calling?
Under the hood, when an LLM uses a "tool," it generates special tokens that don't represent actual text. The LLM is trained to output these tokens in a specific format:
I will check the weather in Austin, TX.
<|tool_call|>
{
"name": "get_weather",
"arguments": { "location": "Austin, TX" }
}
<|end_tool_call|>
The harness intercepts these special tokens, parses the JSON, calls the tool, and feeds the result back:
<|tool_result|>
{
"temperature": 93,
"conditions": "sunny"
}
<|end_tool_result|>
The Core Issue
These special tokens are something LLMs have never seen in the wild; models have to be specifically trained on synthetic data to emit them. As a result:
- Limited capability - LLMs struggle with complex tools or too many options
- Simplified APIs - Tool designers must "dumb down" their interfaces
- Poor reliability - The LLM may choose the wrong tool or use it incorrectly
Meanwhile, LLMs are excellent at writing code. They've seen millions of real-world code examples from open source projects. Writing code and calling tools are almost the same thing - so why can LLMs do one much better than the other?
Making an LLM perform tasks with tool calling is like putting Shakespeare through a month-long class in Mandarin and then asking him to write a play in it. It's just not going to be his best work.
The Code Mode Solution
Instead of asking the LLM to use special tokens, we:
- Convert tools into a code API - Generate type-safe function signatures from tool schemas (see the sketch after this list)
- Ask the LLM to write code - Let it use its natural code-writing abilities
- Execute in a sandbox - Run the generated code safely with access only to the tool API
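To make the first point concrete, here is what a per-tool, type-safe wrapper generated from the get_weather schema in the earlier example might look like. This is illustrative only and assumes context and errors imports; the Go implementation described below exposes a generic codemode.CallTool(name, args) helper rather than per-tool wrappers.
// Illustrative only: a wrapper a code-API generator could emit for the
// get_weather tool from the earlier example, derived from its JSON schema.
type GetWeatherInput struct {
	Location string `json:"location"`
}

type GetWeatherOutput struct {
	Temperature float64 `json:"temperature"`
	Conditions  string  `json:"conditions"`
}

func GetWeather(ctx context.Context, in GetWeatherInput) (GetWeatherOutput, error) {
	// A real generator would emit a body that forwards the call to the tool runtime;
	// the LLM only needs to see and call the signature.
	return GetWeatherOutput{}, errors.New("not implemented in this sketch")
}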
The results are striking. LLMs can now:
- Handle complex, multi-step workflows naturally
- Use full-featured APIs without simplification
- Implement proper error handling and retry logic
- Chain tool calls with conditional logic
CodeMode UTCP Architecture
Our Go implementation consists of two main components:
1. The Orchestrator (orchestrator.go)
An LLM-driven pipeline that:
- Determines if tools are needed for a query
- Selects relevant tools from available UTCP tools
- Generates Go code using only selected tools
- Validates generated code against strict rules
2. The Execution Engine (codemode.go)
A sandboxed runtime that:
- Uses Yaegi Go interpreter for safe execution
- Injects helper functions for UTCP tool access
- Normalizes LLM output into valid Go programs
- Enforces timeouts and captures output
The Four-Step Pipeline
When a user makes a request, CodeMode UTCP executes four distinct steps:
Step 1: Decide if Tools Are Needed
First, we ask the LLM whether the query requires external tools at all:
func (cm *CodeModeUTCP) decideIfToolsNeeded(
ctx context.Context,
query string,
tools string,
) (bool, error) {
prompt := fmt.Sprintf(`
Decide if the following user query requires using ANY UTCP tools.
USER QUERY: %q
AVAILABLE UTCP TOOLS: %s
Respond ONLY in JSON: { "needs": true } or { "needs": false }
`, query, tools)
raw, err := cm.model.Generate(ctx, prompt)
// ... parse JSON response
}
This prevents unnecessary tool calls for simple queries like "What is 2+2?"
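The elided parsing step is mechanical. A minimal sketch of it, as a hypothetical parseNeeds helper (assuming encoding/json, fmt, and strings imports; the actual parsing in orchestrator.go may differ, for example by stripping code fences the model sometimes wraps around JSON):
// parseNeeds extracts the {"needs": true/false} decision from the model's reply.
// Hypothetical sketch; not the actual implementation.
func parseNeeds(raw string) (bool, error) {
	var decision struct {
		Needs bool `json:"needs"`
	}
	if err := json.Unmarshal([]byte(strings.TrimSpace(raw)), &decision); err != nil {
		return false, fmt.Errorf("invalid decide response %q: %w", raw, err)
	}
	return decision.Needs, nil
}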
Step 2: Select Relevant Tools
Next, we identify which specific tools are needed:
func (cm *CodeModeUTCP) selectTools(
ctx context.Context,
query string,
tools string,
) ([]string, error) {
prompt := fmt.Sprintf(`
Select ALL UTCP tools that match the user's intent.
USER QUERY: %q
AVAILABLE UTCP TOOLS: %s
Respond ONLY in JSON:
{
"tools": ["provider.tool", ...]
}
Rules:
- Use ONLY names listed above
- NO modifications, NO guessing
- If multiple tools apply, include all
`, query, tools)
raw, err := cm.model.Generate(ctx, prompt)
// Returns: ["math.add", "math.multiply"]
}
This narrows the context to only relevant tools, improving code generation quality.
Step 3: Generate Go Code
Now comes the magic - the LLM writes actual Go code:
// Example generated code for: "Get sum of 5 and 7, then multiply by 3"
r1, err := codemode.CallTool("math.add", map[string]any{
"a": 5,
"b": 7,
})
if err != nil { return err }
var sum any
if m, ok := r1.(map[string]any); ok {
sum = m["result"]
}
r2, err := codemode.CallTool("math.multiply", map[string]any{
"a": sum,
"b": 3,
})
if err != nil { return err }
__out = map[string]any{
"sum": sum,
"product": r2,
}
Key constraints enforced:
- Use only selected tool names (no inventing tools)
- Use exact input/output schema keys from tool specs
- No package/import declarations (added automatically)
- Assign the final result to the __out variable
- Use only provided helper functions
Step 4: Execute in Sandbox
Finally, we execute the code safely:
func (c *CodeModeUTCP) Execute(
ctx context.Context,
args CodeModeArgs,
) (CodeModeResult, error) {
// Enforce timeout
ctx, cancel := context.WithTimeout(ctx,
time.Duration(args.Timeout)*time.Millisecond)
defer cancel()
i, stdout, stderr := newInterpreter()
// Inject UTCP helpers
injectHelpers(i, c.client)
// Wrap and prepare code
wrapped := c.prepareWrappedProgram(args.Code)
// Run in goroutine with panic recovery
done := make(chan evalResult, 1)
go func() {
defer func() {
if r := recover(); r != nil {
done <- evalResult{err: fmt.Errorf("panic: %v", r)}
}
}()
v, err := i.Eval(wrapped)
done <- evalResult{val: v, err: err}
}()
// Wait for completion or timeout
select {
case <-ctx.Done():
return CodeModeResult{}, fmt.Errorf("timeout")
case res := <-done:
return CodeModeResult{
Value: res.val.Interface(),
Stdout: stdout.String(),
Stderr: stderr.String(),
}, res.err
}
}
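For reference, the argument and result types used by Execute have roughly this shape, with field types inferred from how the snippet uses them (and assuming a reflect import); the authoritative definitions live in codemode.go.
// Shapes inferred from the Execute snippet above; see codemode.go for the real definitions.
type CodeModeArgs struct {
	Code    string // Go snippet produced by the orchestrator
	Timeout int    // execution timeout in milliseconds
}

type CodeModeResult struct {
	Value  any    // whatever the generated code assigned to __out
	Stdout string // captured standard output
	Stderr string // captured standard error
}

type evalResult struct {
	val reflect.Value // value returned by the Yaegi Eval call
	err error
}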
Making LLM Output Executable
LLMs don't always generate perfect Go code. We apply several normalization steps:
1. Package/Import Stripping
LLMs often include package main or import statements. We strip these since
the wrapper adds them automatically.
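For intuition, a stripped-down wrapper might look roughly like this; the actual prepareWrappedProgram in codemode.go may differ in its details:
// Rough sketch of the program the engine could build around the normalized snippet.
// The wrapper supplies the package and import lines (so any emitted by the LLM are
// stripped) and pre-declares __out (so ":=" assignments to it are rewritten to "=").
package main

import (
	codemode "codemode_helpers" // helpers exported into the interpreter (see below)
)

var __out any

func __run() error {
	_ = codemode.CallTool // placeholder; the normalized LLM snippet is spliced in here
	return nil
}
The engine then evaluates the wrapped program with Yaegi and reads __out back as the result value.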
2. Walrus Operator Conversion
func convertOutWalrus(code string) string {
// Converts: __out := ...
// To: __out = ...
re := regexp.MustCompile(`__out\s*:=`)
return re.ReplaceAllString(code, "__out = ")
}
The __out variable is pre-declared, so := would cause a redeclaration
error.
3. JSON to Go Literal Conversion
LLMs sometimes output JSON objects instead of Go map literals:
func toGoLiteral(v any) string {
switch val := v.(type) {
case map[string]any:
parts := make([]string, 0, len(val))
for k, v2 := range val {
parts = append(parts,
fmt.Sprintf("%q: %s", k, toGoLiteral(v2)))
}
return fmt.Sprintf("map[string]any{%s,}",
strings.Join(parts, ", "))
case []any:
items := make([]string, len(val))
for i := range val {
items[i] = toGoLiteral(val[i])
}
return fmt.Sprintf("[]any{%s}", strings.Join(items, ", "))
case string:
return fmt.Sprintf("%q", val)
// ... other cases
}
}
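For example, a decoded JSON argument renders back into Go source like this:
// Rendering a decoded JSON value back into Go source.
fmt.Println(toGoLiteral(map[string]any{
	"tags": []any{"beginner", "free"},
}))
// Output: map[string]any{"tags": []any{"beginner", "free"}}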
Injecting the Tool API
The sandboxed environment needs access to UTCP tools. We inject helper functions using Yaegi's reflection-based exports:
func injectHelpers(i *interp.Interpreter, client utcp.UtcpClientInterface) error {
i.Use(stdlib.Symbols) // Load Go standard library
exports := interp.Exports{
"codemode_helpers/codemode_helpers": map[string]reflect.Value{
"CallTool": reflect.ValueOf(func(name string, args map[string]any) (any, error) {
return client.CallTool(context.Background(), name, args)
}),
"CallToolStream": reflect.ValueOf(func(name string, args map[string]any) (*codeModeStream, error) {
stream, err := client.CallToolStream(context.Background(), name, args)
if err != nil {
return nil, err
}
return &codeModeStream{next: stream.Next}, nil
}),
"SearchTools": reflect.ValueOf(func(query string, limit int) ([]tools.Tool, error) {
return client.SearchTools(query, limit)
}),
"Sprintf": reflect.ValueOf(fmt.Sprintf),
"Errorf": reflect.ValueOf(fmt.Errorf),
},
}
return i.Use(exports)
}
These functions become available in generated code as codemode.CallTool(),
codemode.Sprintf(), etc.
Security: Sandboxing LLM-Generated Code
Running code written by an LLM requires careful isolation:
1. Yaegi Interpreter
- No filesystem access - Unless explicitly provided via helpers
- No network access - Isolated from the Internet
- No process spawning - Cannot execute external commands
- Isolated namespace - Runs in the same process but in a separate interpreter context
2. Timeout Enforcement
Default 30-second timeout prevents infinite loops and runaway execution.
3. Panic Recovery
Interpreter panics are caught and returned as structured errors, preventing crashes.
4. Validation Rules
Basic syntax validation ensures code contains required elements before execution.
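The article does not enumerate the exact rules, but a cheap pre-flight check along these lines catches the most common failure modes. This is a hypothetical validateGenerated helper (assuming errors, fmt, and strings imports); the real validator may differ:
// validateGenerated runs cheap textual checks before handing code to Yaegi.
// Hypothetical sketch; the actual validation rules may differ.
func validateGenerated(code string) error {
	if !strings.Contains(code, "__out") {
		return errors.New("generated code never assigns __out")
	}
	for _, banned := range []string{"os/exec", "net/http", "syscall"} {
		if strings.Contains(code, banned) {
			return fmt.Errorf("generated code references disallowed package %q", banned)
		}
	}
	return nil
}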
Real-World Example
Let's walk through a complete workflow:
User Query: "Search for Python tutorials and summarize the top 3 results"
Step 1: Decide Tools Needed
→ Result: { "needs": true }
Step 2: Select Tools
→ Result: { "tools": ["search.web", "text.summarize"] }
Step 3: Generate Code
// Search for tutorials
searchResult, err := codemode.CallTool("search.web", map[string]any{
"query": "Python tutorials",
"limit": 3,
})
if err != nil { return err }
var results []any
if m, ok := searchResult.(map[string]any); ok {
if r, ok := m["results"].([]any); ok {
results = r
}
}
// Extract top 3 and summarize
var summaries []any
for i := 0; i < 3 && i < len(results); i++ {
item := results[i]
var text string
if m, ok := item.(map[string]any); ok {
if t, ok := m["content"].(string); ok {
text = t
}
}
summary, err := codemode.CallTool("text.summarize", map[string]any{
"text": text,
"max_length": 100,
})
if err != nil { continue }
summaries = append(summaries, summary)
}
__out = map[string]any{
"query": "Python tutorials",
"summaries": summaries,
}
Step 4: Execute
The code runs in the Yaegi sandbox, calls the actual UTCP tools, and returns structured results—all in one execution.
Why This Works Better
CodeMode UTCP demonstrates that code is a better interface for tool orchestration than special tokens. By leveraging LLMs' natural code-writing abilities, we achieve:
- ✅ Lower latency - One execution vs. multiple round-trips
- ✅ Better composability - Express complex workflows naturally
- ✅ Robust error handling - Standard Go error patterns
- ✅ Full API access - No need to simplify tool interfaces
- ✅ Higher reliability - LLMs excel at writing code
- ✅ Debuggability - Inspect generated code and execution logs
The key insight: LLMs have seen millions of code examples but only synthetic tool-calling data. We should play to their strengths.
Performance Characteristics
Latency Breakdown
- LLM calls: 3 sequential calls (decide, select, generate) ≈ 2-5 seconds
- Code execution: Typically <100ms for simple workflows
- Tool calls: Depends on tool implementation
Optimization Opportunities
- Parallel LLM calls - Decide + Select could run concurrently (see the sketch after this list)
- Caching - Cache tool specs to reduce prompt size
- Streaming - Stream code generation for faster perceived latency
- Compiled mode - Pre-compile common patterns
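A minimal sketch of the first idea, running the decide and select prompts concurrently with golang.org/x/sync/errgroup (method names taken from the pipeline above; this is not part of the current implementation):
// decideAndSelect runs the "decide" and "select" LLM calls concurrently.
// Sketch only; not part of the current implementation.
func (cm *CodeModeUTCP) decideAndSelect(ctx context.Context, query, tools string) (bool, []string, error) {
	var (
		needs    bool
		selected []string
	)
	g, ctx := errgroup.WithContext(ctx)
	g.Go(func() error {
		var err error
		needs, err = cm.decideIfToolsNeeded(ctx, query, tools)
		return err
	})
	g.Go(func() error {
		var err error
		selected, err = cm.selectTools(ctx, query, tools)
		return err
	})
	if err := g.Wait(); err != nil {
		return false, nil, err
	}
	return needs, selected, nil
}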
Limitations and Trade-offs
Current Limitations
- Single-threaded execution - Yaegi runs code sequentially
- No filesystem access - Unless explicitly provided via helpers
- LLM quality dependency - Bad code generation = runtime errors
- Debugging complexity - Stack traces from interpreted code can be cryptic
Design Trade-offs
- Safety vs. Flexibility - Sandboxing limits what code can do
- Simplicity vs. Power - Go DSL is more constrained than Python
- Latency vs. Reliability - Multiple LLM calls increase latency but improve correctness
Try It Yourself
CodeMode UTCP is part of the go-utcp repository:
git clone https://github.com/universal-tool-calling-protocol/go-utcp
cd go-utcp/src/plugins/codemode
go test -v
Check out the README for more examples and API documentation.
Related Work
CodeMode UTCP builds on ideas from:
- Cloudflare's Code Mode - TypeScript-based approach for MCP
- Yaegi - Go interpreter
- UTCP - Universal Tool Calling Protocol
Special thanks to the UTCP community for feedback and contributions.