Test Plan: Tool Evaluation

Feature description

When the LLM returns tool calls, eval_tool_calls dispatches each call to the appropriate handler. Handlers include: shell tools (bash/python/ts scripts), MCP tools, supervisor tools (agent spawn), todo tools, and user interaction tools.

Behaviors to test

eval_tool_calls dispatch

  • Calls dispatched to correct handler by function name prefix (requires RequestContext)
  • Tool results returned for each call (requires RequestContext)
  • Multiple concurrent tool calls processed (requires RequestContext)
  • Tool call tracker updated (chain length, repeats)
  • Root agent (depth 0) checks escalation queue after eval (requires RequestContext)
  • Escalation notifications injected into results (requires RequestContext)

ToolCall::eval routing

  • agent__* → handle_supervisor_tool (requires RequestContext)
  • todo__* → handle_todo_tool (requires RequestContext)
  • user__* → handle_user_tool (depth 0) or escalate (depth > 0) (requires RequestContext)
  • mcp_invoke_* → invoke_mcp_tool (requires RequestContext + live MCP)
  • mcp_search_* → search_mcp_tools (requires RequestContext + live MCP)
  • mcp_describe_* → describe_mcp_tool (requires RequestContext + live MCP)
  • Other → shell tool execution (requires RequestContext + binary)
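
The routing table above can be sketched as a pure function over the function name and the agent depth. This is a minimal illustration, not the code in src/function/mod.rs: the `Route` enum and the `route` function are hypothetical names, and only the prefixes and the depth rule come from the list above. Keeping the decision free of RequestContext is one way to unit test routing without a live context.

```rust
/// Hypothetical handler categories mirroring the routing list above.
#[derive(Debug, PartialEq, Eq)]
enum Route {
    Supervisor,
    Todo,
    User,
    Escalate,
    McpInvoke,
    McpSearch,
    McpDescribe,
    Shell,
}

/// Pick a route from the function name prefix and the agent depth.
/// Assumption: prefixes are checked in this order and never overlap.
fn route(name: &str, depth: usize) -> Route {
    if name.starts_with("agent__") {
        Route::Supervisor
    } else if name.starts_with("todo__") {
        Route::Todo
    } else if name.starts_with("user__") {
        // Only the root agent (depth 0) talks to the user directly.
        if depth == 0 {
            Route::User
        } else {
            Route::Escalate
        }
    } else if name.starts_with("mcp_invoke_") {
        Route::McpInvoke
    } else if name.starts_with("mcp_search_") {
        Route::McpSearch
    } else if name.starts_with("mcp_describe_") {
        Route::McpDescribe
    } else {
        // Anything without a known prefix falls through to a shell tool.
        Route::Shell
    }
}
```

A table-driven test can then enumerate one name per prefix plus one unprefixed name and assert the expected variant for depth 0 and depth 1.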

Shell tool execution

  • Tool binary found and executed (integration test)
  • Arguments passed correctly (integration test)
  • Environment variables set (LLM_OUTPUT, etc.) (integration test)
  • Tool output returned as result (integration test)
  • Tool failure → error returned as tool result (not panic) (integration test)
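
The "error returned as tool result (not panic)" behavior can be illustrated with `std::process::Command`. This is a sketch under assumptions: `run_tool` and `tool_result_text` are hypothetical helpers, and the value passed in `LLM_OUTPUT` is treated as an opaque string because the plan does not say what it holds.

```rust
use std::process::Command;

/// Run a tool binary and capture its stdout.
/// Spawn failures and non-zero exits become `Err`, never a panic.
fn run_tool(binary: &str, args: &[&str], llm_output: &str) -> Result<String, String> {
    let output = Command::new(binary)
        .args(args)
        // Variable name taken from the plan; its contents are an assumption.
        .env("LLM_OUTPUT", llm_output)
        .output()
        .map_err(|e| format!("failed to start `{binary}`: {e}"))?;

    if output.status.success() {
        Ok(String::from_utf8_lossy(&output.stdout).into_owned())
    } else {
        Err(format!(
            "`{binary}` exited with {}: {}",
            output.status,
            String::from_utf8_lossy(&output.stderr).trim()
        ))
    }
}

/// Fold success and failure into the text handed back to the LLM.
fn tool_result_text(result: Result<String, String>) -> String {
    match result {
        Ok(out) => out,
        Err(err) => format!("error: {err}"),
    }
}
```

An integration test for the failure path can point `run_tool` at a binary name that does not exist and assert that the result text starts with `error:`.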

Tool call tracking

  • Tracker counts consecutive identical calls
  • Max repeats triggers warning
  • Chain length tracked across turns
  • Tracker state preserved across tool-result loops
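
The tracking behaviors above can be sketched with a bounded history of (name, arguments) pairs. `RepeatTracker` is a hypothetical stand-in for the real tracker, not its implementation. It assumes a loop is reported once the number of consecutive identical calls reaches `max_repeats`, and that the history drops its oldest entry when full.

```rust
use std::collections::VecDeque;

/// Hypothetical tracker: remembers recent calls and flags repeats.
struct RepeatTracker {
    max_repeats: usize,
    capacity: usize,
    history: VecDeque<(String, String)>,
}

impl RepeatTracker {
    fn new(max_repeats: usize, capacity: usize) -> Self {
        Self {
            max_repeats,
            // A capacity of zero would make the history useless; clamp to 1.
            capacity: capacity.max(1),
            history: VecDeque::new(),
        }
    }

    /// Record a call, evicting the oldest entry when the history is full.
    fn record_call(&mut self, name: &str, args: &str) {
        if self.history.len() >= self.capacity {
            self.history.pop_front();
        }
        self.history.push_back((name.to_string(), args.to_string()));
    }

    /// Count trailing entries identical to the most recent call.
    fn consecutive_repeats(&self) -> usize {
        match self.history.back() {
            None => 0,
            Some(last) => self
                .history
                .iter()
                .rev()
                .take_while(|call| *call == last)
                .count(),
        }
    }

    /// Assumed threshold rule: a loop starts at `max_repeats` identical calls.
    fn is_looping(&self) -> bool {
        self.consecutive_repeats() >= self.max_repeats
    }
}
```

Because a call with a different name or different arguments resets the trailing count, the same structure covers the "different args breaks loop" and "different names breaks loop" cases.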

Function selection

  • select_functions filters by role's enabled_tools (requires filesystem)
  • select_functions includes MCP meta functions for enabled servers
  • select_functions includes agent functions when agent active (via append tests)
  • "all" enables all functions (requires filesystem)
  • Comma-separated list enables specific functions (requires filesystem)
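
The "all" and comma-separated cases can be sketched as a filter over the available function names. `select_enabled` is a hypothetical helper, not the real `select_functions`. It assumes whitespace around entries is ignored and that the result keeps the order of the available list.

```rust
/// Filter `available` by an `enabled_tools` spec.
/// "all" keeps everything; otherwise the spec is a comma-separated list.
fn select_enabled<'a>(available: &[&'a str], enabled_tools: &str) -> Vec<&'a str> {
    let spec = enabled_tools.trim();
    if spec == "all" {
        return available.to_vec();
    }

    // Split the list, tolerating spaces and empty entries such as "a,,b".
    let wanted: Vec<&str> = spec
        .split(',')
        .map(str::trim)
        .filter(|entry| !entry.is_empty())
        .collect();

    available
        .iter()
        .copied()
        .filter(|name| wanted.iter().any(|w| *w == *name))
        .collect()
}
```

Separating the string parsing from the filesystem lookup in this way would let the list-handling cases run as plain unit tests, leaving only function discovery to the filesystem-backed tests.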

Context switching scenarios

  • Tool calls during agent → agent tools available (integration test)
  • Tool calls during role → role tools available (integration test)
  • Tool calls with MCP → MCP invoke/search/describe work (integration test)
  • No agent → no agent__/todo__ tools in declarations (via Functions::default)

Additional behaviors tested (not in original plan)

  • ToolCall::new sets name, arguments, id correctly
  • ToolCall::default has empty/null fields
  • ToolCall::with_thought_signature sets and clears
  • ToolCall::dedup keeps last occurrence for duplicate ids
  • ToolCall::dedup keeps all calls without ids
  • ToolCall::dedup empty input returns empty
  • ToolCall::dedup mixed with/without ids
  • ToolCallTracker default values (max_repeats=2, chain_len=3)
  • ToolCallTracker no loop on fresh tracker
  • ToolCallTracker no loop below threshold
  • ToolCallTracker different args breaks loop
  • ToolCallTracker different names breaks loop
  • ToolCallTracker record_call respects capacity
  • ToolCallTracker loop message includes call_history
  • All 6 prefix constants verified
  • Functions::append_todo adds all 5 todo tools
  • Functions::append_supervisor adds spawn/check/collect/list/cancel/reply + task queue
  • Functions::append_teammate adds send_message/check_inbox
  • Functions::append_user_interaction adds ask/confirm/input/checkbox
  • Functions::append_mcp_meta creates 3 per server with correct schemas
  • Functions::append_mcp_meta empty servers → no declarations
  • Functions::find/contains work correctly
  • ToolResult::new stores call and output
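
The dedup behaviors listed above (keep the last occurrence per id, keep every call without an id) can be sketched with a reverse pass and a set of seen ids. The `Call` struct and the free function `dedup` are hypothetical simplifications of `ToolCall` and `ToolCall::dedup`, and preserving the relative order of the surviving calls is an assumption.

```rust
use std::collections::HashSet;

/// Hypothetical, trimmed-down stand-in for a tool call.
#[derive(Debug, Clone, PartialEq)]
struct Call {
    name: String,
    id: Option<String>,
}

/// Keep the last call for each id and every call that has no id.
fn dedup(calls: Vec<Call>) -> Vec<Call> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut kept: Vec<Call> = Vec::new();

    // Walk backwards so the first time an id is seen is its last occurrence.
    for call in calls.into_iter().rev() {
        let keep = match &call.id {
            Some(id) => seen.insert(id.clone()),
            None => true,
        };
        if keep {
            kept.push(call);
        }
    }

    // Restore the original relative order of the survivors.
    kept.reverse();
    kept
}
```

The mixed case from the list then becomes a single assertion: for calls `a(id 1)`, `b(no id)`, `c(id 1)`, `d(no id)`, this sketch keeps `b`, `c`, `d`.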

Old code reference

  • src/function/mod.rs — eval_tool_calls, ToolCall::eval
  • src/function/supervisor.rs — handle_supervisor_tool
  • src/function/todo.rs — handle_todo_tool
  • src/function/user_interaction.rs — handle_user_tool