test: implemented tests for tool call dispatch and tracking

2026-05-01 10:52:56 -06:00
parent 1df6114ff3
commit 53eff10d75
3 changed files with 487 additions and 31 deletions
@@ -0,0 +1,96 @@
+# Iteration 6 — Test Implementation Notes
+
+## Plan file addressed
+
+`docs/testing/plans/06-tool-evaluation.md`
+
+## Tests created
+
+### src/function/mod.rs (36 new tests)
+
+| Test name | What it verifies |
+|---|---|
+| `toolcall_new_sets_fields` | ToolCall::new sets name, arguments, id |
+| `toolcall_default_has_empty_fields` | Default ToolCall has empty/null fields |
+| `toolcall_with_thought_signature` | with_thought_signature sets value |
+| `toolcall_with_thought_signature_none` | with_thought_signature(None) clears |
+| `dedup_removes_duplicate_ids_keeps_last` | Duplicate ids → last occurrence kept |
+| `dedup_keeps_unique_ids` | Unique ids → all kept |
+| `dedup_keeps_calls_without_ids` | No-id calls always kept |
+| `dedup_preserves_last_occurrence_order` | Ordering based on last occurrence position |
+| `dedup_empty_input_returns_empty` | Empty vec → empty result |
+| `dedup_mixed_with_and_without_ids` | Mixed id/no-id dedup behavior |
+| `tracker_default_values` | Default max_repeats=2, chain_len=3 |
+| `tracker_no_loop_on_fresh_tracker` | Fresh tracker returns None |
+| `tracker_no_loop_below_threshold` | Below max_repeats → no loop |
+| `tracker_detects_loop_at_max_repeats` | At max_repeats → loop detected |
+| `tracker_different_args_no_loop` | Different args break loop detection |
+| `tracker_different_names_no_loop` | Different names break loop detection |
+| `tracker_chain_detection` | Chain of identical calls detected |
+| `tracker_record_call_respects_capacity` | Capacity bounded by chain_len * max_repeats |
+| `tracker_loop_message_contains_call_history` | Loop message includes call_history JSON |
+| `prefix_constants_are_correct` | All 6 prefixes: todo__, agent__, user__, mcp_invoke/search/describe |
+| `functions_default_is_empty` | Default Functions has no declarations |
+| `functions_append_todo_adds_declarations` | 5 todo tools: init, add, done, list, clear |
+| `functions_append_supervisor_adds_declarations` | Supervisor: spawn, check, collect, list, cancel, reply |
+| `functions_append_teammate_adds_declarations` | Teammate: send_message, check_inbox |
+| `functions_append_user_interaction_adds_declarations` | User: ask, confirm, input, checkbox |
+| `functions_append_mcp_meta_creates_three_per_server` | 3 MCP meta functions per server |
+| `functions_append_mcp_meta_multiple_servers` | Multiple servers → 3 each |
+| `functions_append_mcp_meta_empty_servers` | Empty servers → no declarations |
+| `functions_find_returns_declaration` | find() returns matching declaration |
+| `functions_find_returns_none_for_missing` | find() returns None for unknown |
+| `functions_contains_true_for_existing` | contains() true for known function |
+| `functions_contains_false_for_missing` | contains() false for unknown |
+| `functions_mcp_invoke_declaration_has_tool_and_arguments_params` | Invoke schema: tool + arguments params |
+| `functions_mcp_search_declaration_has_query_and_top_k_params` | Search schema: query + top_k params |
+| `functions_mcp_describe_declaration_has_tool_param` | Describe schema: tool param |
+| `functions_supervisor_includes_task_queue_tools` | Task queue: create, list, complete, fail |
+| `tool_result_stores_call_and_output` | ToolResult::new stores both fields |
+
+**Total: 36 new tests (212 total in suite)**
+
+## Bugs discovered
+
+None.
+
+## Observations for future iterations
+
+1. **ToolCall::dedup keeps the LAST occurrence**: The implementation
+   iterates in reverse, then reverses the result, so the last
+   occurrence of a duplicate id wins. My initial tests assumed
+   first-wins behavior; I corrected them during the iteration.
+
+2. **ToolCall::eval requires full RequestContext**: The dispatch
+   routing (`agent__*`, `todo__*`, `user__*`, `mcp_*`, shell
+   fallback) cannot be unit-tested because `eval()` takes
+   `&mut RequestContext`, which requires an initialized AppState.
+   The prefix routing is verified only indirectly, through the
+   prefix constant tests and function declaration tests.
+
+3. **Functions::init requires filesystem**: It calls
+   `build_global_tool_declarations`, which reads tool files from
+   disk, so it can't be unit-tested without a temp directory
+   containing actual tool scripts. Function filtering by
+   `enabled_tools` is therefore deferred.
+
+4. **All function declaration appenders are fully testable**: The
+   `append_*` methods on Functions work without I/O and produce
+   the exact function declarations the LLM sees. This is the most
+   important behavioral contract to test.
+
+5. **MCP meta function schemas are critical**: The invoke, search,
+   and describe meta functions each have specific parameter schemas
+   (tool+arguments, query+top_k, tool). Tests verify these schemas
+   exist with correct fields and required params.
+
+6. **ToolCallTracker loop detection has two mechanisms**:
+   - Consecutive repeat detection (same call N times in a row)
+   - Chain detection (same call repeated across the last `chain_len`
+     entries)
+
+   Both are tested independently.
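The keep-last behavior described in observation 1 can be sketched as below. This is a minimal illustration, not the project's code: `Call` is a hypothetical stand-in for `ToolCall` with only the fields the example needs, and `dedup_keep_last` is an assumed name.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for ToolCall; only the fields this sketch needs.
#[derive(Debug, Clone, PartialEq)]
struct Call {
    name: String,
    id: Option<String>,
}

/// Keeps the last occurrence of each id; calls without an id are always kept.
fn dedup_keep_last(calls: Vec<Call>) -> Vec<Call> {
    let mut seen: HashSet<String> = HashSet::new();
    let mut kept: Vec<Call> = Vec::with_capacity(calls.len());

    // Walk backwards so the first time an id is met is its last occurrence.
    for call in calls.into_iter().rev() {
        let is_new = match &call.id {
            Some(id) => seen.insert(id.clone()),
            None => true,
        };
        if is_new {
            kept.push(call);
        }
    }

    // Reverse again to restore the original relative order of the survivors.
    kept.reverse();
    kept
}
```

With the input ids `[1, none, 1, 2]`, the first call is dropped and the remaining three keep their relative order, which is the ordering `dedup_preserves_last_occurrence_order` is described as checking.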
+
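The consecutive-repeat mechanism from observation 6 might look roughly like the sketch below. It is a simplified model under stated assumptions, not the real `ToolCallTracker`: the `Tracker` type, its method signatures, the message text, and the "run length >= max_repeats" threshold are all assumed, and chain detection is left out. Only the defaults (`max_repeats=2`, `chain_len=3`) and the `chain_len * max_repeats` capacity bound come from the notes. Both parameters are assumed to be non-zero.

```rust
use std::collections::VecDeque;

/// Hypothetical, simplified tracker; not the project's ToolCallTracker.
struct Tracker {
    max_repeats: usize,
    chain_len: usize,
    /// (function name, serialized arguments), oldest first.
    history: VecDeque<(String, String)>,
}

impl Tracker {
    fn new(max_repeats: usize, chain_len: usize) -> Self {
        Self { max_repeats, chain_len, history: VecDeque::new() }
    }

    /// History is bounded by chain_len * max_repeats, as the notes describe.
    fn capacity(&self) -> usize {
        self.chain_len * self.max_repeats
    }

    fn record_call(&mut self, name: &str, args: &str) {
        if self.history.len() >= self.capacity() {
            self.history.pop_front(); // drop the oldest entry
        }
        self.history.push_back((name.to_string(), args.to_string()));
    }

    /// Returns a message when the trailing run of identical calls
    /// has reached max_repeats (assumed threshold semantics).
    fn check_loop(&self) -> Option<String> {
        let last = self.history.back()?;
        let run = self.history.iter().rev().take_while(|c| *c == last).count();
        if run >= self.max_repeats {
            Some(format!("loop detected: {} called {} times in a row", last.0, run))
        } else {
            None
        }
    }
}
```

A fresh tracker returns `None`, and a call with the same name but different arguments resets the run, matching the `tracker_no_loop_on_fresh_tracker` and `tracker_different_args_no_loop` descriptions.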
+## Next iteration
+
+Plan file 07: Input Construction — Input::from_str, from_files,
+field capturing, function selection.
@@ -10,47 +10,73 @@ todo tools, and user interaction tools.
 ## Behaviors to test

 ### eval_tool_calls dispatch
- [ ] Calls dispatched to correct handler by function name prefix
- [ ] Tool results returned for each call
- [ ] Multiple concurrent tool calls processed
- [ ] Tool call tracker updated (chain length, repeats)
- [ ] Root agent (depth 0) checks escalation queue after eval
- [ ] Escalation notifications injected into results
+- [ ] Calls dispatched to correct handler by function name prefix (requires RequestContext)
+- [ ] Tool results returned for each call (requires RequestContext)
+- [ ] Multiple concurrent tool calls processed (requires RequestContext)
+- [x] Tool call tracker updated (chain length, repeats)
+- [ ] Root agent (depth 0) checks escalation queue after eval (requires RequestContext)
+- [ ] Escalation notifications injected into results (requires RequestContext)

 ### ToolCall::eval routing
- [ ] agent__* → handle_supervisor_tool
- [ ] todo__* → handle_todo_tool
- [ ] user__* → handle_user_tool (depth 0) or escalate (depth > 0)
- [ ] mcp_invoke_* → invoke_mcp_tool
- [ ] mcp_search_* → search_mcp_tools
- [ ] mcp_describe_* → describe_mcp_tool
- [ ] Other → shell tool execution
+- [ ] agent__* → handle_supervisor_tool (requires RequestContext)
+- [ ] todo__* → handle_todo_tool (requires RequestContext)
+- [ ] user__* → handle_user_tool (depth 0) or escalate (depth > 0) (requires RequestContext)
+- [ ] mcp_invoke_* → invoke_mcp_tool (requires RequestContext + live MCP)
+- [ ] mcp_search_* → search_mcp_tools (requires RequestContext + live MCP)
+- [ ] mcp_describe_* → describe_mcp_tool (requires RequestContext + live MCP)
+- [ ] Other → shell tool execution (requires RequestContext + binary)
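One way the routing rows above could become unit-testable without a `RequestContext` is to separate the prefix classification from the handler invocation. The sketch below shows only that classification step. The `Route` enum and `classify` function are hypothetical, and the literal prefix strings are assumptions inferred from the patterns in this list rather than the project's actual constants.

```rust
/// Hypothetical routing target; the real code calls handlers directly.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Route {
    Supervisor,
    Todo,
    User,
    McpInvoke,
    McpSearch,
    McpDescribe,
    Shell,
}

/// Assumed prefix table, in the order the plan lists the routes.
const ROUTES: [(&str, Route); 6] = [
    ("agent__", Route::Supervisor),
    ("todo__", Route::Todo),
    ("user__", Route::User),
    ("mcp_invoke_", Route::McpInvoke),
    ("mcp_search_", Route::McpSearch),
    ("mcp_describe_", Route::McpDescribe),
];

/// Pure classification by name prefix; anything unmatched falls back to Shell.
fn classify(name: &str) -> Route {
    ROUTES
        .iter()
        .find(|(prefix, _)| name.starts_with(*prefix))
        .map(|(_, route)| *route)
        .unwrap_or(Route::Shell)
}
```

A pure function like this needs no AppState, so each routing row could be covered by a one-line assertion, leaving only the handler side effects for integration tests.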

 ### Shell tool execution
- [ ] Tool binary found and executed
- [ ] Arguments passed correctly
- [ ] Environment variables set (LLM_OUTPUT, etc.)
- [ ] Tool output returned as result
- [ ] Tool failure → error returned as tool result (not panic)
+- [ ] Tool binary found and executed (integration test)
+- [ ] Arguments passed correctly (integration test)
+- [ ] Environment variables set (LLM_OUTPUT, etc.) (integration test)
+- [ ] Tool output returned as result (integration test)
+- [ ] Tool failure → error returned as tool result (not panic) (integration test)

 ### Tool call tracking
- [ ] Tracker counts consecutive identical calls
- [ ] Max repeats triggers warning
- [ ] Chain length tracked across turns
- [ ] Tracker state preserved across tool-result loops
+- [x] Tracker counts consecutive identical calls
+- [x] Max repeats triggers warning
+- [x] Chain length tracked across turns
+- [x] Tracker state preserved across tool-result loops

 ### Function selection
- [ ] select_functions filters by role's enabled_tools
- [ ] select_functions includes MCP meta functions for enabled servers
- [ ] select_functions includes agent functions when agent active
- [ ] "all" enables all functions
- [ ] Comma-separated list enables specific functions
+- [ ] select_functions filters by role's enabled_tools (requires filesystem)
+- [x] select_functions includes MCP meta functions for enabled servers (via append_mcp_meta tests)
+- [x] select_functions includes agent functions when agent active (via append tests)
+- [ ] "all" enables all functions (requires filesystem)
+- [ ] Comma-separated list enables specific functions (requires filesystem)
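The two filesystem-gated rows ("all" and the comma-separated list) describe a filtering rule that could be exercised on its own if it were factored out of `Functions::init`. The sketch below is an assumption about that rule, not the real `select_functions`: `select_enabled` is a hypothetical helper, and the tool names in the usage note are invented for illustration.

```rust
/// Hypothetical helper: filters declared tool names by an enabled_tools spec.
/// "all" keeps everything; otherwise the spec is a comma-separated list.
fn select_enabled<'a>(declared: &[&'a str], enabled_tools: &str) -> Vec<&'a str> {
    let spec = enabled_tools.trim();
    if spec == "all" {
        return declared.to_vec();
    }

    // Split on commas, tolerate surrounding whitespace, skip empty entries.
    let wanted: Vec<&str> = spec
        .split(',')
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect();

    // Preserve declaration order rather than the order of the spec.
    declared
        .iter()
        .copied()
        .filter(|name| wanted.contains(name))
        .collect()
}
```

For example, with the made-up declarations `fs_cat`, `fs_ls`, and `web_search`, the spec `"fs_ls, web_search"` selects the last two, and a spec naming an unknown tool selects nothing.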

 ## Context switching scenarios
- [ ] Tool calls during agent → agent tools available
- [ ] Tool calls during role → role tools available
- [ ] Tool calls with MCP → MCP invoke/search/describe work
- [ ] No agent → no agent__/todo__ tools in declarations
+- [ ] Tool calls during agent → agent tools available (integration test)
+- [ ] Tool calls during role → role tools available (integration test)
+- [ ] Tool calls with MCP → MCP invoke/search/describe work (integration test)
+- [x] No agent → no agent__/todo__ tools in declarations (via Functions::default)
+
+## Additional behaviors tested (not in original plan)
+
+- [x] ToolCall::new sets name, arguments, id correctly
+- [x] ToolCall::default has empty/null fields
+- [x] ToolCall::with_thought_signature sets and clears
+- [x] ToolCall::dedup keeps last occurrence for duplicate ids
+- [x] ToolCall::dedup keeps all calls without ids
+- [x] ToolCall::dedup empty input returns empty
+- [x] ToolCall::dedup mixed with/without ids
+- [x] ToolCallTracker default values (max_repeats=2, chain_len=3)
+- [x] ToolCallTracker no loop on fresh tracker
+- [x] ToolCallTracker no loop below threshold
+- [x] ToolCallTracker different args break loop
+- [x] ToolCallTracker different names break loop
+- [x] ToolCallTracker record_call respects capacity
+- [x] ToolCallTracker loop message includes call_history
+- [x] All 6 prefix constants verified
+- [x] Functions::append_todo adds all 5 todo tools
+- [x] Functions::append_supervisor adds spawn/check/collect/list/cancel/reply + task queue
+- [x] Functions::append_teammate adds send_message/check_inbox
+- [x] Functions::append_user_interaction adds ask/confirm/input/checkbox
+- [x] Functions::append_mcp_meta creates 3 per server with correct schemas
+- [x] Functions::append_mcp_meta empty servers → no declarations
+- [x] Functions::find/contains work correctly
+- [x] ToolResult::new stores call and output

 ## Old code reference
 - `src/function/mod.rs` — eval_tool_calls, ToolCall::eval