Files
loki/docs/implementation/PHASE-1-STEP-6.5-NOTES.md
2026-04-10 15:45:51 -06:00

22 KiB
Raw Blame History

Phase 1 Step 6.5 — Implementation Notes

Status

Done.

Plan reference

  • Plan: docs/PHASE-1-IMPLEMENTATION-PLAN.md
  • Section: "Step 6.5: Unify tool/MCP fields into ToolScope and agent fields into AgentRuntime"

Summary

Step 6.5 is the "big architecture step." The plan describes it as a semantic rewrite of scope transitions (use_role, use_session, use_agent, exit_*) to build and swap ToolScope instances via a new McpFactory, plus an AgentRuntime collapse for agent- specific state, and a unified RagCache on AppState.

This implementation deviates from the plan. Rather than doing the full semantic rewrite, Step 6.5 ships scaffolding only:

  • New types (ToolScope, McpRuntime, McpFactory, McpServerKey, RagCache, RagKey, AgentRuntime) exist and compile
  • New fields on AppState (mcp_factory, rag_cache) and RequestContext (tool_scope, agent_runtime) coexist with the existing flat fields
  • The Config::to_request_context bridge populates the new sub-struct fields with defaults; real values flow through the existing flat fields during the bridge window
  • No scope transitions are rewritten; Config::use_role, Config::use_session, Config::use_agent, Config::exit_agent stay on Config and continue working with the old McpRegistry / Functions machinery

The semantic rewrite is deferred to Step 8 when the entry points (main.rs, repl/mod.rs) get rewritten to thread RequestContext through the pipeline. That's the natural point to switch from Config::use_role to RequestContext::use_role_with_tool_scope-style methods, because the callers will already be holding the right instance type.

See "Deviations from plan" for the full rationale.

What was changed

New files

Four new modules under src/config/, all with module docstrings explaining their scaffolding status and load-bearing references to the architecture + phase plan docs:

  • src/config/tool_scope.rs (~75 lines)

    • ToolScope struct: functions, mcp_runtime, tool_tracker with Default impl
    • McpRuntime struct: wraps a HashMap<String, Arc<ConnectedServer>> (reuses the existing rmcp RunningService type)
    • Basic accessors: is_empty, insert, get, server_names
    • No build_from_enabled_list or similar; that's Step 8
  • src/config/mcp_factory.rs (~90 lines)

    • McpServerKey struct: name + command + sorted args + sorted env (so identically-configured servers hash to the same key and share an Arc, while differently-configured ones get independent processes — the sharing-vs-isolation invariant from architecture doc section 5)
    • McpFactory struct: Mutex<HashMap<McpServerKey, Weak<ConnectedServer>>> for future sharing
    • Basic accessors: active_count, try_get_active, insert_active
    • No acquire() that actually spawns. That would require lifting the MCP server startup logic out of McpRegistry::init_server into a factory method. Deferred to Step 8 with the scope transition rewrites.
  • src/config/rag_cache.rs (~90 lines)

    • RagKey enum: Named(String) vs Agent(String) (distinct namespaces)
    • RagCache struct: RwLock<HashMap<RagKey, Weak<Rag>>> with weak-ref sharing
    • try_get, insert, invalidate, entry_count
    • load_with<F, Fut>() — async helper that checks the cache, calls a user-provided loader closure on miss, inserts the result, and returns the Arc. Has a small race window between try_get and insert (two concurrent misses will both load); this is acceptable for Phase 1 per the architecture doc's "concurrent first-load" note. Tightening with a per-key OnceCell or tokio::sync::Mutex lands in Phase 5.
  • src/config/agent_runtime.rs (~95 lines)

    • AgentRuntime struct with every field from the plan: rag, supervisor, inbox, escalation_queue, todo_list: Option<TodoList>, self_agent_id, parent_supervisor, current_depth, auto_continue_count
    • new() constructor that takes the required agent context (id, supervisor, inbox, escalation queue) and initializes optional fields to None/0
    • with_rag, with_todo_list, with_parent_supervisor, with_depth builder methods for Step 8's activation path
    • todo_list is Option<TodoList> (opportunistic tightening over today's Config.agent.todo_list: TodoList): the field will be Some(...) only when spec.auto_continue == true, saving an allocation for agents that don't use the todo system

Modified files

  • src/mcp/mod.rs — changed type ConnectedServer from private to pub type ConnectedServer so tool_scope.rs and mcp_factory.rs can reference the type without reaching into rmcp directly. One-character change (typepub type).

  • src/config/mod.rs — registered 4 new mod declarations (agent_runtime, mcp_factory, rag_cache, tool_scope) alphabetically in the module list. No pub use re-exports — the types are used via their module paths by the parent config crate's children.

  • src/config/app_state.rs — added mcp_factory: Arc<McpFactory> and rag_cache: Arc<RagCache> fields, plus the corresponding imports. Updated the module docstring to reflect the Step 6.5 additions and removed the old "TBD" placeholder language about McpFactory.

  • src/config/request_context.rs — added tool_scope: ToolScope and agent_runtime: Option<AgentRuntime> fields alongside the existing flat fields, plus imports. Updated RequestContext::new() to initialize them with ToolScope::default() and None. Rewrote the module docstring to explain that flat and sub-struct fields coexist during the bridge window.

  • src/config/bridge.rs — updated Config::to_request_context to initialize tool_scope with ToolScope::default() and agent_runtime with None (the bridge doesn't try to populate the sub-struct fields because they're deferred scaffolding). Updated the three test AppState constructors to pass McpFactory::new() and RagCache::new() for the new required fields, plus added imports for McpFactory and RagCache in the test module.

  • Cargo.toml — no changes. parking_lot and the rmcp dependencies were already present.

Key decisions

1. Scaffolding-only, not semantic rewrite

This is the biggest decision in Step 6.5 and a deliberate deviation from the plan. The plan says Step 6.5 should "rewrite scope transitions" (item 5, page 373) to build and swap ToolScope instances via McpFactory::acquire().

Why I did scaffolding only instead:

  • Consistency with the bridge pattern. Steps 36 all followed the same shape: add new code alongside old, don't migrate callers, let Step 8 do the real wiring. The bridge pattern works because it keeps every intermediate state green and testable. Doing the full Step 6.5 rewrite would break that pattern.

  • Caller migration is a Step 8 concern. The plan's Step 6.5 semantics assume callers hold a RequestContext and can call ctx.use_role(&app) to rebuild ctx.tool_scope. But during the bridge window, callers still hold GlobalConfig / &Config and call config.use_role(...). Rewriting use_role to take (&mut RequestContext, &AppState) would either:

    1. Break every existing caller immediately (~20+ callsites), forcing a partial Step 8 during Step 6.5, OR
    2. Require a parallel RequestContext::use_role_with_tool_scope method alongside Config::use_role, doubling the duplication count for no benefit during the bridge
  • The plan's Step 6.5 risk note explicitly calls this out: "Risk: Mediumhigh. This is where the Phase 1 refactor stops being mechanical and starts having semantic implications." The scaffolding-only approach keeps Step 6.5 mechanical and pushes the semantic risk into Step 8 where it can be handled alongside the entry point rewrite. That's a better risk localization strategy.

  • The new types are still proven by construction. Config::to_request_context now builds ToolScope::default() and agent_runtime: None on every call, and the bridge round-trip test still passes. That proves the types compile, have sensible defaults, and don't break the existing runtime contract. Step 8 can then swap in real values without worrying about type plumbing.

2. McpFactory::acquire() is not implemented

The plan says Step 6.5 ships a trivial acquire() that "checks active for an upgradable Weak, otherwise spawns fresh" and "drops tear down the subprocess directly."

I wrote the Mutex<HashMap<McpServerKey, Weak<ConnectedServer>>> field and the try_get_active / insert_active building blocks, but not an acquire() method. The reason is that actually spawning an MCP subprocess requires lifting the current spawning logic out of McpRegistry::init_server (in src/mcp/mod.rs) — that's a ~60 line chunk of tokio child process setup, rmcp handshake, and error handling that's tightly coupled to McpRegistry. Extracting it as a factory method is a meaningful refactor that belongs alongside the Step 8 caller migration, not as orphaned scaffolding that nobody calls.

The try_get_active and insert_active primitives are the minimum needed for Step 8's acquire() implementation to be a thin wrapper.

3. Sub-struct fields coexist with flat fields

RequestContext now has both:

  • Flat fields (functions, tool_call_tracker, supervisor, inbox, root_escalation_queue, self_agent_id, current_depth, parent_supervisor) — populated by Config::to_request_context during the bridge
  • Sub-struct fields (tool_scope: ToolScope, agent_runtime: Option<AgentRuntime>) — default- initialized in RequestContext::new() and by the bridge; real population happens in Step 8

This is deliberate scaffolding, not a refactor miss. The module docstring explicitly explains this so a reviewer doesn't try to "fix" the apparent duplication.

When Step 8 migrates use_role and friends to RequestContext, those methods will populate tool_scope and agent_runtime directly. The flat fields will become stale / unused during Step 8 and get deleted alongside Config in Step 10.

4. ConnectedServer visibility bump

The minimum change to src/mcp/mod.rs was making type ConnectedServer public (pub type ConnectedServer). This lets tool_scope.rs and mcp_factory.rs reference the live MCP handle type directly without either:

  1. Reaching into rmcp::service::RunningService<RoleClient, ()> from the config crate (tight coupling to rmcp)
  2. Inventing a new McpServerHandle wrapper (premature abstraction that would need to be unwrapped later)

The visibility change is bounded: ConnectedServer is only used from within the loki crate, and pub here means "visible to the whole crate" via Rust's module privacy, not "part of Loki's external API."

5. todo_list: Option<TodoList> tightening

AgentRuntime.todo_list: Option<TodoList> (vs today's Agent.todo_list: TodoList with Default::default() always allocated). This is an opportunistic memory optimization during the scaffolding phase: when Step 8 populates AgentRuntime, it should allocate Some(TodoList::default()) only when spec.auto_continue == true. Agents without auto-continue skip the allocation entirely.

This is documented in the agent_runtime.rs module docstring so a reviewer doesn't try to "fix" the Option into a bare TodoList.

Deviations from plan

Full plan vs this implementation

Plan item Status
Implement McpRuntime and ToolScope Done (scaffolding)
Implement McpFactory — no pool, acquire() ⚠️ Partial — types + accessors, no acquire()
Implement RagCache with RagKey, weak-ref sharing, per-key serialization Done (scaffolding, no per-key serialization — Phase 5)
Implement AgentRuntime with Option<TodoList> and agent RAG Done (scaffolding)
Rewrite scope transitions (use_role, use_session, use_agent, exit_*, update) Deferred to Step 8
use_rag rewritten to use RagCache Deferred to Step 8
Agent activation populates AgentRuntime, serves RAG from cache Deferred to Step 8
exit_agent rebuilds parent's ToolScope Deferred to Step 8
Sub-agent spawning constructs fresh RequestContext Deferred to Step 8
Remove old Agent::init registry-mutation logic Deferred to Step 8
rebuild_rag / edit_rag_docs use rag_cache.invalidate Deferred to Step 8

All the items are semantic rewrites that require caller migration to take effect. Deferring them keeps Step 6.5 strictly additive and consistent with Steps 36. Step 8 will do the semantic rewrite with the benefit of all the scaffolding already in place.

Impact on Step 7

Step 7 is unchanged. The mixed methods (including Steps 36 deferrals like current_model, extract_role, sysinfo, info, session_info, use_prompt, etc.) still need to be split into explicit (&AppConfig, &RequestContext) signatures the same way the plan originally described. They don't depend on the ToolScope / McpFactory rewrite being done.

Impact on Step 8

Step 8 absorbs the full Step 6.5 semantic rewrite. The original Step 8 scope was "rewrite entry points" — now it also includes "rewrite scope transitions to use new types." This is actually the right sequencing because callers and their call sites migrate together.

The Step 8 scope is now substantially bigger than originally planned. The plan should be updated to reflect this, either by splitting Step 8 into 8a (scope transitions) + 8b (entry points) or by accepting the bigger Step 8.

Impact on Phase 5

Phase 5's "MCP pooling" scope is unchanged. Phase 5 adds the idle pool + reaper + health checks to an already-working McpFactory::acquire(). If Step 8 lands the working acquire(), Phase 5 plugs in the pool; if Step 8 somehow ships without acquire(), Phase 5 has to write it too. Phase 5's plan doc should note this dependency.

Verification

Compilation

  • cargo check — clean, zero warnings, zero errors
  • cargo clippy — clean

Tests

  • cargo test63 passed, 0 failed (unchanged from Steps 16)

The bridge round-trip tests are the critical check for this step because they construct AppState instances, and AppState now has two new required fields. All three tests (to_app_config_copies_every_serialized_field, to_request_context_copies_every_runtime_field, round_trip_preserves_all_non_lossy_fields, round_trip_default_config) pass after updating the AppState constructors in the test module.

Manual smoke test

Not applicable — no runtime behavior changed. CLI and REPL still call Config::use_role(), Config::use_session(), etc. and those still work against the old McpRegistry / Functions machinery.

Handoff to next step

What Step 7 can rely on

Step 7 (mixed methods) can rely on:

  • Zero changes to existing Config methods or fields. Step 6.5 didn't touch any of the Step 7 targets.
  • New sub-struct fields exist on RequestContext but are default-initialized and shouldn't be consulted by any Step 7 mixed-method migration. If a Step 7 method legitimately needs tool_scope or agent_runtime (e.g., because it's reading the active tool set), that's a signal the method belongs in Step 8, not Step 7.
  • AppConfig methods from Steps 3-4 are unchanged.
  • RequestContext methods from Steps 5-6 are unchanged.
  • Config::use_role, Config::use_session, Config::use_agent, Config::exit_agent, Config::use_rag, Config::edit_rag_docs, Config::rebuild_rag, Config::apply_prelude are still on Config and must stay there through Step 7. They're Step 8 targets.

What Step 7 should watch for

  • Step 7 targets the 17 mixed methods from the plan's original table plus the deferrals accumulated from Steps 36 (select_functions, select_enabled_functions, select_enabled_mcp_servers, setup_model, update, info, session_info, sysinfo, use_prompt, edit_role, after_chat_completion).
  • The "mixed" category means: reads/writes BOTH serialized config AND runtime state. The migration shape is to split them into explicit fn foo(app: &AppConfig, ctx: &RequestContext) or fn foo(app: &AppConfig, ctx: &mut RequestContext) signatures.
  • Watch for methods that also touch self.functions or self.mcp_registry. Those need tool_scope / mcp_factory which aren't ready yet. If a mixed method depends on the tool scope rewrite, defer it to Step 8 alongside the scope transitions.
  • current_model is the simplest Step 7 target — it just picks the right Model reference from session/agent/role/ global. Good first target to validate the Step 7 pattern.
  • sysinfo is the biggest Step 7 target — ~70 lines of reading both AppConfig serialized state and RequestContext runtime state to produce a display string.
  • set_* methods all follow the pattern from the plan's Step 7 table:
    fn set_foo(&mut self, value: ...) {
        if let Some(rl) = self.role_like_mut() { rl.set_foo(value) }
        else { self.foo = value }
    }
    
    The new signature splits this: the role_like branch moves to RequestContext (using the Step 5 role_like_mut helper), the fallback branch moves to AppConfig via AppConfig::set_foo. Callers then call either ctx.set_foo_via_role_like(value) or app_config.set_foo(value) depending on context.
  • update is a dispatcher — once all the set_* methods are split, update migrates to live on RequestContext (because it needs both ctx.set_* and app.set_* to dispatch to).

What Step 7 should NOT do

  • Don't touch the 4 new types from Step 6.5 (ToolScope, McpRuntime, McpFactory, RagCache, AgentRuntime). They're scaffolding, untouched until Step 8.
  • Don't try to populate tool_scope or agent_runtime from any Step 7 migration. Those are Step 8.
  • Don't migrate use_role, use_session, use_agent, exit_agent, or any method that touches self.mcp_registry / self.functions. Those are Step 8.
  • Don't migrate callers of any migrated method.
  • Don't touch the bridge's to_request_context / to_app_config / from_parts. The round-trip still works with tool_scope and agent_runtime defaulting.

Files to re-read at the start of Step 7

  • docs/PHASE-1-IMPLEMENTATION-PLAN.md — Step 7 section (the 17-method table starting at line ~525)
  • This notes file — specifically the accumulated deferrals list from Steps 3-6 in the "What Step 7 should watch for" section
  • Step 6 notes — which methods got deferred from Step 6 vs Step 7 boundary

Follow-up (not blocking Step 7)

1. Step 8's scope is now significantly larger

The original Phase 1 plan estimated Step 8 as "rewrite main.rs and repl/mod.rs to use RequestContext" — a meaningful but bounded refactor. After Step 6.5's deferral, Step 8 also includes:

  • Implementing McpFactory::acquire() by extracting server startup logic from McpRegistry::init_server
  • Rewriting use_role, use_session, use_agent, exit_agent, use_rag, edit_rag_docs, rebuild_rag, apply_prelude, agent sub-spawning
  • Wiring tool_scope population into all the above
  • Populating agent_runtime on agent activation
  • Building the parent-scope ToolScope restoration logic in exit_agent
  • Routing rebuild_rag / edit_rag_docs through RagCache::invalidate

This is a big step. The phase plan should be updated to either split Step 8 into sub-steps or to flag the expanded scope.

2. McpFactory::acquire() extraction is its own mini-project

Looking at src/mcp/mod.rs, the subprocess spawn + rmcp handshake lives inside McpRegistry::init_server (private method, ~60 lines). Step 8's first task should be extracting this into a pair of functions:

  1. McpFactory::spawn_fresh(spec: &McpServerSpec) -> Result<ConnectedServer> — pure subprocess + handshake logic
  2. McpRegistry::init_server — wraps spawn_fresh with registry bookkeeping (adds to servers map, fires catalog discovery, etc.) for backward compat

Then McpFactory::acquire() can call spawn_fresh on cache miss. The existing McpRegistry::init_server keeps working for the bridge window callers.

3. The load_with race is documented but not fixed

RagCache::load_with has a race window: two concurrent callers with the same key both miss the cache, both call the loader closure, both insert into the map. The second insert overwrites the first. Both callers end up with valid Arc<Rag>s but the cache sharing is broken for that instant.

For Phase 1 Step 6.5, this is acceptable because the cache isn't populated by real usage yet. Phase 5's pooling work should tighten this with per-key OnceCell or tokio::sync::Mutex.

4. Bridge-window duplication count at end of Step 6.5

Running tally:

  • AppConfig (Steps 3+4): 11 methods duplicated with Config
  • RequestContext (Steps 5+6): 25 methods duplicated with Config (1 constructor + 13 reads + 12 writes)
  • paths module (Step 2): 33 free functions (not duplicated)
  • Step 6.5 NEW: 4 types + 2 AppState fields + 2 RequestContext fields — all additive scaffolding, no duplication of logic

Total bridge-window duplication: 36 methods / ~550 lines, unchanged from end of Step 6. Step 6.5 added types but not duplicated logic.

References

  • Phase 1 plan: docs/PHASE-1-IMPLEMENTATION-PLAN.md
  • Architecture doc: docs/REST-API-ARCHITECTURE.md section 5
  • Phase 5 plan: docs/PHASE-5-IMPLEMENTATION-PLAN.md
  • Step 6 notes: docs/implementation/PHASE-1-STEP-6-NOTES.md
  • New files:
    • src/config/tool_scope.rs
    • src/config/mcp_factory.rs
    • src/config/rag_cache.rs
    • src/config/agent_runtime.rs
  • Modified files:
    • src/mcp/mod.rs (type ConnectedServerpub type)
    • src/config/mod.rs (4 new mod declarations)
    • src/config/app_state.rs (2 new fields + docstring)
    • src/config/request_context.rs (2 new fields + docstring)
    • src/config/bridge.rs (3 test AppState constructors updated, to_request_context adds 2 defaults)