loki/docs/REST-API-ARCHITECTURE.md

Architecture Plan: Loki REST API Service Mode

The Core Problem

Today, Loki's Config struct is a god object — it holds both server-wide configuration (LLM providers, vault, tool definitions) and per-interaction mutable state (current role, session, agent, supervisor, inbox, tool tracker) in one Arc<RwLock<Config>>. CLI and REPL both mutate this singleton directly. Adding a third interface (REST API) that handles concurrent users makes this untenable.

Design Pattern: Engine + Context + Emitter

The refactor splits Loki into three layers:

┌─────────┐  ┌─────────┐  ┌─────────┐
│   CLI   │  │  REPL   │  │   API   │   ← Thin adapters (frontends)
└────┬────┘  └────┬────┘  └────┬────┘
     │            │            │
     ▼            ▼            ▼
   ┌──────────────────────────────┐
   │     RunRequest + Emitter     │   ← Uniform request shape
   └──────────────┬───────────────┘
                  ▼
   ┌──────────────────────────────┐
   │          Engine::run()       │   ← Single core entrypoint
    │  (input → messages → LLM     │
    │   → tool loop → events)      │
   └──────────────┬───────────────┘
                  │
     ┌────────────┼────────────┐
     ▼            ▼            ▼
  AppState   RequestContext  SessionStore
  (global,   (per-request,  (file-backed,
   immutable) mutable)       per-session lock)

1. Split Config → AppState (global) + RequestContext (per-request)

AppState — created once at startup, wrapped in Arc, never mutated during requests:

#[derive(Clone)]
pub struct AppState {
    pub config: Arc<AppConfig>,           // deserialized config.yaml (frozen)
    pub providers: ProviderRegistry,      // LLM client configs + OAuth tokens
    pub vault: Arc<VaultService>,         // encrypted credential storage (internal locking)
    pub tools: Arc<ToolRegistry>,         // tool definitions, function dirs, visible_tools
    pub mcp_global: Arc<McpGlobalConfig>, // global MCP settings (not live instances)
    pub sessions: Arc<dyn SessionStore>,  // file-backed session persistence
    pub rag_defaults: RagDefaults,        // embedding model, chunk size, etc.
}

RequestContext — created per CLI invocation, per REPL turn, or per API request:

pub struct RequestContext {
    pub app: Arc<AppState>,               // borrows global state
    pub request_id: Uuid,
    pub mode: FrontendMode,               // Cli | Repl | Api
    pub cancel: CancellationToken,        // unified cancellation

    // per-request mutable state (was on Config)
    pub session: SessionHandle,
    pub convo: ConversationState,         // messages, last_message, tool_call_tracker
    pub agent_runtime: Option<AgentRuntime>, // supervisor, inbox, escalation
    pub overrides: Overrides,             // model, role, rag, dry_run, etc.
    pub auth: Option<AuthContext>,        // API-only; None for CLI/REPL
}

pub struct Overrides {
    pub role: Option<String>,
    pub model: Option<String>,
    pub rag: Option<RagConfig>,
    pub agent: Option<AgentSpec>,
    pub dry_run: bool,
    pub macro_mode: bool,
}

What changes for existing code

Every function that currently takes &GlobalConfig (i.e., Arc<RwLock<Config>>) and calls .read() / .write() gets refactored to take &AppState for reads and &mut RequestContext for mutations. The config.write().set_model(...) pattern becomes ctx.overrides.model = Some(...).
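A minimal, synchronous sketch of that migration pattern (OldConfig, AppState, Overrides, and the field names here are simplified stand-ins for the real types):

```rust
use std::sync::{Arc, RwLock};

// Old pattern (simplified): one shared, locked Config mutated in place.
struct OldConfig { model: String }
fn old_set_model(cfg: &Arc<RwLock<OldConfig>>, m: &str) {
    cfg.write().unwrap().model = m.to_string();
}

// New pattern: reads go through an immutable AppState; per-request
// mutations land on the caller-owned RequestContext.
struct AppState { default_model: String }
struct Overrides { model: Option<String> }
struct RequestContext { app: Arc<AppState>, overrides: Overrides }

fn effective_model(ctx: &RequestContext) -> &str {
    ctx.overrides.model.as_deref().unwrap_or(&ctx.app.default_model)
}

fn set_model(ctx: &mut RequestContext, m: &str) {
    // was: config.write().set_model(m)
    ctx.overrides.model = Some(m.to_string());
}
```

No lock is taken on the new path: concurrency comes from having many RequestContexts, not from sharing one mutable Config.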

REPL special case

The REPL keeps a long-lived RequestContext that persists across turns (just like today's Config singleton does). State-changing dot-commands (.model, .role, .session) mutate the REPL's own context. This preserves current behavior exactly.


2. Unified Dispatch: The Engine

Instead of start_directive() in main.rs and ask() in repl/mod.rs being separate code paths, both call one core function:

pub struct Engine {
    pub app: Arc<AppState>,
    pub agent_factory: Arc<dyn AgentFactory>,
}

impl Engine {
    pub async fn run(
        &self,
        ctx: &mut RequestContext,
        req: RunRequest,
        emitter: &dyn Emitter,
    ) -> Result<RunOutcome, CoreError> {
        // 1. Apply any CoreCommand (set role, model, session, etc.)
        // 2. Build Input from req.input + ctx (role messages, session history, RAG)
        // 3. Create LLM client from provider registry
        // 4. call_chat_completions[_streaming](), emitting events via emitter
        // 5. Tool result loop (recursive)
        // 6. Persist session updates
        // 7. Return outcome (session_id, message_id)
    }
}

pub struct RunRequest {
    pub input: UserInput,                  // text, files, media
    pub command: Option<CoreCommand>,      // normalized dot-command
    pub stream: bool,
}

pub enum CoreCommand {
    SetRole(String),
    SetModel(String),
    StartSession { name: Option<String> },
    StartAgent { name: String, variables: HashMap<String, String> },
    Continue,
    Regenerate,
    CompressSession,
    Info,
    // ... one variant per REPL dot-command
}
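A hypothetical sketch of the normalization step — REPL dot-command lines become CoreCommand values before reaching the Engine. The variant subset and parsing rules here are illustrative, not the real grammar:

```rust
// Illustrative subset of the CoreCommand enum from the plan above.
#[derive(Debug, PartialEq)]
enum CoreCommand {
    SetRole(String),
    SetModel(String),
    StartSession { name: Option<String> },
    Continue,
}

// Hypothetical parser: splits off the dot-command head, treats the
// remainder (if any) as the argument.
fn parse_dot_command(line: &str) -> Option<CoreCommand> {
    let mut parts = line.trim().splitn(2, ' ');
    let head = parts.next()?;
    let rest = parts.next().map(str::trim).filter(|s| !s.is_empty());
    match (head, rest) {
        (".role", Some(r)) => Some(CoreCommand::SetRole(r.to_string())),
        (".model", Some(m)) => Some(CoreCommand::SetModel(m.to_string())),
        (".session", name) => Some(CoreCommand::StartSession {
            name: name.map(String::from),
        }),
        (".continue", None) => Some(CoreCommand::Continue),
        _ => None,
    }
}
```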

How frontends use it

| Frontend | Context lifetime | How it calls Engine |
| --- | --- | --- |
| CLI | Single invocation, then exit | Creates RequestContext, calls engine.run() once, exits |
| REPL | Long-lived across turns | Keeps RequestContext, calls engine.run() per line; dot-commands become CoreCommand variants |
| API | Per HTTP request, but session persists | Loads RequestContext from SessionStore per request, calls engine.run(), persists back |

3. Output Abstraction: The Emitter Trait

The core never writes to stdout or formats JSON. It emits structured semantic events:

pub enum Event<'a> {
    Started { request_id: Uuid, session_id: Uuid },
    AssistantDelta(&'a str),              // streaming token
    AssistantMessageEnd { full_text: &'a str },
    ToolCall { name: &'a str, args: &'a str },
    ToolResult { name: &'a str, result: &'a str },
    Info(&'a str),
    Error(CoreError),
}

#[async_trait]
pub trait Emitter: Send + Sync {
    async fn emit(&self, event: Event<'_>) -> Result<(), EmitError>;
}

Three implementations

  • TerminalEmitter — wraps the existing SseHandler markdown_stream / raw_stream logic. Renders to the terminal with crossterm. Used by both CLI and REPL.
  • JsonEmitter — collects all events, returns a JSON response body at the end. Used by non-streaming API requests.
  • SseEmitter — converts each Event to an SSE frame, pushes into a tokio::sync::mpsc channel that axum streams to the client. Used by streaming API requests.
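A synchronous sketch of the collecting style JsonEmitter uses (the real trait is async and the event set is larger; JsonishEmitter and its hand-rolled JSON rendering are stand-ins):

```rust
use std::sync::Mutex;

// Illustrative subset of the Event enum from the plan above.
enum Event<'a> {
    AssistantDelta(&'a str),
    Info(&'a str),
}

trait Emitter {
    fn emit(&self, event: Event<'_>);
}

// Stand-in for JsonEmitter: buffers rendered events and produces a
// response body at the end instead of writing to a terminal.
struct JsonishEmitter {
    lines: Mutex<Vec<String>>,
}

impl Emitter for JsonishEmitter {
    fn emit(&self, event: Event<'_>) {
        let rendered = match event {
            Event::AssistantDelta(t) => format!("{{\"delta\":\"{}\"}}", t),
            Event::Info(t) => format!("{{\"info\":\"{}\"}}", t),
        };
        self.lines.lock().unwrap().push(rendered);
    }
}

impl JsonishEmitter {
    fn body(&self) -> String {
        format!("[{}]", self.lines.lock().unwrap().join(","))
    }
}
```

The core calls emit() the same way regardless of which implementation is behind the trait object; only the rendering differs.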

4. Session Isolation for API

Session IDs

UUID-based for API consumers. CLI/REPL keep human-readable names as aliases.

#[async_trait]
pub trait SessionStore: Send + Sync {
    async fn create(&self, alias: Option<&str>) -> Result<SessionHandle>;
    async fn open(&self, id: SessionId) -> Result<SessionHandle>;
    async fn open_by_name(&self, name: &str) -> Result<SessionHandle>;  // CLI/REPL compat
}

File layout

~/.config/loki/sessions/
  by-id/<uuid>/state.yaml       # canonical storage
  by-name/<name> -> <uuid>      # symlink or mapping file for CLI/REPL

Concurrency

Each SessionHandle holds a tokio::sync::Mutex so two concurrent API requests to the same session serialize properly. For v1 this is sufficient — no need for a database.
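The per-session locking property can be sketched with an in-memory store (a stand-in for the file-backed implementation; SessionState and the u64 id are simplified):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

type SessionId = u64;

#[derive(Default)]
struct SessionState { messages: Vec<String> }

// In-memory stand-in for SessionStore: every open() of the same id
// returns a handle over the same Mutex, so concurrent requests to
// one session serialize while different sessions proceed in parallel.
struct MemSessionStore {
    sessions: Mutex<HashMap<SessionId, Arc<Mutex<SessionState>>>>,
}

impl MemSessionStore {
    fn open(&self, id: SessionId) -> Arc<Mutex<SessionState>> {
        self.sessions
            .lock().unwrap()
            .entry(id)
            .or_insert_with(|| Arc::new(Mutex::new(SessionState::default())))
            .clone()
    }
}
```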


5. Tool Scope Isolation (formerly "Agent Isolation")

Correction: An earlier version of this document singled out agents as the owner of "live tool and MCP runtime." That was wrong. Loki allows MCP servers and tools to be configured at every RoleLike level — global, role, session, and agent — with resolution priority Agent > Session > Role > Global. Agents aren't uniquely coupled to MCP lifecycle; they're just the most visibly coupled scope in today's code.

The correct abstraction is ToolScope: every active RoleLike owns one. A ToolScope is a self-contained unit holding the resolved function declarations, live MCP runtime handles, and the tool-call tracker for whichever scope is currently on top of the stack.

Today's behavior (to match in v1)

McpRegistry::reinit() is already diff-based: given a new enabled-server list, it stops only the servers that are no longer needed, leaves still-needed ones alive, and starts only the missing ones. This is correct single-tenant behavior but the registry is a process-wide singleton, so two concurrent consumers with different MCP sets trample each other.

Target design

pub struct ToolScope {
    pub functions: Functions,              // resolved declarations for this scope
    pub mcp_runtime: McpRuntime,           // live handles to MCP processes
    pub tool_tracker: ToolCallTracker,     // per-scope call tracking
}

pub struct McpRuntime {
    servers: HashMap<String, Arc<McpServerHandle>>,  // live, ref-counted
}

pub struct McpFactory {
    shared_servers: Mutex<HashMap<McpServerKey, Weak<McpServerHandle>>>,
}

impl McpFactory {
    /// Produce a runtime with handles for the requested enabled servers.
    /// Shared across ToolScopes via Arc when configs match; isolated when they differ.
    pub async fn build_runtime(&self, enabled: &[String]) -> Result<McpRuntime>;
}

McpFactory lives on AppState. It does NOT hold any live servers itself — it holds weak refs so that when the last ToolScope using a given server drops its Arc, the process is torn down.

ToolScope lives on RequestContext. It replaces the current functions, tool_call_tracker, and (implicit) global mcp_registry fields. Every active scope — whether that's "just the REPL with its global MCP set" or "an agent with its own MCP set" — owns exactly one ToolScope.

Scope transitions

When a RoleLike activates or exits:

  1. Resolve the effective enabled-tool and enabled-MCP-server lists using priority Agent > Session > Role > Global.
  2. Ask McpFactory::build_runtime(enabled) for an McpRuntime. The factory reuses existing Arc<McpServerHandle>s where keys match; spawns new processes where they don't.
  3. Construct a new ToolScope with the runtime + resolved Functions.
  4. Assign it to ctx.tool_scope. The old ToolScope drops; any Arc<McpServerHandle>s with no other references shut down their processes.

This preserves today's diff-based behavior for single-tenant (REPL) and makes it correct for multi-tenant (API).
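Step 4's drop-driven teardown can be sketched with plain Arc refcounting (shut_down stands in for the real process teardown; no actual MCP process is involved):

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Stand-in for the real handle: Drop fires only when the last Arc
// goes away, which is when the subprocess would be torn down.
struct McpServerHandle { shut_down: Arc<AtomicBool> }

impl Drop for McpServerHandle {
    fn drop(&mut self) {
        self.shut_down.store(true, Ordering::SeqCst);
    }
}

struct ToolScope { servers: Vec<Arc<McpServerHandle>> }
```

Replacing ctx.tool_scope drops the old scope's Arcs; any server still referenced by the new scope (or another consumer) survives, and only orphaned servers shut down.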

Sharing vs isolation (the key property)

McpServerKey encodes server name + command + args + env vars. Two ToolScopes requesting the same key share the same Arc<McpServerHandle>. Two requesting different keys (e.g., different per-user API keys baked into the env) get separate processes. This gives us:

  • Isolation by default — different configs = different processes, no cross-tenant leakage
  • Sharing by coincidence — identical configs = one process, ref-counted
  • Clean cleanup — processes die automatically when the last scope releases them
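A sketch of the key's sharing property (the field set follows the text; BTreeMap is an assumption here, chosen to keep env ordering deterministic for Hash/Eq):

```rust
use std::collections::BTreeMap;

// Identical configs produce equal keys (one shared process);
// any difference — here an env var — produces a distinct key.
#[derive(Hash, PartialEq, Eq, Clone, Debug)]
struct McpServerKey {
    name: String,
    command: String,
    args: Vec<String>,
    env: BTreeMap<String, String>,
}
```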

Agent-specific state

Agents still own some state that's genuinely agent-only (not in ToolScope): the supervisor, inbox, escalation queue, optional todo list, sub-agent handles, and the parent/child tree. That state lives in an AgentRuntime:

pub struct AgentRuntime {
    pub spec: AgentSpec,
    pub rag: Option<Arc<Rag>>,                   // shared across sibling sub-agents
    pub supervisor: Supervisor,
    pub inbox: Arc<Inbox>,
    pub escalation_queue: Arc<EscalationQueue>,  // root-shared for user interaction
    pub todo_list: Option<TodoList>,             // present only when auto_continue: true
    pub self_agent_id: String,
    pub parent_supervisor: Option<Arc<Supervisor>>,
    pub current_depth: usize,
    pub auto_continue_count: usize,
}

Three things to notice in this shape:

  1. todo_list: Option<TodoList> — today's code eagerly allocates a TodoList::default() for every agent, but the todo tools and auto-continuation prompts are only exposed when auto_continue: true. Switching to Option lets us skip the allocation entirely for agents that don't opt in, and makes the "is this agent using todos?" question a type-level check rather than a config lookup. The semantics users see are unchanged.

  2. rag: Option<Arc<Rag>> — agent RAG is an Arc, not an owned Rag. Today, every sub-agent of the same type independently calls Rag::load() and deserializes its own copy of the embeddings from disk. That means a parent spawning 4 parallel siblings of the same agent type pays the deserialize cost 5 times and holds 5 copies of identical vectors in memory. Sharing via Arc fixes both.

  3. No mcp_runtime — MCP lives on ToolScope, not here. Agents get their tools through ctx.tool_scope like everyone else.

An AgentRuntime goes into ctx.agent_runtime in addition to the ToolScope — they're orthogonal concerns. An agent has both a ToolScope (its resolved tools + MCP) and an AgentRuntime (its supervision/messaging/RAG/todo state).

RAG Cache (unified for standalone + agent RAG)

RAG in Loki comes from exactly two places today:

  1. Standalone RAG, attached via the .rag <name> REPL command or the equivalent API call. Persists across role/session switches. Lives in ctx.rag: Option<Arc<Rag>>.
  2. Agent RAG, loaded from the documents: field of an agent's config.yaml when the agent is activated. Lives in ctx.agent_runtime.rag: Option<Arc<Rag>> for the agent's lifetime.

Roles and Sessions do not own RAG — the Role and Session structs have no RAG fields. This is true today and the refactor preserves it.

Since both standalone and agent RAGs are ultimately Arc<Rag> instances loaded from disk YAML files, a single cache can serve both. AppState holds one:

pub struct AppState {
    pub config: Arc<AppConfig>,
    pub vault: Arc<VaultService>,
    // ... other fields from section 1 (providers, tools, sessions, ...) ...
    pub mcp_factory: Arc<McpFactory>,
    pub rag_cache: Arc<RagCache>,
}

pub struct RagCache {
    entries: RwLock<HashMap<RagKey, Weak<Rag>>>,
}

#[derive(Hash, Eq, PartialEq, Clone, Debug)]
pub enum RagKey {
    Named(String),   // standalone RAG: rags/<name>.yaml
    Agent(String),   // agent-owned RAG: agents/<name>/rag.yaml
}

impl RagCache {
    /// Returns a shared Arc<Rag> for the given key. If another scope
    /// holds a live reference, returns that exact Arc. Otherwise loads
    /// from disk, stores a Weak for future sharing, returns a fresh Arc.
    /// Concurrent first-load is serialized via per-key locks.
    pub async fn load(&self, key: &RagKey) -> Result<Option<Arc<Rag>>>;

    /// Invalidates the cache entry. Called by rebuild_rag / edit_rag_docs
    /// so the next load reads from disk. Does NOT affect existing Arc
    /// holders — they keep their old Rag until they drop it.
    pub fn invalidate(&self, key: &RagKey);
}

Why the enum: agent RAGs and standalone RAGs live at different paths on disk and could theoretically have overlapping names (an agent called "docs" and a standalone rag called "docs"). Keeping them in distinct namespaces avoids collisions and keeps the cache lookups unambiguous.

Why Weak: we don't want the cache to pin RAGs in memory forever. If no scope holds an Arc<Rag> for key X, the Weak becomes dangling, and the next load() reads fresh. "Share while in use, drop when nobody needs it" without a manual reaper.

Concurrency wrinkle: if two consumers request the same key at exactly the same time and neither finds a live entry, both will race to load from disk. Fix with a per-key tokio::sync::Mutex or tokio::sync::OnceCell<Arc<Rag>> (the async-aware counterpart of once_cell) — the second caller blocks briefly and receives the shared Arc.

Invalidation: both rebuild_rag and edit_rag_docs call invalidate() with the key corresponding to whichever RAG was being operated on (standalone or agent-owned). Existing Arc<Rag> holders keep their old reference until they drop it — which is the correct behavior, since you don't want a running request to suddenly see a partially-rebuilt index mid-execution.
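A synchronous sketch of the cache's load/invalidate semantics (the real version is async, keyed by RagKey, and loads from disk; the load_from_disk closure and string key are stand-ins). Holding one map-wide lock, as here, also serializes first loads, which is one crude way to handle the concurrency wrinkle above:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, Weak};

struct Rag { name: String }

// Weak entries mean the cache never pins a Rag in memory: once the
// last Arc drops, upgrade() fails and the next load reads fresh.
struct RagCache {
    entries: Mutex<HashMap<String, Weak<Rag>>>,
}

impl RagCache {
    fn load(&self, key: &str, load_from_disk: impl Fn() -> Rag) -> Arc<Rag> {
        let mut entries = self.entries.lock().unwrap();
        if let Some(live) = entries.get(key).and_then(Weak::upgrade) {
            return live; // share the Arc another scope already holds
        }
        let fresh = Arc::new(load_from_disk());
        entries.insert(key.to_string(), Arc::downgrade(&fresh));
        fresh
    }

    fn invalidate(&self, key: &str) {
        // Existing Arc holders are untouched; only future loads re-read.
        self.entries.lock().unwrap().remove(key);
    }
}
```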

Where RAG attaches in RequestContext

Two distinct slots, two distinct purposes, one shared cache:

pub struct RequestContext {
    // ... other fields ...
    pub rag: Option<Arc<Rag>>,            // standalone RAG from `.rag <name>` or API equivalent
    pub agent_runtime: Option<AgentRuntime>,  // contains its own `rag: Option<Arc<Rag>>` when agent owns one
}

When resolving "what RAG should this request use", the engine checks ctx.agent_runtime.rag first (agent-owned takes precedence during an agent turn), then falls back to ctx.rag (the user's standalone selection). If neither is set, no RAG context is injected into the prompt.
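That precedence check reduces to a one-liner (resolve_rag is a hypothetical helper; Rag is a placeholder type):

```rust
use std::sync::Arc;

struct Rag;

// Agent-owned RAG wins during an agent turn; otherwise fall back to
// the user's standalone selection; otherwise no RAG is injected.
fn resolve_rag(
    agent_rag: Option<&Arc<Rag>>,
    standalone: Option<&Arc<Rag>>,
) -> Option<Arc<Rag>> {
    agent_rag.or(standalone).cloned()
}
```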

Behavior preservation: today's code uses a single Config.rag slot that's overwritten by whichever action touched it most recently — use_rag and use_agent both clobber it. Exiting an agent leaves the overwrite in place; the user has to re-run .rag <name> to restore their standalone RAG. The new two-slot design gives us the opportunity to fix that (save ctx.rag into the AgentRuntime on activation, restore on exit) but Phase 1 preserves today's clobber-and-forget behavior to keep the refactor mechanical. The improvement is flagged as a Phase 2+ enhancement.

Sub-agent spawning

Each child agent gets its own RequestContext forked from the parent's Arc<AppState>. That means each child gets:

  • Its own ToolScope built from its agent.yaml's mcp_servers + global_tools, produced by McpFactory
  • Its own AgentRuntime with a fresh supervisor, a fresh inbox, depth = parent.depth + 1
  • A parent_supervisor reference pointing back at the parent's supervisor for escalation/messaging
  • A shared root_escalation_queue cloned by Arc from the parent's runtime (one queue, one human at the root)
  • A shared rag: Option<Arc<Rag>> via AppState.rag_cache.load(RagKey::Agent(child_agent_name)) — if the parent already holds a strong ref, the cache returns the same Arc and no disk I/O happens

Because each child has its own ToolScope, concurrent sub-agents can run with different MCP server sets simultaneously — something today's singleton registry cannot do. The McpFactory pool handles overlap: if child A and child B both need github with matching keys, they share one github process via Arc.

Because sibling sub-agents of the same type share one Arc<Rag> through the unified cache, RAG embeddings are loaded at most once per (standalone or agent) name per process, regardless of how many siblings or concurrent API sessions reference the same name. The first holder keeps the embeddings warm for everyone else's lifetime, and they drop together once nobody holds a reference.

MCP Lifecycle Policy (pooling and idle timeout)

McpFactory needs an eviction policy so long-running server processes don't accumulate idle MCP subprocesses indefinitely. The design is a two-layer scheme:

pub struct McpFactory {
    active: Mutex<HashMap<McpServerKey, Weak<McpServerHandle>>>,
    idle: Mutex<HashMap<McpServerKey, IdleEntry>>,
    config: McpFactoryConfig,
}

struct IdleEntry {
    handle: Arc<McpServerHandle>,
    idle_since: Instant,
}

pub struct McpFactoryConfig {
    pub idle_timeout: Duration,              // how long idle servers stay warm
    pub cleanup_interval: Duration,          // how often the reaper runs
    pub max_idle_servers: Option<usize>,     // LRU cap (None = unbounded)
}

Layer 1 — active references via Arc. Scopes currently using a server hold Arc<McpServerHandle>. Standard Rust refcounting. Any live reference keeps the process running, regardless of timers.

Layer 2 — idle grace period via LRU eviction. When the last active scope drops its Arc, a custom Drop impl on the handle moves it into the idle pool with a timestamp instead of tearing it down immediately. A background reaper task wakes on cleanup_interval and evicts entries whose idle time exceeds idle_timeout, calling cancel().await on the actual MCP subprocess.
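The reaper's eviction pass can be sketched with a logical clock (u64 "seconds") instead of Instant so the example stays deterministic; the real reaper would call cancel().await on each evicted handle, and IdlePool stands in for the factory's idle map:

```rust
use std::collections::HashMap;

struct IdleEntry { idle_since: u64 }

struct IdlePool {
    entries: HashMap<String, IdleEntry>,
    idle_timeout: u64,
}

impl IdlePool {
    /// Evict (and notionally tear down) entries idle past the timeout.
    fn reap(&mut self, now: u64) -> Vec<String> {
        let timeout = self.idle_timeout;
        let expired: Vec<String> = self.entries
            .iter()
            .filter(|(_, e)| now.saturating_sub(e.idle_since) > timeout)
            .map(|(k, _)| k.clone())
            .collect();
        for key in &expired {
            self.entries.remove(key); // real code: handle.cancel().await
        }
        expired
    }
}
```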

Acquisition order on every scope transition:

impl McpFactory {
    pub async fn acquire(&self, key: &McpServerKey) -> Result<Arc<McpServerHandle>> {
        // 1. Someone else is actively using it — share.
        if let Some(arc) = self.try_reuse_active(key) { return Ok(arc); }
        // 2. Sitting in the idle pool — revive it, zero startup cost.
        if let Some(arc) = self.revive_from_idle(key) { return Ok(arc); }
        // 3. Neither — spawn fresh.
        self.spawn_new(key).await
    }
}

Sensible defaults by deployment mode:

| Mode | idle_timeout default | Rationale |
| --- | --- | --- |
| CLI one-shot | N/A (process exits, everything dies) | No pooling needed |
| REPL | 0 (immediate drop) | Matches today's reactive reinit behavior |
| API server | 5 minutes | Absorbs burst traffic, caps stale resources |

These are defaults, not mandates. Users should be able to override globally and per-server:

# config.yaml
mcp_pool:
  idle_timeout_seconds: 300
  cleanup_interval_seconds: 30
  max_idle_servers: 50
// functions/mcp.json
{
  "github":     { "command": "...", "idle_timeout_seconds": 900 },
  "filesystem": { "command": "...", "idle_timeout_seconds": 60 }
}

Optional health checks. While a handle sits in the idle pool, the reaper can optionally ping it via tools/list. If a server has crashed or become unresponsive, it's evicted immediately. Without this, a stale idle entry would make the first real request after revival fail. Worth implementing, but not strictly required for v1.

Graceful shutdown. On server shutdown, drain active scopes (let in-flight LLM calls complete or cancel via token), then tear down the idle pool. Give it a bounded drain timeout before force-killing. Especially important for MCP servers holding external transactions or locks.

Per-tenant isolation. McpServerKey includes env vars in its hash, so two tenants with different GITHUB_TOKENs get distinct keys and therefore distinct processes. Zero cross-tenant leakage by construction.

Phasing

Phase 1 ships McpFactory without the pool — just acquire() that always spawns fresh, Drop that always tears down. This is correct but inefficient. Phase 5 adds the idle pool, reaper task, health checks, and configuration knobs. Splitting it this way keeps Phase 1 focused on the state split (its actual goal) and Phase 5 focused on the pooling optimization (where it has a clear performance target: warm-path MCP tool calls should have near-zero overhead).

Lifecycle summary

| Frontend | ToolScope lifetime | AgentRuntime lifetime | RAG lifetime |
| --- | --- | --- | --- |
| CLI one-shot | One invocation | One invocation (if --agent) | One invocation |
| REPL | Long-lived; rebuilt on .role / .session / .agent / .set enabled_mcp_servers | Lives from .agent X until .exit agent | Standalone RAG set via .rag <name> persists across role/session switches; agent RAG lives as long as the AgentRuntime; both come from the shared RagCache |
| API session | Lives while session is "warm"; rebuilt when client changes role/session/agent | Lives while session is "warm" | Same as REPL; RagCache shares Arc<Rag>s across concurrent sessions using the same RAG name |
| Sub-agent (any frontend) | Lives for the sub-agent task | Lives for the sub-agent task | Shared via Arc with parent and siblings through RagCache |

6. Cross-Cutting Concerns

| Concern | Pattern | CLI | REPL | API |
| --- | --- | --- | --- | --- |
| Errors | Core returns CoreError enum; frontends map | render_error() to stderr | render_error() to terminal | { "error": { "code": "...", "message": "..." } } JSON |
| Cancellation | CancellationToken in RequestContext | Ctrl-C handler triggers token | Ctrl-C triggers token | Client disconnect / request timeout triggers token |
| Auth | Middleware sets AuthContext on RequestContext | None (local user) | None (local user) | Bearer token / API key validated by axum middleware |
| Tracing | tracing::Span per request with request_id, session_id, mode | Log to file | Log to file | Log to file + structured JSON logs |

Error type

pub enum CoreError {
    InvalidRequest { msg: String },
    NotFound { msg: String },
    Unauthorized { msg: String },
    Forbidden { msg: String },
    Timeout { msg: String },
    Cancelled,
    Provider { msg: String },
    Tool { msg: String },
    Io { msg: String },
}
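A hypothetical mapping from CoreError to HTTP status codes and error-body codes for the API frontend — the specific status choices below are assumptions, not part of the plan:

```rust
// Illustrative subset of the CoreError enum from the plan above.
enum CoreError {
    InvalidRequest { msg: String },
    NotFound { msg: String },
    Unauthorized { msg: String },
    Cancelled,
    Provider { msg: String },
}

// Hypothetical status mapping; the string becomes the "code" field
// of the JSON error body shown in the cross-cutting table.
fn http_status(err: &CoreError) -> (u16, &'static str) {
    match err {
        CoreError::InvalidRequest { .. } => (400, "invalid_request"),
        CoreError::NotFound { .. } => (404, "not_found"),
        CoreError::Unauthorized { .. } => (401, "unauthorized"),
        CoreError::Cancelled => (499, "cancelled"), // nginx-style client-closed
        CoreError::Provider { .. } => (502, "provider_error"),
    }
}
```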

Cancellation

Use a CancellationToken in RequestContext. The core checks it via tokio::select! around long awaits (LLM stream, tool execution, MCP IO).

  • CLI/REPL: Ctrl-C handler triggers token.
  • API: axum provides disconnect detection for SSE/streaming; when the client drops, cancel the token.
  • Timeouts: set deadline and translate to token cancellation.

Auth (API-only initially)

axum middleware authenticates (API key / bearer token), builds AuthContext, stores in request extensions, then the handler copies it into RequestContext. Core enforces policy only when executing sensitive operations (tools, filesystem, vault).

pub struct AuthContext {
    pub subject: String,
    pub scopes: Vec<String>,
}
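A minimal sketch of the enforcement check the core might run before a sensitive operation; scope names like tools:exec are assumptions:

```rust
struct AuthContext {
    subject: String,
    scopes: Vec<String>,
}

// None means a local CLI/REPL user: no policy is enforced.
// Some(auth) means an API caller: the operation's scope must be granted.
fn can_use(auth: Option<&AuthContext>, required: &str) -> bool {
    match auth {
        None => true,
        Some(a) => a.scopes.iter().any(|s| s == required),
    }
}
```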

7. API Endpoint Design

POST   /v1/completions                    # one-shot prompt (no session)
POST   /v1/sessions                       # create session
POST   /v1/sessions/:id/completions       # prompt within session
DELETE /v1/sessions/:id                   # close session
POST   /v1/sessions/:id/agent             # activate agent on session
DELETE /v1/sessions/:id/agent             # deactivate agent
POST   /v1/sessions/:id/role              # set role on session
POST   /v1/sessions/:id/rag               # attach RAG to session
GET    /v1/models                         # list available models
GET    /v1/agents                         # list available agents
GET    /v1/roles                          # list available roles

Request body for completions

{
  "prompt": "Explain TCP handshake",
  "model": "openai:gpt-4o",
  "stream": true,
  "files": ["path/to/doc.pdf"],
  "role": "explain"
}

8. Implementation Phases

| Phase | Scope | Effort | Risk |
| --- | --- | --- | --- |
| Phase 1: Extract AppState | Split Config into AppState (global) + per-request state. Keep CLI/REPL working exactly as before. No API yet. | ~1-2 weeks | Medium — touching every file that uses GlobalConfig |
| Phase 2: Introduce Engine + Emitter | Unify start_directive() and ask() behind Engine::run(). Create TerminalEmitter. CLI/REPL now call Engine. | ~1 week | Low — refactoring existing paths |
| Phase 3: SessionStore abstraction | Extract session persistence behind trait. Add UUID-based sessions. CLI/REPL still use name-based aliases. | ~3-5 days | Low |
| Phase 4: REST API server | Add --serve flag. axum handlers that create RequestContext, call Engine::run(), return JSON/SSE. Basic auth middleware. | ~1-2 weeks | Low — clean layer on top of Engine |
| Phase 5: Agent isolation | Move agent runtime into RequestContext. AgentFactory creates isolated runtimes per session. | ~1 week | Medium — MCP server lifecycle mgmt |
| Phase 6: Production hardening | Rate limiting, proper auth, request validation, health checks, graceful shutdown, deployment configs. | ~1 week | Low |

Total estimate: ~5-7 weeks for a production-ready v1.

Key Risk: Phase 1

Phase 1 is the hardest and riskiest — it touches nearly every module. The mitigation is to do it incrementally: first add AppState alongside existing Config, then migrate callers module by module, then remove the old GlobalConfig type alias. Tests should pass at every intermediate step.


Key Design Decisions & Trade-offs

  1. Eliminates the singleton mutation bottleneck: concurrency becomes "multiple RequestContexts" rather than fighting over RwLock<Config>.
  2. Preserves current behavior: REPL can keep "state-changing commands" by mutating its own long-lived RequestContext + persisted SessionState.
  3. Streaming becomes portable: terminal rendering, JSON, and SSE are just different Emitters over the same event stream.
  4. Agent/MCP isolation is explicit: prevents cross-session conflicts by construction.

Watch Out For

  1. Persisted vs in-memory drift: decide which fields live in SessionState vs ConversationState; persist only what must survive process restarts.
  2. Per-session concurrency semantics: either serialize requests per session (simplest) or carefully merge message histories; v1 should serialize.
  3. MCP process lifecycle: if you keep MCP servers alive across requests, tie them to a session runtime and clean them up on session close/TTL.

Future Considerations

  1. Swap file store behind SessionStore with sqlite without changing core.
  2. Add a stable public API schema for events so clients can render rich tool-call UIs.
  3. Actor model (one tokio task per session receiving commands via mpsc) for simplified session+agent lifetime management.