docs: created docs for the new graph agent system

2026-05-18 15:37:53 -06:00
parent f40cc8af27
commit 6e08cfd984
3 changed files with 966 additions and 0 deletions
@@ -4,6 +4,12 @@ Agents in Loki follow the same style as OpenAI's GPTs. They consist of 3 parts:
 * [RAG](RAG) - Pre-built knowledge bases specifically for the agent
 * [Function Calling](Tools#tools) ([#2](MCP-Servers)) - Extends the functionality of the LLM through custom functions it can call

+> **Looking for declarative, multi-step workflows?** See
+> [Graph Agents](Graph-Agents): a YAML-driven workflow engine where each step
+> (LLM call, script, user prompt, child-agent spawn) is its own typed node.
+> Useful when an agent's behavior follows a fixed shape rather than a single
+> open-ended LLM loop.
+
 ![Agent example](./images/agents/sql.gif)

 Agent configuration files are stored in the `agents` subdirectory of your Loki configuration directory. The location of
@@ -738,3 +744,8 @@ Loki comes packaged with some useful built-in agents:
 * `oracle`: An agent for high-level architecture, design decisions, and complex debugging
 * `sisyphus`: A powerhouse orchestrator agent for writing complex code and acting as a natural language interface for your codebase (similar to ClaudeCode, Gemini CLI, Codex, or OpenCode). Uses sub-agent spawning to delegate to `explore`, `coder`, and `oracle`.
 * `sql`: A universal SQL agent that enables you to talk to any relational database in natural language
+
+Loki writes these built-in agents to your agents directory on first run and never overwrites them afterward, so any
+edits you make to them are preserved across Loki updates. To discard your local changes and reinstall the built-in
+agents from the current Loki build, run `loki --install agents` (or `.install agents` in the REPL). Agents you created
+yourself are not affected.
@@ -0,0 +1,950 @@
+Graph-based agents are a declarative, YAML-driven workflow engine layered on
+top of Loki's existing agent system. Where a normal [agent](Agents) runs as a
+single LLM loop driven by tool calls, a **graph agent** is a directed graph of
+typed nodes. Each node performs one well-defined step (call an LLM, run a
+script, ask the user a question, spawn a child agent, etc.) and routes to the
+next node based on its result.
+
+Graph agents are best for workflows that:
+
+- Have a fixed shape (e.g. parse -> query -> grade -> synthesize -> verify)
+- Mix LLM calls with deterministic steps (scripts, user prompts)
+- Need explicit human-in-the-loop checkpoints
+- Benefit from per-step model / tool / temperature overrides
+
+If you just want an agent that takes a goal and figures out the steps on its
+own, stick with a regular [agent](Agents).
+
+---
+
+# Directory Structure
+
+A graph agent is defined by a single `graph.yaml`. It holds *both* the
+agent-level config (model, tools, MCP servers) *and* the workflow:
+
+```
+<loki-config-dir>/agents
+    └── my-graph-agent
+        ├── graph.yaml           # agent config + workflow definition
+        ├── tools.sh             # optional custom tools
+        ├── <rag-node-id>.yaml   # auto-built knowledge base for a rag node
+        └── scripts/             # optional script-node implementations
+            ├── decide.py
+            └── verify.py
+```
+
+`<rag-node-id>.yaml` files are generated by Loki at agent load time - one
+per `rag` node - and should not be hand-edited.
+
+An agent directory must contain **either** a `config.yaml` (a normal
+LLM-loop agent; see [Agents](Agents)) **or** a `graph.yaml` (a graph
+agent), never both. The presence of `graph.yaml` is what marks an agent
+as a graph agent; when Loki runs it, execution is driven entirely by the
+graph.
+
+**Both files present is an error.** If an agent directory contains both
+`config.yaml` and `graph.yaml`, Loki refuses to load it and tells you to
+remove one. Pick the model that fits: `config.yaml` for an open-ended
+LLM-loop agent, `graph.yaml` for a fixed-shape workflow.
+
+---
+
+# graph.yaml Top-Level Fields
+
+```yaml
+name: my-graph-agent
+description: |
+  Plain prose describing what the workflow does.
+version: "1.0"
+
+# --- agent-level config ---
+model: anthropic:claude-sonnet-4-6   # default model for llm nodes
+temperature: 0.0                     # default sampling temperature
+top_p: null                          # default sampling top-p
+global_tools:                        # global tools available to nodes
+  - web_search_loki.sh
+mcp_servers:                         # MCP servers available to nodes
+  - pubmed-search
+conversation_starters:               # suggested prompts in the UI
+  - "Look up LOINC 2160-0"
+
+settings:
+  max_loop_iterations: 100     # PER-NODE visit cap; default 100 (see below)
+  log_state_snapshots: true    # log state JSON before each node executes
+  validate_before_run: true    # run the graph validator on startup
+  timeout: 600                 # optional overall timeout in seconds
+
+initial_state:                 # optional seed state for the run
+  topic: "auth"
+
+start: parse_input             # required: ID of the first node to run
+
+nodes:
+  parse_input: { ... }
+  ...
+```
+
+- **`version`:** Currently only `"1.0"` is accepted by the parser. Anything
+  else fails at startup. This is the *graph schema* version, not your
+  agent's version.
+- **Agent-level config:** The fields `model`, `temperature`, `top_p`,
+  `global_tools`, `mcp_servers`, and `conversation_starters` are all optional.
+  These are the same fields a normal agent's `config.yaml` carries; in a
+  graph agent they live at the top of `graph.yaml` instead. `model` /
+  `temperature` / `top_p` act as the defaults for `llm` nodes that don't
+  set their own. `global_tools` and `mcp_servers` define the tool universe
+  that an `llm` node's `tools:` whitelist selects from (a node with no
+  `tools:` field gets none of them).
+- **`can_spawn_agents` is derived, not declared.** A graph agent can spawn
+  child agents iff its graph contains at least one `agent` node. You don't
+  set a flag. The `agent` node's presence *is* the declaration.
+- **`max_loop_iterations`:** This is a **per-node visit cap**, not a total
+  graph-step cap. If the same node id is entered more than this many times,
+  execution aborts with `Node 'X' visited N times (max_loop_iterations=...)`.
+  Default: 100.
+- **`timeout`:** Wall-clock cap on the entire graph run. The executor
+  checks this between every node transition; nodes that block longer than
+  the timeout will still finish before the check fires.
+- **`initial_state`:** A JSON-compatible object. Values are seeded into
+  state before any node runs and are referenced from any node via `{{key}}`
+  templates.
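
The per-node nature of `max_loop_iterations` can be illustrated with a minimal sketch. This is not Loki's implementation: the `record_visit` helper and the `VisitCapExceeded` exception are hypothetical names, and the error text only imitates the message format quoted above.

```python
from collections import Counter


class VisitCapExceeded(RuntimeError):
    """Raised when one node id is entered more often than the cap allows."""


def record_visit(visits: Counter, node_id: str, max_loop_iterations: int = 100) -> None:
    # Count this entry, then compare against the cap for this node only.
    visits[node_id] += 1
    if visits[node_id] > max_loop_iterations:
        raise VisitCapExceeded(
            f"Node '{node_id}' visited {visits[node_id]} times "
            f"(max_loop_iterations={max_loop_iterations})"
        )


visits = Counter()
for _ in range(100):
    record_visit(visits, "grade")       # 100 entries of one node are allowed
record_visit(visits, "synthesize")      # a different node has its own counter
```

In this sketch a further entry of `grade` would raise, while other nodes are unaffected, which is the difference between a per-node cap and a total step cap.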
+
+### `{{initial_prompt}}`: Automatically Seeded
+
+When Loki invokes a graph agent with a user prompt (whether from the
+command line `loki -a my-agent "what is X?"`, from the REPL, or from a
+parent agent that spawned it as a sub-agent), the dispatcher automatically
+seeds the prompt text into state under the key **`initial_prompt`** before
+any node runs.
+
+This means every graph agent's first node can reference the user's request
+via `{{initial_prompt}}`:
+
+```yaml
+parse_input:
+  id: parse_input
+  type: llm
+  prompt: "{{initial_prompt}}"     # the user's command-line / REPL text
+  ...
+```
+
+You do not need to (and should not) put `initial_prompt` in `initial_state`;
+the dispatcher overwrites it.
+
+---
+
+# Node Types
+
+There are seven node types: **agent**, **script**, **approval**, **input**,
+**llm**, **rag**, and **end**. Every node has these common fields:
+
+```yaml
+my_node:
+  id: my_node               # must match the map key
+  type: <one of the seven>
+  description: optional      # free-form
+  next: another_node         # optional default next node; semantics vary per type
+```
+
+The `next` field defines the default routing edge. Node types interpret it
+differently (some types ignore it in favor of internal routing; see each type
+below).
+
+---
+
+## agent
+
+Spawns a Loki sub-agent and waits for it to finish. This is how a graph agent
+delegates a sub-goal to a fully autonomous Loki agent (with its own tool loop
+and configuration).
+
+```yaml
+research_topic:
+  id: research_topic
+  type: agent
+  agent: deep-researcher          # name of an existing Loki agent
+  prompt: "Research {{topic}}"    # interpolated against state
+  timeout: 600                    # optional, in seconds (default 300)
+  state_updates:
+    findings: "{{output}}"
+  output_schema: { ... }          # optional, see "Structured Output" below
+  next: render
+```
+
+- **`agent`:** Name of the child agent to spawn. Must exist in
+  `<loki-config-dir>/agents/`.
+- **`prompt`:** The user message sent to the child agent. Templated against
+  the current graph state.
+- **`timeout`:** Hard wall-clock cap. If the child agent exceeds it, the
+  whole graph fails (no built-in fallback path on agent nodes).
+- **`state_updates`:** Map of `state_key: "{{template}}"`. The child agent's
+  final text is available inside this map as `{{output}}`.
+
+---
+
+## script
+
+Runs a Bash, Python, or TypeScript script and merges its JSON-object stdout
+into state. Script files live under the agent's `scripts/` directory.
+
+**Supported extensions and runtimes**:
+
+| Extension | Runtime invoked            | Notes                                   |
+|-----------|----------------------------|-----------------------------------------|
+| `.sh`     | `bash <script>`            |                                         |
+| `.py`     | `python3 <script>`         | not `python`. Must be Python 3          |
+| `.ts`     | `npx tsx <script>`         | requires Node + `tsx` available on PATH |
+
+`.js` / `.mjs` / other extensions are **not** supported. The shebang line
+inside the script is not used for script-node dispatch (it is for normal
+custom-tools); the file extension is the source of truth.
+
+```yaml
+route_after_parse:
+  id: route_after_parse
+  type: script
+  script: scripts/route_after_parse.py
+  timeout: 30                     # seconds, default 30
+  fallback: handle_error          # optional: where to route on script failure
+  state_updates:                  # applied after stdout merge
+    last_run: "{{some_value}}"
+```
+
+The script receives the current state in two forms; use whichever fits:
+
+| Env var            | Contents                                                      |
+|--------------------|---------------------------------------------------------------|
+| `GRAPH_STATE`      | Inline JSON when serialized state is <= 32 KiB                |
+| `GRAPH_STATE_FILE` | Path to a temp JSON file when serialized state exceeds 32 KiB |
+
+Exactly one of the two is set per script invocation; **always check both**. The temp
+file (when used) is cleaned up automatically after the graph finishes.
+
+The script must print a single JSON object on stdout. All keys merge into
+state; the reserved `_next` key is extracted and overrides the default `next`
+routing.
+
+```python
+#!/usr/bin/env python3
+import json, os
+
+def load_state():
+    if path := os.environ.get("GRAPH_STATE_FILE"):
+        with open(path) as f:
+            return json.load(f)
+    return json.loads(os.environ.get("GRAPH_STATE", "{}"))
+
+state = load_state()
+codes = (state.get("loinc_codes") or "").strip()
+next_node = "query_db" if codes else "ask_for_code"
+print(json.dumps({"_next": next_node, "trimmed_codes": codes}))
+```
+
+**Tolerant-fail**: if the script exits non-zero or produces invalid JSON, the
+node routes to `fallback` (if set) or to `next` (if set). Without either,
+the graph errors.
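
The stdout merge, the `_next` override, and the tolerant-fail rule can be summarized in one function. This is a sketch under stated assumptions, not the executor's actual code: `apply_script_result` and its parameters are hypothetical, and it treats any non-object JSON as invalid output.

```python
import json


def apply_script_result(state, stdout, exit_code, next_node=None, fallback=None):
    """Return (new_state, next_node_id) for one script-node run."""
    try:
        if exit_code != 0:
            raise ValueError(f"script exited with status {exit_code}")
        result = json.loads(stdout)          # JSONDecodeError is a ValueError
        if not isinstance(result, dict):
            raise ValueError("stdout must be a single JSON object")
    except ValueError:
        # Tolerant-fail: prefer `fallback`, then `next`, otherwise error out.
        target = fallback or next_node
        if target is None:
            raise RuntimeError("script failed and neither fallback nor next is set")
        return state, target

    # `_next` is reserved: it is removed from the payload and used for routing.
    override = result.pop("_next", None)
    target = override or next_node
    if target is None:
        raise RuntimeError("script output has no _next and the node has no next")
    return {**state, **result}, target
```

Any `state_updates` declared on the node would be applied after this merge, as noted in the YAML example above.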
+
+---
+
+## approval
+
+Prompts the user with a question and a list of options, then routes based on
+their answer. This is the human-in-the-loop checkpoint.
+
+```yaml
+approve:
+  id: approve
+  type: approval
+  question: |
+    Final report:
+    {{report}}
+
+    Approve?
+  options:
+    - "yes"
+    - "no"
+  routes:
+    "yes": end_accepted
+    "no": end_rejected
+  on_other: clarify                # Required - see below
+  state_updates:
+    decision: "{{choice}}"
+```
+
+### The `on_other` field
+
+This field is **required** and easy to miss. Loki's `user__ask` tool *always*
+gives the user a "type your own answer" option in addition to the listed
+options. There is no way to disable this. Without `on_other`, a user who
+types something other than the listed options would crash the graph at
+runtime.
+
+`on_other` says **where to route when the user's answer does not match any
+`routes` key**. The free-form text they typed is available downstream via
+the `{{choice}}` template variable inside `state_updates`.
+
+Common patterns:
+
+- **Free-form means "I want to clarify"** -> `on_other: clarify_node`
+  where `clarify_node` is an `input` or `llm` node that processes their text.
+- **Free-form means "rejection by default"** -> `on_other: end_rejected`.
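
As a rough sketch of the routing rule, an approval answer either matches a `routes` key or falls through to `on_other`. The `route_approval` function is a hypothetical illustration; exact, case-sensitive matching is an assumption here, not something the text above confirms.

```python
def route_approval(choice, routes, on_other):
    """Pick the next node id for an approval answer."""
    # Listed options route through `routes`; anything else goes to `on_other`.
    return routes.get(choice, on_other)


routes = {"yes": "end_accepted", "no": "end_rejected"}
listed = route_approval("yes", routes, on_other="clarify")
free_form = route_approval("only if section 2 is fixed", routes, on_other="clarify")
```

Here `listed` resolves to `end_accepted` and `free_form` to `clarify`; in both cases the answer text would still be available to `state_updates` as `{{choice}}`.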
+
+---
+
+## input
+
+Collects a free-form string from the user.
+
+```yaml
+ask_for_code:
+  id: ask_for_code
+  type: input
+  question: "Enter a LOINC code (e.g. 6690-2):"
+  default: "{{last_used_code}}"   # optional, interpolated against state
+  validation: "len(input) > 0"    # optional, see below
+  state_updates:
+    loinc_code: "{{input}}"
+  next: query_db
+```
+
+- **`default`:** If the user submits an empty response, this template is
+  used instead. `default` is interpolated against state on its own,
+  separately from `question` (which is also templated).
+- **`validation`:** A length predicate of the form
+  `len(input) <op> <integer>`, where `<op>` is `>`, `>=`, `<`, `<=`, or `==`.
+  This is a deliberately narrow grammar; regex / type / range validation are
+  not yet supported. If validation fails, the node fails (no fallback).
+- The user's text is exposed to `state_updates` as `{{input}}`.
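
The narrow `validation` grammar is small enough to sketch in a few lines. The parser below is an assumption-laden illustration, not Loki's parser: `check_validation` is a hypothetical name, and tolerance for extra whitespace and the restriction to non-negative integers are guesses.

```python
import operator
import re

_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}
# Two-character operators come first so ">=" is not read as ">".
_PATTERN = re.compile(r"^\s*len\(input\)\s*(>=|<=|==|>|<)\s*(\d+)\s*$")


def check_validation(expr: str, user_input: str) -> bool:
    """Evaluate a `len(input) <op> <integer>` predicate against the user's text."""
    match = _PATTERN.match(expr)
    if match is None:
        raise ValueError(f"unsupported validation expression: {expr!r}")
    op, bound = match.group(1), int(match.group(2))
    return _OPS[op](len(user_input), bound)
```

For example, `check_validation("len(input) > 0", "6690-2")` passes, while an empty string fails the same predicate.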
+
+---
+
+## llm
+
+A one-shot LLM call with an optional bounded tool-call loop. Unlike `agent`
+nodes, this does NOT spawn a sub-agent; it runs in a fresh isolated context
+with a caller-supplied system prompt and user prompt. Tool access is strictly
+opt-in: an `llm` node gets **no tools at all** unless its `tools` field
+explicitly lists them (see below).
+
+```yaml
+grade_research:
+  id: grade_research
+  type: llm
+  instructions: |               # optional system prompt
+    You decide whether research is needed for {{topic}}.
+  prompt: |                     # required user prompt
+    Research context:
+    {{research_text}}
+
+    Reply with YES or NO.
+  tools: []                     # see below
+  model: anthropic:haiku        # optional override
+  temperature: 0.0
+  top_p: null
+  max_attempts: 1               # transient-error retries (default 1)
+  max_iterations: 10            # tool-call-loop turn cap (default 10)
+  fallback: skip                # routes here if all attempts fail
+  state_updates:
+    grade: "{{output}}"
+  output_schema: { ... }        # optional, see "Structured Output" below
+  timeout: 120                  # optional; node wall-clock cap in seconds (unset = no timeout)
+  next: synthesize
+```
+
+### The `tools` field (whitelist)
+
+The `tools` field is a strict opt-in whitelist: an `llm` node receives
+**only** the tools it explicitly lists, never the agent's full tool set.
+Three modes:
+
+- **Unset (field omitted)** -> **no tools**. The LLM produces output but
+  cannot make any tool calls. This is identical to `tools: []`. Leaving the
+  field out does _not_ inherit the agent's tools.
+- **`tools: []`** -> **no tools**. Same as unset.
+- **`tools: [a, b, mcp:server-name]`** -> only those specific tools, and
+  nothing else. Entries are either exact tool names (matching `global_tools`,
+  agent custom tools, or individual MCP function names) or the shorthand
+  `mcp:<server-name>` (which enables all functions for that MCP server).
+
+Even when `tools` lists entries, the LLM receives **exactly** that set. The
+whitelist is enforced against global tools, agent custom tools, and MCP
+alike. Each entry is validated at startup against the active agent's tool
+list; an unknown entry is a startup error.
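
The whitelist semantics can be sketched as a small resolver. This is illustrative only: `resolve_tools`, `named_tools`, and `mcp_functions` are hypothetical names, and the way Loki stores its tool registry is not described in this document.

```python
def resolve_tools(whitelist, named_tools, mcp_functions):
    """Expand an llm node's `tools` whitelist into concrete tool names.

    named_tools:   set of global and agent custom tool names.
    mcp_functions: dict mapping an MCP server name to its function names.
    """
    all_mcp = {fn for fns in mcp_functions.values() for fn in fns}
    resolved = []
    for entry in whitelist or []:            # unset (None) behaves like []
        if entry.startswith("mcp:"):
            server = entry[len("mcp:"):]
            if server not in mcp_functions:
                raise ValueError(f"unknown MCP server in tools: {entry}")
            resolved.extend(mcp_functions[server])   # shorthand: every function
        elif entry in named_tools or entry in all_mcp:
            resolved.append(entry)
        else:
            raise ValueError(f"unknown tool in tools: {entry}")
    return resolved
```

An omitted field and an empty list both resolve to no tools, and an unknown entry raises, mirroring the startup error described above.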
+
+### Tolerant-fail routing
+
+| Outcome                     | Routes to                                 |
+|-----------------------------|-------------------------------------------|
+| Success                     | `next`                                    |
+| Failure WITH `fallback` set | `fallback`                                |
+| Failure WITHOUT `fallback`  | `next` (output is "LLM node failed: ...") |
+
+`state_updates` are always applied (success or failure). On failure,
+`{{output}}` resolves to an error description so downstream nodes can detect
+it.
+
+### Retries (`max_attempts`)
+
+`max_attempts` retries the LLM call **only on transient errors**, meaning
+the failure message contains one of: `timed out`, `rate limit`, `429`,
+`Connection reset`, `Connection refused`, or `produced no output`. Any
+other error fails immediately without consuming further attempts. The
+default is `1` (no retries).
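
A minimal sketch of this retry policy follows. It is not the executor's code: `is_transient` and `call_with_retries` are hypothetical helpers, and plain case-sensitive substring matching on the listed markers is an assumption.

```python
TRANSIENT_MARKERS = (
    "timed out",
    "rate limit",
    "429",
    "Connection reset",
    "Connection refused",
    "produced no output",
)


def is_transient(message: str) -> bool:
    # Assumption: a plain substring match against the failure message.
    return any(marker in message for marker in TRANSIENT_MARKERS)


def call_with_retries(call, max_attempts: int = 1):
    """Run `call()` up to max_attempts times, retrying transient errors only."""
    last_error = None
    for _ in range(max(1, max_attempts)):
        try:
            return call()
        except Exception as err:
            last_error = err
            if not is_transient(str(err)):
                break                    # non-transient: fail immediately
    raise last_error
```

With the default of `1`, the loop body runs once, so a failure of any kind is reported without a retry.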
+
+---
+
+## rag
+
+Runs a hybrid (vector + keyword) retrieval against a per-node knowledge base
+and writes the result into state. This is how a graph agent does
+Retrieval-Augmented Generation: the `rag` node retrieves context, downstream
+`llm`/`agent` nodes inject it into their prompts via normal templating.
+
+```yaml
+research_context:
+  id: research_context
+  type: rag
+  documents:                    # required; the knowledge sources
+    - ./knowledge/
+    - https://example.com/spec
+  query: "{{initial_prompt}}"   # templated; defaults to "{{initial_prompt}}"
+  top_k: 5                      # optional; default = the knowledge base's own top_k
+  timeout: 120                  # optional; retrieval timeout in seconds (default 120)
+  state_updates:                # required in practice (see below)
+    rag_context: "{{output.context}}"
+    rag_sources: "{{output.sources}}"
+  next: answer
+
+answer:
+  type: llm
+  prompt: |
+    Use this context to answer:
+    {{rag_context}}
+
+    Question: {{initial_prompt}}
+```
+
+- **`documents`:** Knowledge sources: files, directories, URLs, or
+  loader-protocol paths. **Required**. It's what makes the node a `rag`
+  node. Relative paths resolve against the agent's directory.
+- **`query`:** The retrieval query, templated against state. Defaults to
+  `{{initial_prompt}}`. Set it to `{{refined_query}}` to retrieve against a
+  query an upstream `llm` node produced.
+- **`top_k`:** Number of chunks to retrieve. Defaults to the knowledge
+  base's own configured `top_k`.
+- **`timeout`:** Retrieval timeout in seconds. Default 120.
+- **`state_updates`:** Where the result goes. A `rag` node with no
+  `state_updates` discards its result (the validator warns).
+
+**Knowledge-base build config** (all optional; used only when the knowledge
+base is first built):
+
+- **`embedding_model`:** Embedding model for the corpus.
+- **`chunk_size`:** Document chunk size.
+- **`chunk_overlap`:** Overlap between chunks.
+- **`reranker_model`:** Reranker applied to hybrid-search results.
+- **`batch_size`:** Embedding-request batch size.
+
+Each falls back to the app-level `rag_*` config when omitted. **When
+`embedding_model`, `chunk_size`, and `chunk_overlap` are all set, the
+knowledge base builds with no interactive prompts**, so a fully-specified
+`rag` node works in non-interactive runs.
+
+### `{{output}}` shape
+
+Inside `state_updates`, `{{output}}` is a JSON object:
+
+```json
+{
+  "context": "[Source: ./knowledge/a.md]\n...chunk...",
+  "sources": ["./knowledge/a.md", "https://example.com/spec"]
+}
+```
+
+- `{{output.context}}`: The retrieved context block, ready to inject into a
+  prompt.
+- `{{output.sources}}`: An array of source paths; `{{output.sources[0]}}`
+  indexes individual sources (useful for downstream citation/verification
+  nodes).
+
+### Knowledge base lifecycle
+
+Each `rag` node's knowledge base is built **once, at agent load time**, into
+`<agent-dir>/<node-id>.yaml`:
+
+- If that file exists -> it is loaded (no prompt; works non-interactively).
+- If it's missing and the node is **fully specified** (`embedding_model` +
+  `chunk_size` + `chunk_overlap` all set) -> it is built directly, no
+  prompts. Works in non-interactive runs.
+- If it's missing, not fully specified, and Loki is interactive -> you are
+  asked to initialize it, then prompted for the missing build values;
+  declining is a hard error.
+- If it's missing, not fully specified, and Loki is non-interactive
+  (no TTY) -> hard error, with a hint to set the build-config fields or run
+  the agent once interactively.
+
+A graph with a `rag` node whose knowledge base isn't built **cannot run**.
+This is deliberate fail-fast behavior. (In `--info` mode the agent is only
+inspected, not run, so knowledge-base building is skipped entirely.)
+
+### Retrieval
+
+Retrieval at execution time is fast (no re-embedding of the corpus). It's
+the same hybrid vector + keyword search normal Loki RAG uses. The corpus
+embedding/chunking cost is paid once, at load time.
+
+---
+
+## end
+
+Terminates execution and returns a final result.
+
+```yaml
+end_accepted:
+  id: end_accepted
+  type: end
+  output: |
+    Approved report:
+    {{report}}
+  state_updates:                # optional last state mutations
+    completed_at: "now"
+```
+
+- **`output`:** Templated against state, printed as the graph's final
+  result.
+- Multiple `end` nodes are fine; upstream routing decides which one the
+  run reaches.
+
+---
+
+# State and Template Syntax
+
+Graph state is a `serde_json::Value` map. Templates use `{{path}}` syntax
+inside any string field.
+
+| Form                          | Resolves to                                  |
+|-------------------------------|----------------------------------------------|
+| `{{key}}`                     | top-level value                              |
+| `{{a.b.c}}`                   | nested object path                           |
+| `{{arr[0]}}`                  | array index                                  |
+| `{{matrix[0][1]}}`            | nested array indices                         |
+| `{{users[0].name}}`           | object field via index                       |
+| `{{a.b.arr[2].field}}`        | mixed path                                   |
+
+Rendering rules per value type:
+
+- **String** -> as-is
+- **Number / bool / null** -> stringified (`true`, `42`, `null`)
+- **Array / Object** -> JSON-encoded compactly (`["a","b"]`, `{"k":"v"}`)
+
+Missing keys / paths behave differently per template-evaluation site:
+
+- Inside a node's primary fields (`prompt`, `instructions`, `question`,
+  `output`) -> strict mode, missing keys raise an error.
+- Inside `state_updates` values -> lenient mode, missing keys become empty
+  strings.
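
The path forms, rendering rules, and strict/lenient split above can be combined into a small interpolation sketch. This is an approximation for illustration, not Loki's template engine (which operates on `serde_json::Value`): `lookup`, `render_value`, and `interpolate` are hypothetical names, and tolerating spaces inside the braces is an assumption.

```python
import json
import re

_TOKEN = re.compile(r"([^.\[\]]+)|\[(\d+)\]")      # a key name or an [index]
_TEMPLATE = re.compile(r"\{\{\s*(.+?)\s*\}\}")


def lookup(state, path):
    """Walk a path such as `a.b.arr[2].field`; raise KeyError if it is missing."""
    value = state
    for name, index in _TOKEN.findall(path):
        try:
            value = value[int(index)] if index else value[name]
        except (KeyError, IndexError, TypeError):
            raise KeyError(path) from None
    return value


def render_value(value):
    if isinstance(value, str):
        return value                                      # strings as-is
    if isinstance(value, (dict, list)):
        return json.dumps(value, separators=(",", ":"))   # compact JSON
    return json.dumps(value)                              # true / 42 / null


def interpolate(template, state, strict=True):
    def replace(match):
        try:
            return render_value(lookup(state, match.group(1)))
        except KeyError:
            if strict:
                raise                                     # primary fields
            return ""                                     # state_updates values
    return _TEMPLATE.sub(replace, template)
```

Calling `interpolate(..., strict=True)` models a node's primary fields, while `strict=False` models `state_updates` values, where a missing path renders as an empty string.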
+
+---
+
+# state_updates
+
+Every node type (except `end`, which has a slightly different shape) accepts
+an optional `state_updates` map:
+
+```yaml
+state_updates:
+  some_key: "{{template}}"
+  other_key: "literal text with {{var}}"
+```
+
+After the node body executes, each template is interpolated against state and
+the result is stored under the corresponding key. Three scoped variables are
+available *only inside `state_updates`*:
+
+| Variable     | Available in          | Resolves to                                                                                                                   |
+|--------------|-----------------------|-------------------------------------------------------------------------------------------------------------------------------|
+| `{{output}}` | `agent`, `llm`, `rag` | The node's primary text output (the parsed JSON value if `output_schema` is set; the `context` / `sources` object for `rag`) |
+| `{{choice}}` | `approval`            | The option the user picked, or their free-form text                                                                           |
+| `{{input}}`  | `input`               | The user's text (or interpolated `default` if they submitted empty)                                                           |
+
+These variables are cleared after `state_updates` runs, so they don't leak
+into the next node's templates.
+
+> **End nodes are different.** An `end` node's `state_updates` runs with
+> plain lenient interpolation. There is no scoped `{{output}}` because
+> there is no node-body output to scope. After `state_updates` apply, the
+> `end` node's own `output` template is interpolated against the resulting
+> state and returned as the graph's final result.
+
+---
+
+# Routing & Tolerant-Fail
+
+Nodes route via three mechanisms in priority order:
+
+1. **Script `_next` override:** `script` nodes can set `"_next": "node_id"`
+   in their stdout JSON to dynamically choose the next node.
+2. **Internal routing:** `approval` routes via its `routes` map (or
+   `on_other` when the answer matches no listed option).
+3. **Default `next` edge:** the `next` field on the node.
+
+### Routing requirements per node type
+
+| Node type   | Needs `next`?                                                                                     |
+|-------------|---------------------------------------------------------------------------------------------------|
+| `agent`     | **Yes** - `next` is required (unless the agent node is unreachable). Error at runtime if missing. |
+| `script`    | Either `_next` from script output OR static `next` (or `fallback` on failure). Error if neither.  |
+| `approval`  | No - routing is via `routes` and `on_other`. `next` is ignored.                                   |
+| `input`     | **Yes** - `next` is the success route.                                                            |
+| `llm`       | **Yes** - `next` is the success route (and the default for failures without `fallback`).          |
+| `rag`       | **Yes** - `next` is required. Error at runtime if missing.                                        |
+| `end`       | No - terminal.                                                                                    |
+
+### Tolerant-fail contract
+
+Currently honored by `script` and `llm` nodes:
+
+- Success -> default routing
+- Failure with `fallback` set -> `fallback` target
+- Failure without `fallback` -> default routing, with the error description
+  exposed in state so the next node can react
+
+`agent` and `input` nodes do NOT have a tolerant-fail `fallback` path;
+their failures propagate as graph failures.
+
+---
+
+# Structured Output (`output_schema`)
+
+Both `llm` and `agent` nodes can specify an `output_schema` field: a JSON
+Schema (written inline in YAML) describing the expected shape of the node's
+output:
+
+```yaml
+extract_task:
+  type: llm
+  prompt: 'Parse: "{{raw_task}}"'
+  output_schema:
+    type: object
+    properties:
+      action: { type: string }
+      items:
+        type: array
+        items: { type: string }
+      time_minutes: { type: ["integer", "null"] }
+      priority:
+        type: string
+        enum: [low, medium, high]
+    required: [action, items, priority]
+```
+
+When `output_schema` is set:
+
+1. The node body runs normally.
+2. The raw text output is **tried as JSON first** (with light cleanup of
+   markdown code fences). This is the fast path: if parsing succeeds,
+   that's the structured output.
+3. Otherwise Loki invokes a built-in `__structured_output__` role
+   (constructed inline; not visible in the user's role list) to extract a
+   JSON object matching the schema. One repair retry on extractor failure.
+4. When the parsed value is a JSON **object**, its **top-level keys
+   auto-merge into state permanently** (a non-object result is still
+   reachable via `{{output}}` but has no top-level keys to merge).
+5. `{{output}}` (inside `state_updates`) resolves to the full parsed value.
+6. Explicit `state_updates` win over auto-merge if the same key is set in
+   both.
+
+After the example above, downstream nodes can use `{{action}}`, `{{items}}`,
+`{{items[0]}}`, `{{priority}}`, etc. directly.
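
The fast path and the merge precedence from the steps above can be sketched as follows. This is a simplified illustration, not Loki's code: `try_fast_path` and `merge_structured` are hypothetical, the fence cleanup is a guess at what "light cleanup" means, the extractor fallback (step 3) is omitted, and `explicit_updates` is assumed to hold already-interpolated values.

```python
import json

FENCE = "`" * 3   # built at runtime so this sketch contains no literal fence


def try_fast_path(raw_text):
    """Return (True, parsed_value) if the text is JSON, else (False, None)."""
    text = raw_text.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        # Drop the opening fence line (with any language tag) and the closing fence.
        text = "\n".join(text.splitlines()[1:-1])
    try:
        return True, json.loads(text)
    except ValueError:
        return False, None               # caller would fall back to the extractor


def merge_structured(state, parsed, explicit_updates):
    merged = dict(state)
    if isinstance(parsed, dict):
        merged.update(parsed)            # top-level keys auto-merge into state
    merged.update(explicit_updates)      # explicit state_updates win on conflict
    return merged
```

A non-object result (for example a bare array) leaves state untouched in `merge_structured` unless an explicit update stores it under a key.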
+
+### LLM nodes vs Agent nodes: schema-hint injection
+
+This is the **most important behavioral difference** between the two node
+types when `output_schema` is set:
+
+- **LLM nodes**: Loki automatically appends a schema hint to the prompt
+  (to the system prompt if `instructions` is set, otherwise to the user
+  prompt). The hint tells the model to respond with JSON matching the
+  schema. This means the main LLM call usually emits valid JSON directly ->
+  the fast path succeeds -> the extractor LLM call is skipped entirely
+  (cheaper, faster, more reliable).
+- **Agent nodes**: Loki does NOT inject any schema hint. Agents are
+  multi-turn with their own tool-use loop; stuffing a schema into the
+  initial prompt risks the agent fixating on JSON output instead of doing
+  its actual work. The agent runs to completion freely, and the extractor
+  converts its final text to JSON afterward.
+
+If you need an agent to emit JSON-shaped output, include schema language in
+its prompt yourself. The auto-injected hint for LLM nodes uses this form:
+
+```
+Respond with a JSON object that matches this schema. Output ONLY the JSON
+object with no surrounding prose or markdown fences.
+
+Schema:
+{...}
+```
+
+### Tolerant-fail for extraction
+
+- **LLM node**: extraction failure = node failure -> routes via `fallback`
+  or `next`.
+- **Agent node**: extraction failure propagates as a graph error (agent
+  nodes have no `fallback`).
+
+---
+
+# Worked Example
+
+This is a compact illustrative graph (`input` -> `llm` with `output_schema` ->
+`end`) that exercises structured output and several template-path forms. For a
+**full-featured reference** covering every node type and field, see the
+heavily-commented `graph.example.yaml` at the root of the Loki repository.
+
+Illustrative `graph.yaml`:
+
+```yaml
+name: structured-test
+version: "1.0"
+start: ask_task
+
+nodes:
+  ask_task:
+    id: ask_task
+    type: input
+    question: "Describe a task in free-form text."
+    validation: "len(input) > 0"
+    state_updates:
+      raw_task: "{{input}}"
+    next: extract_task
+
+  extract_task:
+    id: extract_task
+    type: llm
+    instructions: |
+      You are a task parser. If a field cannot be determined, use a sensible
+      default (empty array, null, or "medium" for priority).
+    prompt: 'Parse this task description: "{{raw_task}}"'
+    tools: []
+    output_schema:
+      type: object
+      properties:
+        action: { type: string }
+        items:
+          type: array
+          items: { type: string }
+        time_minutes: { type: ["integer", "null"] }
+        priority:
+          type: string
+          enum: [low, medium, high]
+        details:
+          type: object
+          properties:
+            urgent: { type: boolean }
+            deadline: { type: ["string", "null"] }
+          required: [urgent]
+      required: [action, items, priority, details]
+    next: done
+
+  done:
+    id: done
+    type: end
+    output: |
+      Action:        {{action}}
+      Priority:      {{priority}}
+      Time:          {{time_minutes}} min
+      Urgent?        {{details.urgent}}
+      First item:    {{items[0]}}
+      All items:     {{items}}
+```
+
+With the sample input `Buy groceries: milk, eggs, bread. About 15 minutes. Urgent.`,
+the state after `extract_task` might look like this:
+
+```json
+{
+  "raw_task": "Buy groceries: milk, eggs, bread. About 15 minutes. Urgent.",
+  "action": "buy",
+  "items": ["milk", "eggs", "bread"],
+  "time_minutes": 15,
+  "priority": "high",
+  "details": { "urgent": true, "deadline": null }
+}
+```
+
+---
+
+# Validation
+
+When `validate_before_run: true` (the default), Loki validates the graph at
+startup.
+
+**Errors (abort startup)**:
+
+- Start node missing or pointing to a non-existent node
+- Any `next` / `routes` / `fallback` / `on_other` target pointing to a
+  non-existent node
+- Any cycle in declared static edges. Cycles are always errors; the
+  per-node `max_loop_iterations` is a runtime safety net for
+  dynamically-routed loops, not a license for static cycles
+- Graph has zero `end` nodes (execution would never terminate)
+- `approval` option without a matching `routes` entry
+- `script` file path does not exist relative to the agent's directory
+- `agent` node references an agent name that doesn't exist in the
+  loki agents directory, or that exists but has neither a `config.yaml`
+  nor a `graph.yaml`
+- `rag` node with no `documents` (at least one knowledge source is required)
+- `llm` node referencing an unknown tool or `mcp:<server>` in its `tools`
+  whitelist, or an unknown `model`. Validated against the agent's tool,
+  MCP-server, and model sets
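
The edge-related checks can be sketched as ordinary graph traversals. The code below is an illustrative approximation, not Loki's implementation: the function names are hypothetical, and the graph is assumed to be already flattened into a dict mapping each node id to its declared static targets (`next`, `routes`, `fallback`, `on_other`).

```python
def find_dangling_targets(edges: dict) -> list:
    """Targets that point at a node id that does not exist."""
    return sorted({t for targets in edges.values() for t in targets if t not in edges})


def has_static_cycle(edges: dict) -> bool:
    """Depth-first search with three colors; a GRAY target means a back edge."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in edges}

    def visit(node) -> bool:
        color[node] = GRAY
        for target in edges[node]:
            if target not in color:
                continue  # dangling targets are reported separately
            if color[target] == GRAY:
                return True
            if color[target] == WHITE and visit(target):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in edges)


def unreachable_nodes(edges: dict, start: str) -> list:
    """Nodes that no chain of static edges from `start` can reach."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen or node not in edges:
            continue
        seen.add(node)
        stack.extend(edges[node])
    return sorted(set(edges) - seen)


# Hypothetical graph: `orphan` is never targeted by any static edge.
graph = {"extract_task": ["done"], "done": [], "orphan": ["done"]}
print(find_dangling_targets(graph), has_static_cycle(graph), unreachable_nodes(graph, "extract_task"))
```

In this model, the first two results correspond to startup errors, while an unreachable node is only worth a warning because a script could still route to it at runtime.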
+
+**Warnings (printed, execution continues)**:
+
+- Any node unreachable from the start via declared static edges
+- No `end` node reachable from the start via declared static edges
+- `approval` `routes` entry without a matching option
+- `rag` node with no `state_updates` (its retrieval result goes nowhere)
+
+> **Why some of these are warnings and not errors:** the validator only
+> follows **declared static edges** (`next`, `routes`, `fallback`,
+> `on_other`). Script nodes can also route dynamically at
+> runtime via `_next` in their JSON output, and those edges are invisible
+> to static analysis. To avoid false positives against dynamically-routed
+> graphs, "unreachable" and "no reachable end" are reported as warnings,
+> not errors.
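
For illustration, a script that routes dynamically might look like the sketch below. Several details are assumptions rather than documented behavior: that the JSON object is written to stdout, that an `attempts` counter exists, and the node ids `retry_step` and `done`. How the script receives state is not shown here, so the sketch takes a plain argument instead.

```python
import json


def build_output(attempts: int, max_attempts: int = 3) -> dict:
    """Pick the next node at runtime; this edge is invisible to the validator.

    `retry_step` and `done` are hypothetical node ids.
    """
    next_node = "retry_step" if attempts < max_attempts else "done"
    # The output must be a JSON object, even when it only carries `_next`.
    return {"_next": next_node}


if __name__ == "__main__":
    # Assumption: the script hands its JSON result back by printing it.
    print(json.dumps(build_output(attempts=0)))
```

Because neither edge is declared in `graph.yaml`, a graph that reaches its `end` node only through this script would trigger the "no reachable end" warning while still running correctly.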
+
+---
+
+# Invocation Entry Points
+
+A graph agent can be entered from three places, all of which seed the
+caller's prompt into state as `{{initial_prompt}}`:
+
+1. **Top-level CLI:** `loki -a my-graph-agent "user prompt here"`
+2. **REPL:** When the active agent has a `graph.yaml`, every user
+   message in the REPL runs the graph fresh; the message becomes
+   `{{initial_prompt}}`
+3. **Child-agent spawn:** When another (graph or normal) agent invokes
+   this one via Loki's sub-agent mechanism, the parent's request becomes
+   `{{initial_prompt}}` for the child graph
+
+After the graph finishes, any sub-agents this graph spawned via
+`agent`-type nodes are cancelled, so a graph cannot leak background tool
+loops. The graph's final `end` node output is what's returned to the
+caller.
+
+---
+
+# Streaming and Observability
+
+Graph execution has two observability channels:
+
+**1. stderr narration:** Dimmed `▸` lines that let you follow along in real
+time, regardless of log level:
+
+```
+▸ graph: my-agent (start: extract_task)
+▸ extract_task (llm)
+▸   llm call: model=<active> tools=<none>
+▸ extract_task -> done
+▸ done (end)
+▸ graph done in 2.41s
+```
+
+**2. `tracing` logs:** Structured `info!`/`debug!`/`warn!`/`error!`
+records gated by `RUST_LOG` (see [Configuration](#configuration) below).
+This is the developer-facing channel and includes:
+
+- Graph start / completion / failure
+- Per-node entry and routing decisions (`debug`)
+- A **performance summary** at completion, listing every node's visit
+  count and total/avg/max wall-clock time, slowest first:
+  ```
+  [graph:my-agent] performance summary (slowest first):
+  [graph:my-agent]   deep_research: 1 visit(s), total 8200ms, avg 8200ms, max 8200ms
+  [graph:my-agent]   extract_task: 1 visit(s), total 1400ms, avg 1400ms, max 1400ms
+  ```
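
The figures in the summary are simple per-node aggregates. The sketch below shows one way to derive them from raw timings. It is illustrative only: the function names are hypothetical, and it assumes "slowest first" means sorting by total time and that averages are rounded down to whole milliseconds.

```python
from collections import defaultdict


def summarize(samples):
    """Aggregate (node_id, elapsed_ms) pairs into per-node rows, slowest first."""
    per_node = defaultdict(list)
    for node_id, elapsed_ms in samples:
        per_node[node_id].append(elapsed_ms)

    rows = [
        {
            "node": node_id,
            "visits": len(times),
            "total_ms": sum(times),
            "avg_ms": sum(times) // len(times),  # assumption: integer milliseconds
            "max_ms": max(times),
        }
        for node_id, times in per_node.items()
    ]
    # Assumption: "slowest" is judged by total time spent in the node.
    rows.sort(key=lambda row: row["total_ms"], reverse=True)
    return rows


def format_row(row) -> str:
    """Render one row in the same shape as the log lines above."""
    return (
        f"{row['node']}: {row['visits']} visit(s), total {row['total_ms']}ms, "
        f"avg {row['avg_ms']}ms, max {row['max_ms']}ms"
    )


for row in summarize([("extract_task", 1400), ("deep_research", 8200)]):
    print(format_row(row))
```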
+
+**State snapshots**: when `log_state_snapshots: true` (the default), Loki
+logs the state's byte size and key list at `debug` level before each node
+runs, and the *full* state at `trace` level. The full state is
+deliberately kept at `trace` because graph state can contain secrets, so
+be careful when sharing `trace`-level logs.
+
+## Configuration
+
+Control the `tracing` channel with `RUST_LOG`:
+
+```sh
+RUST_LOG=loki::graph=debug    loki -a my-agent "..."   # graph debug logs
+RUST_LOG=loki::graph=trace    loki -a my-agent "..."   # + full state snapshots
+RUST_LOG=loki::graph=info     loki -a my-agent "..."   # start/end/perf summary
+```
+
+The stderr `▸` narration is always shown and is not affected by `RUST_LOG`.
+
+---
+
+# Limitations / Gotchas
+
+A short, honest list of things that bite people:
+
+- **A graph agent is `graph.yaml`-only**. It must not also have a
+  `config.yaml`. Having both files present is a hard load error.
+- **Graph agents do not support sessions**. A graph manages its own state
+  (`GraphState`), so there is no conversational history to persist.
+  Explicitly requesting a session is a hard error, whether via `--session`
+  on the CLI, a session name passed to `.agent` in the REPL, or running
+  `.session` while inside a graph agent. Any app-level `agent_session`
+  default is silently skipped for graph agents rather than applied.
+- **RAG is per-node, not agent-wide**. Graph agents do RAG via `rag`
+  nodes (each with its own knowledge base); there is no agent-wide
+  `documents` field at the `graph.yaml` top level.
+- **A `rag` node's knowledge base is built once, at load time**. Changing
+  a `rag` node's `documents` does not rebuild it. Delete
+  `<agent-dir>/<node-id>.yaml` to force a fresh build on next run.
+- **`on_other` is required on every `approval` node** because `user__ask`
+  always permits free-form responses (see [the approval section](#approval)).
+- **`validation` on `input` nodes is length-only**. The grammar is
+  `len(input) <op> <integer>` with `<op>` in `> >= < <= ==`. No regex, no
+  type coercion, no range checks. Use a follow-up `script` node for richer
+  validation.
+- **An `input` node's `default` is not re-validated.** When the user
+  submits an empty response and the `default` is substituted in, that
+  substituted value is *not* checked against `validation`. Make sure any
+  `default` you set would itself satisfy the `validation` predicate.
+- **Tool whitelist is `llm`-only**. `agent` nodes always use the child
+  agent's full tool universe. They ignore any `tools:` field. This is by
+  design: child agents own their tool surface.
+- **`{{output}}`, `{{choice}}`, `{{input}}` are scoped to `state_updates`**.
+  Outside `state_updates` (e.g. in another node's `prompt`), these
+  scoped variables are not available unless the previous node explicitly
+  stored them via `state_updates`. `end` nodes do NOT get a scoped
+  `{{output}}`. They have no node body output to scope.
+- **Schema-hint auto-injection happens for `llm` nodes only**, not
+  `agent` nodes (see [Structured Output](#structured-output-output_schema)).
+- **Script-output JSON must be an object**, not an array or primitive,
+  even if you only want to set `_next`.
+- **Cycles in declared static edges are always errors**. The per-node
+  `max_loop_iterations` is a runtime *safety net* for cycles built via
+  dynamic `script._next` routing, not permission to write static cycles.
+- **Schema version is fixed at `"1.0"`** today. Any other value is a
+  startup error.
+- **Script extensions are exactly `.sh`, `.py`, `.ts`**. No JavaScript,
+  no Ruby, no Lua. Python must be available as `python3` and TypeScript
+  requires `npx tsx` on PATH.
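
To show how narrow the `input` validation grammar is, here is a small sketch of a parser for `len(input) <op> <integer>`. It is an approximation for illustration, not Loki's parser: `parse_rule` is a hypothetical name, whitespace tolerance and non-negative integers are assumptions, and Python's `len` counts code points, which may differ from how Loki measures length.

```python
import operator
import re

_OPS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}
# Two-character operators come first so ">=" is not read as ">".
_RULE = re.compile(r"^\s*len\(input\)\s*(>=|<=|==|>|<)\s*(\d+)\s*$")


def parse_rule(rule: str):
    """Turn a rule string into a predicate over the user's input text."""
    match = _RULE.match(rule)
    if match is None:
        raise ValueError(f"unsupported validation rule: {rule!r}")
    compare, limit = _OPS[match.group(1)], int(match.group(2))
    return lambda text: compare(len(text), limit)


check = parse_rule("len(input) >= 3")
print(check("abc"), check("ab"))

# Since a substituted `default` is not re-validated, check it yourself.
default = "n/a"
print(check(default))
```

Anything outside this grammar, such as a regex or a numeric range on the value itself, is rejected by the sketch, which is the situation where a follow-up `script` node is the better tool.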
+
+---
+
+# See Also
+
+- [`graph.example.yaml`](https://github.com/Dark-Alex-17/loki/blob/main/graph.example.yaml) - A fully-commented, full-featured reference
+  graph agent at the root of the Loki repository (every top-level field,
+  every node type).
+- [Agents](Agents) - non-graph agent system (config.yaml + LLM loop)
+- [Custom Tools](Custom-Tools) - building `tools.sh` / `tools.py` /
+  `tools.ts` files for use in graph nodes
+- [Roles](Roles) - note that the built-in `__structured_output__` role used
+  by `output_schema` is intentionally internal and is not user-visible
+- [MCP Servers](MCP-Servers) - `mcp:<server>` shorthand inside an `llm`
+  node's `tools:` whitelist
@@ -34,6 +34,11 @@
  - [Sub-Agent Spawning](Agents#7-sub-agent-spawning-system)
  - [User Interaction Tools](Agents#8-user-interaction-tools)
  - [Built-In Agents](Agents#built-in-agents)
+- [Graph Agents](Graph-Agents)
+  - [Node Types](Graph-Agents#node-types)
+  - [State & Templates](Graph-Agents#state-and-template-syntax)
+  - [Structured Output](Graph-Agents#structured-output-output_schema)
+  - [Limitations](Graph-Agents#limitations--gotchas)

 ## Knowledge & Automation
 - [RAG](RAG)