feat: Created the step-runner graph agent for more deterministic coding workflows to produce even more reliable and higher-quality results

2026-07-04 12:50:37 -06:00
parent 159afbbc06
commit 9d2e936e7f
15 changed files with 1333 additions and 0 deletions
@@ -132,6 +132,7 @@ instructions: |
  | `librarian` | Find official docs, OSS examples, web best practices for EXTERNAL libraries | Read-only, returns citation-backed findings, fan out 1-3 in parallel |
  | `coder` | Write/edit files, implement features | Graph agent: plan → approval → implement → verify build+tests → self_review → bounded fix-loop |
  | `oracle` | Architecture, complex debugging, review, plan review | Advisory, blocking — never answer the user before collecting Oracle results |
  | `step-runner` | Execute ONE step of a phased plan repo (Phase 8) | Graph agent: orient → staleness check → coder → verify → handoff → user approval gate |
  ### When to fire `librarian` (external grep) vs `explore` (internal grep)
@@ -333,6 +334,10 @@ instructions: |
  ### Execution lifecycle (one step at a time)
  **Default: delegate the whole step to `step-runner`** — a graph agent that enforces the step protocol as graph edges (orient → staleness check → coder → verify → edge-case sweep → optional independent review → validated handoff → user approval gate): `agent__spawn --agent step-runner --prompt "Execute step <N> of the plan at <plans_dir>"`. It returns `STEP_COMPLETE` / `STEP_BLOCKED` / `STEP_REJECTED` / `STEP_FAILED`. Relay its escalations (deviation gate, approval gate) promptly. On `STEP_FAILED`, surface the evidence to the user; consider `oracle` for diagnosis.
  Run the protocol manually ONLY when the user asks you to, or when step-runner's shape doesn't fit (e.g. a docs-only step with nothing to build). Then:
  1. Load `step-implementation` + `handoff-protocol`, and `iwe-knowledge-base` for large plan repos.
  2. Follow the step protocol phase by phase: orient (previous handoff + `NOTES.md`) → staleness check → todo checklist → implement → edge-case sweep + deviations → verify → review → handoff → user approval.
  3. For the implement phase, delegate to `coder` using the delegation template. Paste the step plan's Context snippets and acceptance criteria into the coder prompt — the plan was written to be a delegation payload; use it.
@@ -0,0 +1,93 @@
 # Step-Runner
 A graph-based agent that executes **one step** of a phased implementation
 plan, with the step protocol from the `step-implementation` skill enforced
 as graph edges rather than prose. Designed to be delegated to by
 **[Sisyphus](../sisyphus/README.md)**; delegates implementation to
 **[Coder](../coder/README.md)** and independent review to
 **[code-reviewer](../code-reviewer/README.md)**.
 It expects a plan repo authored per the `plan-authoring` skill:
 ```
 plans/
  steps/NN-<slug>.md    # step plans with frontmatter (step/title/depends_on/status)
  handoffs/NN-<slug>.md # written by this agent, validated by a deterministic gate
  NOTES.md              # rolling durable facts
 ```
 ## Workflow
 ```
 resolve_step (script)         locate plan + previous handoff, check depends_on,
        ↓                     mark plan in-progress   [→ gate_blocked if deps unsatisfied]
 orient (llm, read-only)       merge handoff directives + staleness-check the plan
        ↓
 route_staleness (script)      major deviation → gate_deviation (approval)
        ↓
 implement (agent → coder)     coder runs its own build/test/self-review fix-loop
        ↓
 route_coder_result (script)   COMPLETE → verify | REJECTED / FAILED → end
        ↓
 verify_format_lint (script)   format BEFORE evidence, then lint
 verify_build (script)         step-level build/typecheck
 verify_tests (script)         FULL test suite
        ↓                     [failures → fix_loop_gate, back-edge to implement]
 edge_case_sweep (llm)         missed edge cases; annotate downstream plans
        ↓                     (Edge cases sections ONLY - scope changes become proposals)
 route_sweep (script)          5+ files or architectural boundary → independent_review
 independent_review (agent)    code-reviewer; 🔴 findings loop back to implement (bounded)
        ↓
 write_handoff (llm)           evidence-backed handoff per handoff-protocol + NOTES.md
 check_handoff (script)        deterministic schema gate; marks plan status complete
        ↓
 gate_user_review (approval)   HARD STOP - approve, or send revision comments
        ↓                     (revisions loop through implement → verify → handoff again)
 end_success / end_blocked / end_rejected / end_failure
 ```
 End nodes emit sentinel outcomes for the caller:
 - `STEP_COMPLETE` — step implemented, verified, handoff written, user approved.
 - `STEP_BLOCKED` — `depends_on` unsatisfied and the user declined to proceed.
 - `STEP_REJECTED` — user aborted at the deviation gate, or the coder's plan
  was rejected at its approval gate.
 - `STEP_FAILED` — coder failed, the step-level fix budget was exhausted, or
  the handoff failed validation twice.
 ## Usage
 ```sh
 # From the project root: run the next in-progress/pending step
 coyote -a step-runner "Execute the next step"
 # A specific step (also parsed from the prompt: "execute step 3")
 coyote -a step-runner --agent-variable step 3 "Execute step 3"
 # Plan repo somewhere else
 coyote -a step-runner --agent-variable plans_dir docs/plans "Execute the next step"
 ```
 **Invoke from the project root.** The coder sub-agent resolves its own
 `project_dir` from the invocation directory; overriding `project_dir` here
 does not propagate to the spawned coder.
 ## Tuning
 `graph.yaml` `initial_state` exposes:
 - `max_fix_attempts` (default `2`) — step-level fix budget (the coder has
  its own internal budget of 3).
 - `max_review_attempts` (default `1`) — bounded 🔴-finding fix loops after
  independent review.
 Environment overrides honored by the script nodes:
 - `FORMAT_CMD` / `LINT_CMD` — formatting and linting (otherwise a per-type
  heuristic formats, and linting defers to the build/check command).
 - `BUILD_CMD` / `TEST_CMD` — skip project-type detection (same as coder).
 - `STEP_AUTOAPPROVE=1` — bypass the deviation gate (non-interactive runs).
 - `STEP_SKIP_REVIEW=1` — never spawn the independent reviewer.
 The final user approval gate is never bypassed by an environment variable -
 it is the point of the workflow.
@@ -0,0 +1,599 @@
 name: step-runner
 description: |
  Executes ONE step of a phased implementation plan (plans/ repo) with the
  step protocol enforced as graph edges: orient -> staleness check ->
  implement (coder) -> verify -> edge-case sweep -> optional independent
  review -> evidence-backed handoff -> user approval gate. Designed to be
  delegated to by sisyphus.
 version: "1.0"
 global_tools:
  - fs_cat.sh
  - fs_ls.sh
  - fs_write.sh
  - fs_patch.sh
  - execute_command.sh
 skills_enabled: true
 enabled_skills:
  - step-implementation
  - handoff-protocol
  - code-review
  - ai-slop-remover
 variables:
  - name: project_dir
    description: |
      Absolute path to the project directory. Defaults to "." (the directory
      coyote was invoked from). The coder sub-agent resolves its own
      project_dir the same way, so invoke step-runner FROM the project root
      unless you override this for both.
    default: "."
  - name: plans_dir
    description: |
      Path to the plan repo. Relative paths resolve against project_dir.
      Expected layout: <plans_dir>/steps/NN-<slug>.md,
      <plans_dir>/handoffs/, <plans_dir>/NOTES.md.
    default: "plans"
  - name: step
    description: |
      Which step to execute: a step number, or "next" to pick the first
      in-progress (resume) or pending step plan.
    default: "next"
 settings:
  max_loop_iterations: 20
  log_state_snapshots: true
  validate_before_run: true
  timeout: 7200
 initial_state:
  project_dir: ""
  plans_dir: ""
  step_number: 0
  step_slug: ""
  step_title: ""
  step_plan_path: ""
  step_plan: ""
  prev_handoff_path: "(none)"
  prev_handoff: "(none - this is the first step)"
  notes_path: ""
  notes: "(none)"
  handoff_path: ""
  blocking_reason: ""
  plan_summary: ""
  implementation_brief: ""
  staleness_report: ""
  has_major_deviation: false
  deviation_summary: ""
  user_feedback: ""
  fix_instructions: ""
  fix_attempts: 0
  max_fix_attempts: 2
  coder_result: ""
  format_output: ""
  lint_ok: true
  lint_output: ""
  build_ok: true
  build_output: ""
  tests_ok: true
  tests_output: ""
  edge_case_report: ""
  downstream_updates: ""
  needs_independent_review: false
  review_report: ""
  review_attempts: 0
  max_review_attempts: 1
  handoff_attempts: 0
  handoff_fix: ""
  step_summary: ""
 start: resolve_step
 nodes:
  resolve_step:
    id: resolve_step
    type: script
    description: |
      Locate the step plan, previous handoff, and NOTES.md; parse frontmatter;
      check depends_on satisfaction against existing handoffs; mark the plan
      in-progress. Routes to gate_blocked when dependencies are unsatisfied.
    script: scripts/resolve_step.sh
    timeout: 30
    fallback: end_failure
    next: orient
  gate_blocked:
    id: gate_blocked
    type: approval
    description: Escalate unsatisfied dependencies instead of building on missing ground.
    question: |
      Step {{step_number}} ({{step_title}}) is BLOCKED:
      {{blocking_reason}}
      Proceed anyway?
    options:
      - "yes"
      - "no"
    routes:
      "yes": orient
      "no": end_blocked
    on_other: end_blocked
  orient:
    id: orient
    type: llm
    description: |
      Read-only orientation and staleness check: merge the previous handoff's
      directives with the step plan, then verify the plan's assumptions
      against the CURRENT codebase before any edit.
    skills_enabled: true
    enabled_skills:
      - step-implementation
    instructions: |
      You are orienting for one step of a phased implementation plan. Load
      `step-implementation` and apply its Orient and Staleness-check phases.
      You are READ-ONLY in this node: no edits, no fixes.
      1. Read the previous handoff (below). Note directives aimed at this
         step, deviations that changed the codebase, and bare assertions
         that need re-verification.
      2. Staleness-check the step plan against the code at {{project_dir}}:
         grep the symbols it references (via execute_command), read its
         Context snippets at their claimed locations with fs_cat, confirm
         its Test commands exist.
      3. Classify discrepancies per the skill's deviation table: minor
         (mechanics differ; correct silently in the brief) vs major (scope,
         approach, interfaces, or a later step's assumptions affected).
      Produce `implementation_brief`: the corrected, self-contained marching
      orders for the implementer - plan tasks in order, handoff directives
      applied, minor staleness corrections folded in, acceptance criteria
      restated. The implementer sees ONLY the step plan plus your brief.
    prompt: |
      ## Step plan ({{step_plan_path}})
      {{step_plan}}
      ## Previous handoff ({{prev_handoff_path}})
      {{prev_handoff}}
      ## Rolling project notes
      {{notes}}
    tools:
      - fs_cat
      - fs_ls
      - execute_command
    max_iterations: 20
    output_schema:
      type: object
      properties:
        plan_summary:
          type: string
          description: 1-3 sentences summarizing what this step delivers
        implementation_brief:
          type: string
          description: Corrected, self-contained instructions for the implementer
        staleness_report:
          type: string
          description: Findings from checking plan assumptions against current code; "clean" if none
        has_major_deviation:
          type: boolean
          description: True when a discrepancy changes scope, approach, or interfaces
        deviation_summary:
          type: string
          description: Major deviations only, with the plan claim vs current reality. Empty when none
      required: [plan_summary, implementation_brief, staleness_report, has_major_deviation, deviation_summary]
    fallback: end_failure
    next: route_staleness
  route_staleness:
    id: route_staleness
    type: script
    description: Major deviation -> user gate; otherwise straight to implement.
    script: scripts/route_staleness.sh
    timeout: 5
    fallback: implement
  gate_deviation:
    id: gate_deviation
    type: approval
    description: Major deviations are never silently absorbed - the user decides.
    question: |
      Step {{step_number}} ({{step_title}}): the plan no longer matches the
      codebase in a way that changes scope or approach.
      {{deviation_summary}}
      Staleness report:
      {{staleness_report}}
      Proceed with the corrected brief? (Answer with anything else to give
      your own guidance to the implementer.)
    options:
      - "proceed"
      - "abort"
    routes:
      "proceed": implement
      "abort": end_rejected
    on_other: implement
    state_updates:
      user_feedback: "{{choice}}"
  implement:
    id: implement
    type: agent
    description: |
      Delegate implementation to the coder graph agent, which runs its own
      plan -> implement -> build -> tests -> self-review fix-loop internally.
    agent: coder
    prompt: |
      ## TASK
      Execute step {{step_number}} ({{step_title}}) of a phased implementation
      plan for the project at {{project_dir}}.
      ## EXPECTED OUTCOME
      Every task in the step plan below is implemented and its acceptance
      criteria are met. Tests are derived from the Acceptance criteria
      section (not from the implementation). Build and full test suite pass.
      ## MUST DO
      - Follow the Orientation brief below - it supersedes the raw plan where
        they disagree (it folds in corrections from the staleness check).
      - Match the patterns pasted in the step plan's Context section.
      - Derive tests from the plan's Acceptance criteria.
      ## MUST NOT DO
      - Do not touch anything listed in the plan's Out of scope section.
      - Do not modify files under {{plans_dir}}.
      - Do not implement work belonging to other steps.
      ## CONTEXT
      ### Step plan
      {{step_plan}}
      ### Orientation brief (handoff directives + staleness corrections applied)
      {{implementation_brief}}
      ### User guidance (if any)
      {{user_feedback}}
      ### Fix loop status (empty on first attempt)
      {{fix_instructions}}
    timeout: 3600
    state_updates:
      coder_result: "{{output}}"
    next: route_coder_result
  route_coder_result:
    id: route_coder_result
    type: script
    description: Route on the coder sentinel - COMPLETE verifies, REJECTED/FAILED terminate.
    script: scripts/route_coder_result.sh
    timeout: 5
    fallback: end_failure
  verify_format_lint:
    id: verify_format_lint
    type: script
    description: |
      Format BEFORE evidence collection (FORMAT_CMD override or per-type
      heuristic), then lint (LINT_CMD, when configured). Lint failure routes
      to the fix loop.
    script: scripts/verify_format_lint.sh
    timeout: 300
    fallback: fix_loop_gate
  verify_build:
    id: verify_build
    type: script
    description: Step-level build/typecheck evidence, collected AFTER formatting.
    script: scripts/verify_build.sh
    timeout: 600
    fallback: fix_loop_gate
  verify_tests:
    id: verify_tests
    type: script
    description: FULL test suite - regressions in untouched code fail the step too.
    script: scripts/verify_tests.sh
    timeout: 1200
    fallback: fix_loop_gate
  fix_loop_gate:
    id: fix_loop_gate
    type: script
    description: |
      Step-level fix budget (the coder already ran its own internal fix
      loop). Loops to implement with fix_instructions, or ends as failure.
    script: scripts/fix_loop_gate.sh
    timeout: 5
    fallback: end_failure
  edge_case_sweep:
    id: edge_case_sweep
    type: llm
    description: |
      Post-implementation sweep: missed spots, edge cases, downstream plan
      implications. May annotate downstream plans' Edge cases sections
      (annotate vs propose per handoff-protocol). Also judges whether the
      change warrants an independent review pass.
    skills_enabled: true
    enabled_skills:
      - step-implementation
      - handoff-protocol
    instructions: |
      The implementation for this step just passed build and tests. Load
      `step-implementation` (edge-case sweep phase) and `handoff-protocol`
      (annotate-vs-propose rules), then:
      1. Read the changed code (the coder result below names the files).
         Look for edge cases the plan missed: empty inputs, error paths,
         concurrency, partial failure, compat.
      2. For each edge case belonging to a LATER step: check that step's
         plan under {{plans_dir}}/steps/. If its Edge cases section already
         covers it, done. If not, append an entry to that section via
         fs_patch - touch NOTHING else in the file.
      3. NEVER edit a later plan's Objective, Tasks, Acceptance criteria,
         or Out of scope. Scope-affecting changes become proposed diffs in
         `downstream_updates` instead.
      4. Set needs_independent_review=true when the change touched 5+ files
         or crosses architectural boundaries (auth, public APIs, schema,
         security-sensitive paths).
      Be terse. Findings, not prose.
    prompt: |
      ## Coder result
      {{coder_result}}
      ## Step plan
      {{step_plan}}
      ## Staleness report from orientation
      {{staleness_report}}
    tools:
      - fs_cat
      - fs_ls
      - fs_patch
      - execute_command
    max_iterations: 20
    output_schema:
      type: object
      properties:
        edge_case_report:
          type: string
          description: Edge cases discovered - both handled and punted, one per line. "none" if empty
        downstream_updates:
          type: string
          description: Annotations made (plan file + section) and proposed diffs for scope-affecting changes. "none" if empty
        needs_independent_review:
          type: boolean
      required: [edge_case_report, downstream_updates, needs_independent_review]
    fallback: write_handoff
    next: route_sweep
  route_sweep:
    id: route_sweep
    type: script
    description: Broad or boundary-crossing changes get an independent reviewer.
    script: scripts/route_sweep.sh
    timeout: 5
    fallback: write_handoff
  independent_review:
    id: independent_review
    type: agent
    description: Independent review pass - the author's self-review cannot catch its own rationalizations.
    agent: code-reviewer
    prompt: |
      Review the changes produced for step {{step_number}} ({{step_title}})
      of a phased implementation plan in {{project_dir}}.
      What the step was supposed to do:
      {{plan_summary}}
      Coder summary (names the modified/created files):
      {{coder_result}}
      Review the changed files against the step plan's acceptance criteria.
      Preserve severity tags in your findings.
    timeout: 1200
    state_updates:
      review_report: "{{output}}"
    next: route_review
  route_review:
    id: route_review
    type: script
    description: Critical findings loop back to implement (bounded); otherwise proceed to handoff.
    script: scripts/route_review.sh
    timeout: 5
    fallback: write_handoff
  write_handoff:
    id: write_handoff
    type: llm
    description: |
      Write the evidence-backed handoff per handoff-protocol and append
      durable facts to NOTES.md. The completion gate (check_handoff)
      verifies the document afterward.
    skills_enabled: true
    enabled_skills:
      - handoff-protocol
      - ai-slop-remover
    instructions: |
      Load `handoff-protocol` and follow its writer schema EXACTLY: the
      frontmatter (step, title, result) and all eight sections, writing
      "None" rather than omitting a section.
      Write the handoff to {{handoff_path}} with fs_write. Paste the
      verification evidence below verbatim into the Evidence section -
      commands, exit codes, decisive output lines. Deviations come from the
      staleness report, gate decisions, and fix loop history. Downstream
      plan updates come from the sweep results.
      Then append durable, step-independent facts (if any) to {{notes_path}}
      - create the file if missing, never rewrite existing entries.
      If "Gate feedback" below is non-empty, a previous handoff attempt
      failed validation - fix exactly what it lists.
    prompt: |
      ## Step
      {{step_number}} ({{step_title}}) - plan at {{step_plan_path}}
      ## Plan summary
      {{plan_summary}}
      ## Coder result
      {{coder_result}}
      ## Staleness report / deviations
      {{staleness_report}}
      Major deviation summary (if any): {{deviation_summary}}
      User guidance given (if any): {{user_feedback}}
      Fix loop attempts used: {{fix_attempts}} of {{max_fix_attempts}}
      ## Edge cases discovered
      {{edge_case_report}}
      ## Downstream plan updates
      {{downstream_updates}}
      ## Independent review report (if any)
      {{review_report}}
      ## Verification evidence (paste verbatim)
      ### Format
      {{format_output}}
      ### Lint
      {{lint_output}}
      ### Build
      {{build_output}}
      ### Tests
      {{tests_output}}
      ## Gate feedback
      {{handoff_fix}}
    tools:
      - fs_cat
      - fs_ls
      - fs_write
      - fs_patch
    max_iterations: 15
    output_schema:
      type: object
      properties:
        step_summary:
          type: string
          description: 3-6 sentence summary of the step for the user's approval decision - what was done, deviations, anything needing their attention
      required: [step_summary]
    fallback: end_failure
    next: check_handoff
  check_handoff:
    id: check_handoff
    type: script
    description: |
      Deterministic completion gate - handoff exists with frontmatter and all
      required sections. On success, marks the step plan status complete.
      One retry back to write_handoff, then failure.
    script: scripts/check_handoff.sh
    timeout: 10
    fallback: end_failure
  gate_user_review:
    id: gate_user_review
    type: approval
    description: The hard stop - the next step never starts without explicit approval.
    question: |
      ## Step {{step_number}} ({{step_title}}) - ready for review
      {{step_summary}}
      Handoff: {{handoff_path}}
      Build: {{build_ok}} | Tests: {{tests_ok}} | Fix attempts: {{fix_attempts}}/{{max_fix_attempts}}
      Approve this step? (Answer with anything else to send revision
      instructions straight to the implementer.)
    options:
      - "approve"
      - "revise"
    routes:
      "approve": end_success
      "revise": get_revision
    on_other: revise_from_choice
    state_updates:
      user_feedback: "{{choice}}"
  get_revision:
    id: get_revision
    type: input
    description: Collect revision instructions, then loop back through implement -> verify -> handoff.
    question: "What should change? Your comments go to the implementer verbatim."
    validation: "len(input) > 0"
    state_updates:
      fix_instructions: "{{input}}"
    next: implement
  revise_from_choice:
    id: revise_from_choice
    type: script
    description: Free-form approval answers are treated as revision instructions.
    script: scripts/revise_from_choice.sh
    timeout: 5
    fallback: get_revision
  end_success:
    id: end_success
    type: end
    output: |
      STEP_COMPLETE
      Step: {{step_number}} ({{step_title}})
      Plan: {{step_plan_path}}
      Handoff: {{handoff_path}}
      Build: passed | Tests: passed | Fix attempts: {{fix_attempts}}/{{max_fix_attempts}}
      {{step_summary}}
      Downstream plan updates:
      {{downstream_updates}}
  end_blocked:
    id: end_blocked
    type: end
    output: |
      STEP_BLOCKED
      Step: {{step_number}} ({{step_title}})
      Reason:
      {{blocking_reason}}
  end_rejected:
    id: end_rejected
    type: end
    output: |
      STEP_REJECTED
      Step: {{step_number}} ({{step_title}})
      Rejected at: deviation gate or coder approval gate.
      Deviation summary:
      {{deviation_summary}}
      Coder result (if it ran):
      {{coder_result}}
  end_failure:
    id: end_failure
    type: end
    output: |
      STEP_FAILED
      Step: {{step_number}} ({{step_title}})
      Fix attempts: {{fix_attempts}}/{{max_fix_attempts}}
      Blocking reason (if resolution failed): {{blocking_reason}}
      Coder result:
      {{coder_result}}
      Last build output:
      {{build_output}}
      Last tests output:
      {{tests_output}}
@@ -0,0 +1,54 @@
 #!/usr/bin/env bash
 set -uo pipefail
 if [[ -n "${GRAPH_STATE_FILE:-}" ]]; then
  state=$(cat "$GRAPH_STATE_FILE")
 elif [[ -n "${GRAPH_STATE:-}" ]]; then
  state="$GRAPH_STATE"
 else
  state='{}'
 fi
 handoff_path=$(echo "$state" | jq -r '.handoff_path // ""')
 step_plan_path=$(echo "$state" | jq -r '.step_plan_path // ""')
 handoff_attempts=$(echo "$state" | jq -r '.handoff_attempts // 0')
 problems=""
 if [[ ! -f "$handoff_path" ]]; then
  problems="- handoff file does not exist at $handoff_path"$'\n'
 else
  content=$(cat "$handoff_path")
  grep -qE '^result:[[:space:]]*(complete|partial|blocked)' <<< "$content" \
    || problems+="- frontmatter is missing 'result: complete|partial|blocked'"$'\n'
  for section in "Summary" "Completed" "Not completed" "Deviations" "Downstream plan updates" "Edge cases discovered" "Evidence" "Notes for next step"; do
    grep -qE "^##[[:space:]]+${section}" <<< "$content" \
      || problems+="- missing required section: ## ${section}"$'\n'
  done
 fi
 if [[ -z "$problems" ]]; then
  if [[ -f "$step_plan_path" ]]; then
    tmp=$(mktemp)
    awk 'BEGIN{n=0} /^---[[:space:]]*$/{n++; print; next} n==1 && /^status:/{print "status: complete"; next} {print}' "$step_plan_path" > "$tmp" && mv "$tmp" "$step_plan_path"
  fi
  jq -nc '{"handoff_fix": "", "_next": "gate_user_review"}'
  exit 0
 fi
 if (( handoff_attempts >= 1 )); then
  jq -nc \
    --arg br "Handoff failed validation twice. Problems:
 $problems" \
    '{"blocking_reason": $br, "_next": "end_failure"}'
  exit 0
 fi
 jq -nc \
  --arg hf "The previous handoff attempt failed validation. Fix exactly these problems:
 $problems" \
  '{
    "handoff_attempts": 1,
    "handoff_fix": $hf,
    "_next": "write_handoff"
  }'
@@ -0,0 +1,60 @@
 #!/usr/bin/env bash
 set -euo pipefail
 if [[ -n "${GRAPH_STATE_FILE:-}" ]]; then
  state=$(cat "$GRAPH_STATE_FILE")
 elif [[ -n "${GRAPH_STATE:-}" ]]; then
  state="$GRAPH_STATE"
 else
  state='{}'
 fi
 fix_attempts=$(echo "$state" | jq -r '.fix_attempts // 0')
 max_fix_attempts=$(echo "$state" | jq -r '.max_fix_attempts // 2')
 lint_ok=$(echo "$state" | jq -r '.lint_ok | if . == null then "true" else (. | tostring) end')
 build_ok=$(echo "$state" | jq -r '.build_ok | if . == null then "true" else (. | tostring) end')
 tests_ok=$(echo "$state" | jq -r '.tests_ok | if . == null then "true" else (. | tostring) end')
 lint_output=$(echo "$state" | jq -r '.lint_output // ""')
 build_output=$(echo "$state" | jq -r '.build_output // ""')
 tests_output=$(echo "$state" | jq -r '.tests_output // ""')
 if (( fix_attempts >= max_fix_attempts )); then
  jq -nc \
    --argjson n "$fix_attempts" \
    '{
      "fix_attempts": $n,
      "_next": "end_failure"
    }'
  exit 0
 fi
 next_attempts=$((fix_attempts + 1))
 if [[ "$lint_ok" != "true" ]]; then
  stage="lint"
  output="$lint_output"
 elif [[ "$build_ok" != "true" ]]; then
  stage="build"
  output="$build_output"
 elif [[ "$tests_ok" != "true" ]]; then
  stage="full test suite"
  output="$tests_output"
 else
  stage="verification"
  output="fix_loop_gate was reached but no failing stage was recorded. Re-run verification."
 fi
 fix_instructions=$(printf '## Fix loop status (step-level attempt %d of %d)\n\nThe implementation passed the coder'"'"'s internal checks but failed step-level verification at the %s stage.\n\nOutput:\n```\n%s\n```\n\nIdentify the minimal fix and apply it. Do not refactor. Regressions in untouched code caused by this change are in scope.' \
  "$next_attempts" "$max_fix_attempts" "$stage" "$output")
 jq -nc \
  --argjson n "$next_attempts" \
  --arg 'fi' "$fix_instructions" \
  '{
    "fix_attempts": $n,
    "fix_instructions": $fi,
    "lint_ok": true,
    "build_ok": true,
    "tests_ok": true,
    "_next": "implement"
  }'
@@ -0,0 +1,152 @@
 #!/usr/bin/env bash
 set -uo pipefail
 if [[ -n "${GRAPH_STATE_FILE:-}" ]]; then
  state=$(cat "$GRAPH_STATE_FILE")
 elif [[ -n "${GRAPH_STATE:-}" ]]; then
  state="$GRAPH_STATE"
 else
  state='{}'
 fi
 fail() {
  jq -nc --arg r "$1" '{"blocking_reason": $r, "_next": "end_failure"}'
  exit 0
 }
 project_dir="${LLM_AGENT_VAR_PROJECT_DIR:-.}"
 project_dir=$(cd "$project_dir" 2>/dev/null && pwd) || fail "project_dir does not exist: $project_dir"
 plans_dir="${LLM_AGENT_VAR_PLANS_DIR:-plans}"
 [[ "$plans_dir" != /* ]] && plans_dir="$project_dir/$plans_dir"
 steps_dir="$plans_dir/steps"
 handoffs_dir="$plans_dir/handoffs"
 notes_path="$plans_dir/NOTES.md"
 [[ -d "$steps_dir" ]] || fail "No step plans directory at $steps_dir (expected <plans_dir>/steps/NN-<slug>.md)"
 frontmatter() {
  awk '/^---[[:space:]]*$/{n++; next} n==1{print} n>=2{exit}' "$1"
 }
 fm_value() {
  echo "$1" | grep -E "^$2:" | head -1 | sed -E "s/^$2:[[:space:]]*//" | sed -E 's/^["'"'"']|["'"'"']$//g'
 }
 step="${LLM_AGENT_VAR_STEP:-next}"
 if [[ "$step" == "next" ]]; then
  prompt_step=$(echo "$state" | jq -r '.initial_prompt // ""' | grep -oiE 'step[[:space:]#:]*[0-9]+' | head -1 | grep -oE '[0-9]+' || true)
  [[ -n "$prompt_step" ]] && step="$prompt_step"
 fi
 plan_file=""
 if [[ "$step" == "next" ]]; then
  first_pending=""
  while IFS= read -r f; do
    st=$(fm_value "$(frontmatter "$f")" "status")
    if [[ "$st" == "in-progress" ]]; then
      plan_file="$f"
      break
    fi
    [[ -z "$first_pending" && ( "$st" == "pending" || -z "$st" ) ]] && first_pending="$f"
  done < <(find "$steps_dir" -maxdepth 1 -name '*.md' | sort)
  [[ -z "$plan_file" ]] && plan_file="$first_pending"
  [[ -z "$plan_file" ]] && fail "No in-progress or pending step plans in $steps_dir"
 else
  [[ "$step" =~ ^[0-9]+$ ]] || fail "step must be a number or 'next'; got: $step"
  padded=$(printf '%02d' "$((10#$step))")
  plan_file=$(find "$steps_dir" -maxdepth 1 \( -name "${padded}-*.md" -o -name "${step}-*.md" \) | sort | head -1)
  [[ -n "$plan_file" ]] || fail "No step plan matching step $step in $steps_dir"
 fi
 bn=$(basename "$plan_file" .md)
 num_part="${bn%%-*}"
 [[ "$num_part" =~ ^[0-9]+$ ]] || fail "Step plan filename must start with a number: $bn"
 step_number=$((10#$num_part))
 step_slug="${bn#*-}"
 fm=$(frontmatter "$plan_file")
 step_title=$(fm_value "$fm" "title")
 [[ -z "$step_title" ]] && step_title="$step_slug"
 deps=$(echo "$fm" | awk '/^depends_on:/{f=1; print; next} f && /^[[:space:]]*-/{print; next} f{exit}' | grep -oE '[0-9]+' || true)
 unsatisfied=""
 for dep in $deps; do
  dep_padded=$(printf '%02d' "$((10#$dep))")
  dep_handoff=$(find "$handoffs_dir" -maxdepth 1 \( -name "${dep_padded}-*.md" -o -name "${dep}-*.md" \) 2>/dev/null | sort | head -1)
  if [[ -z "$dep_handoff" ]]; then
    unsatisfied+="- step $dep: no handoff found (step not executed?)"$'\n'
    continue
  fi
  dep_result=$(fm_value "$(frontmatter "$dep_handoff")" "result")
  if [[ "$dep_result" != "complete" ]]; then
    unsatisfied+="- step $dep: handoff result is '$dep_result' (not complete): $dep_handoff"$'\n'
  fi
 done
 prev_handoff_path="(none)"
 prev_handoff="(none - this is the first step)"
 prev_file=""
 prev_num=0
 while IFS= read -r h; do
  hn="${h##*/}"
  hn="${hn%%-*}"
  [[ "$hn" =~ ^[0-9]+$ ]] || continue
  n=$((10#$hn))
  if (( n < step_number && n >= prev_num )); then
    prev_num=$n
    prev_file="$h"
  fi
 done < <(find "$handoffs_dir" -maxdepth 1 -name '*.md' 2>/dev/null | sort)
 if [[ -n "$prev_file" ]]; then
  prev_handoff_path="$prev_file"
  prev_handoff=$(head -c 16000 "$prev_file")
 fi
 notes="(none)"
 [[ -f "$notes_path" ]] && notes=$(head -c 8000 "$notes_path")
 step_plan=$(head -c 24000 "$plan_file")
 handoff_path="$handoffs_dir/$(basename "$plan_file")"
 tmp=$(mktemp)
 awk 'BEGIN{n=0} /^---[[:space:]]*$/{n++; print; next} n==1 && /^status:/{print "status: in-progress"; next} {print}' "$plan_file" > "$tmp" && mv "$tmp" "$plan_file"
 next_node="orient"
 blocking_reason=""
 if [[ -n "$unsatisfied" ]]; then
  next_node="gate_blocked"
  blocking_reason="Unsatisfied dependencies:"$'\n'"$unsatisfied"
 fi
 jq -nc \
  --arg pd "$project_dir" \
  --arg pl "$plans_dir" \
  --argjson sn "$step_number" \
  --arg ss "$step_slug" \
  --arg st "$step_title" \
  --arg spp "$plan_file" \
  --arg sp "$step_plan" \
  --arg php "$prev_handoff_path" \
  --arg ph "$prev_handoff" \
  --arg np "$notes_path" \
  --arg no "$notes" \
  --arg hp "$handoff_path" \
  --arg br "$blocking_reason" \
  --arg nx "$next_node" \
  '{
    "project_dir": $pd,
    "plans_dir": $pl,
    "step_number": $sn,
    "step_slug": $ss,
    "step_title": $st,
    "step_plan_path": $spp,
    "step_plan": $sp,
    "prev_handoff_path": $php,
    "prev_handoff": $ph,
    "notes_path": $np,
    "notes": $no,
    "handoff_path": $hp,
    "blocking_reason": $br,
    "_next": $nx
  }'
@@ -0,0 +1,27 @@
 #!/usr/bin/env bash
 set -euo pipefail
 if [[ -n "${GRAPH_STATE_FILE:-}" ]]; then
  state=$(cat "$GRAPH_STATE_FILE")
 elif [[ -n "${GRAPH_STATE:-}" ]]; then
  state="$GRAPH_STATE"
 else
  state='{}'
 fi
 feedback=$(echo "$state" | jq -r '.user_feedback // ""')
 if [[ -z "$feedback" ]]; then
  jq -nc '{"_next": "get_revision"}'
  exit 0
 fi
 fix_instructions=$(printf '## Revision requested by the user at the step approval gate\n\nAddress these comments with minimal edits, then the step re-verifies and the handoff is rewritten:\n\n%s' \
  "$feedback")
 jq -nc \
  --arg 'fi' "$fix_instructions" \
  '{
    "fix_instructions": $fi,
    "_next": "implement"
  }'
@@ -0,0 +1,27 @@
 #!/usr/bin/env bash
 set -euo pipefail
 if [[ -n "${GRAPH_STATE_FILE:-}" ]]; then
  state=$(cat "$GRAPH_STATE_FILE")
 elif [[ -n "${GRAPH_STATE:-}" ]]; then
  state="$GRAPH_STATE"
 else
  state='{}'
 fi
 coder_result=$(echo "$state" | jq -r '.coder_result // ""')
 case "$coder_result" in
  *CODER_COMPLETE*)
    jq -nc '{"_next": "verify_format_lint"}'
    ;;
  *CODER_REJECTED*)
    jq -nc '{"_next": "end_rejected"}'
    ;;
  *CODER_FAILED*)
    jq -nc '{"blocking_reason": "coder fix-loop exhausted; see coder result", "_next": "end_failure"}'
    ;;
  *)
    jq -nc '{"blocking_reason": "coder returned no recognizable sentinel (expected CODER_COMPLETE / CODER_REJECTED / CODER_FAILED)", "_next": "end_failure"}'
    ;;
 esac
@@ -0,0 +1,38 @@
 #!/usr/bin/env bash
 set -euo pipefail
 if [[ -n "${GRAPH_STATE_FILE:-}" ]]; then
  state=$(cat "$GRAPH_STATE_FILE")
 elif [[ -n "${GRAPH_STATE:-}" ]]; then
  state="$GRAPH_STATE"
 else
  state='{}'
 fi
 review_report=$(echo "$state" | jq -r '.review_report // ""')
 review_attempts=$(echo "$state" | jq -r '.review_attempts // 0')
 max_review_attempts=$(echo "$state" | jq -r '.max_review_attempts // 1')
 if ! grep -qF "🔴" <<< "$review_report"; then
  jq -nc '{"_next": "write_handoff"}'
  exit 0
 fi
 if (( review_attempts >= max_review_attempts )); then
  jq -nc '{"_next": "write_handoff"}'
  exit 0
 fi
 next_review=$((review_attempts + 1))
 fix_instructions=$(printf '## Independent review findings (attempt %d of %d)\n\nAn independent reviewer flagged CRITICAL (🔴) findings. Address ONLY the 🔴 findings with minimal edits. Do not refactor unrelated code.\n\n%s' \
  "$next_review" "$max_review_attempts" "$review_report")
 jq -nc \
  --argjson n "$next_review" \
  --arg 'fi' "$fix_instructions" \
  '{
    "review_attempts": $n,
    "fix_instructions": $fi,
    "needs_independent_review": false,
    "_next": "implement"
  }'
@@ -0,0 +1,23 @@
 #!/usr/bin/env bash
 set -euo pipefail
 if [[ -n "${GRAPH_STATE_FILE:-}" ]]; then
  state=$(cat "$GRAPH_STATE_FILE")
 elif [[ -n "${GRAPH_STATE:-}" ]]; then
  state="$GRAPH_STATE"
 else
  state='{}'
 fi
 has_major=$(echo "$state" | jq -r '.has_major_deviation // false')
 if [[ "${STEP_AUTOAPPROVE:-0}" == "1" ]]; then
  jq -nc '{"_next": "implement"}'
  exit 0
 fi
 if [[ "$has_major" == "true" ]]; then
  jq -nc '{"_next": "gate_deviation"}'
 else
  jq -nc '{"_next": "implement"}'
 fi
@@ -0,0 +1,23 @@
 #!/usr/bin/env bash
 set -euo pipefail
 if [[ -n "${GRAPH_STATE_FILE:-}" ]]; then
  state=$(cat "$GRAPH_STATE_FILE")
 elif [[ -n "${GRAPH_STATE:-}" ]]; then
  state="$GRAPH_STATE"
 else
  state='{}'
 fi
 needs_review=$(echo "$state" | jq -r '.needs_independent_review // false')
 if [[ "${STEP_SKIP_REVIEW:-0}" == "1" ]]; then
  jq -nc '{"_next": "write_handoff"}'
  exit 0
 fi
 if [[ "$needs_review" == "true" ]]; then
  jq -nc '{"_next": "independent_review"}'
 else
  jq -nc '{"_next": "write_handoff"}'
 fi
@@ -0,0 +1,57 @@
 #!/usr/bin/env bash
 set -uo pipefail
 # shellcheck disable=SC1091
 source "$(dirname "$0")/../../.shared/utils.sh"
 if [[ -n "${GRAPH_STATE_FILE:-}" ]]; then
  state=$(cat "$GRAPH_STATE_FILE")
 elif [[ -n "${GRAPH_STATE:-}" ]]; then
  state="$GRAPH_STATE"
 else
  state='{}'
 fi
 project_dir=$(echo "$state" | jq -r '.project_dir // "."')
 if [[ -n "${BUILD_CMD:-}" ]]; then
  cmd="$BUILD_CMD"
 else
  project_info=$(detect_project "$project_dir")
  cmd=$(echo "$project_info" | jq -r '.check // .build // ""')
 fi
 if [[ -z "$cmd" || "$cmd" == "null" ]]; then
  jq -nc '{
    "build_ok": true,
    "build_output": "(no build/check command available for this project type)",
    "_next": "verify_tests"
  }'
  exit 0
 fi
 exit_code=0
 output=$(cd "$project_dir" && eval "$cmd" 2>&1) || exit_code=$?
 if (( exit_code == 0 )); then
  jq -nc \
    --arg out "Ran: $cmd
 $output" \
    '{
      "build_ok": true,
      "build_output": $out,
      "_next": "verify_tests"
    }'
 else
  jq -nc \
    --arg out "Ran: $cmd
 Exit code: $exit_code
 $output" \
    '{
      "build_ok": false,
      "build_output": $out,
      "_next": "fix_loop_gate"
    }'
 fi
@@ -0,0 +1,79 @@
 #!/usr/bin/env bash
 set -uo pipefail
 # shellcheck disable=SC1091
 source "$(dirname "$0")/../../.shared/utils.sh"
 if [[ -n "${GRAPH_STATE_FILE:-}" ]]; then
  state=$(cat "$GRAPH_STATE_FILE")
 elif [[ -n "${GRAPH_STATE:-}" ]]; then
  state="$GRAPH_STATE"
 else
  state='{}'
 fi
 project_dir=$(echo "$state" | jq -r '.project_dir // "."')
 project_type=$(detect_project "$project_dir" | jq -r '.type // "unknown"')
 format_cmd="${FORMAT_CMD:-}"
 if [[ -z "$format_cmd" ]]; then
  case "$project_type" in
    rust) format_cmd="cargo fmt" ;;
    go) format_cmd="gofmt -w ." ;;
    python) command -v ruff &>/dev/null && format_cmd="ruff format ." ;;
  esac
 fi
 if [[ -z "$format_cmd" ]]; then
  format_output="(no format command configured for project type '$project_type'; skipped. Set FORMAT_CMD to enable.)"
 else
  fmt_rc=0
  fmt_out=$(cd "$project_dir" && eval "$format_cmd" 2>&1) || fmt_rc=$?
  format_output="Ran: $format_cmd
 Exit code: $fmt_rc
 $fmt_out"
 fi
 lint_cmd="${LINT_CMD:-}"
 if [[ -z "$lint_cmd" ]]; then
  jq -nc \
    --arg fo "$format_output" \
    '{
      "format_output": $fo,
      "lint_ok": true,
      "lint_output": "(no LINT_CMD configured; linting is covered by the build/check command)",
      "_next": "verify_build"
    }'
  exit 0
 fi
 lint_rc=0
 lint_out=$(cd "$project_dir" && eval "$lint_cmd" 2>&1) || lint_rc=$?
 if (( lint_rc == 0 )); then
  jq -nc \
    --arg fo "$format_output" \
    --arg lo "Ran: $lint_cmd
 $lint_out" \
    '{
      "format_output": $fo,
      "lint_ok": true,
      "lint_output": $lo,
      "_next": "verify_build"
    }'
 else
  jq -nc \
    --arg fo "$format_output" \
    --arg lo "Ran: $lint_cmd
 Exit code: $lint_rc
 $lint_out" \
    '{
      "format_output": $fo,
      "lint_ok": false,
      "lint_output": $lo,
      "_next": "fix_loop_gate"
    }'
 fi
@@ -0,0 +1,57 @@
 #!/usr/bin/env bash
 set -uo pipefail
 # shellcheck disable=SC1091
 source "$(dirname "$0")/../../.shared/utils.sh"
 if [[ -n "${GRAPH_STATE_FILE:-}" ]]; then
  state=$(cat "$GRAPH_STATE_FILE")
 elif [[ -n "${GRAPH_STATE:-}" ]]; then
  state="$GRAPH_STATE"
 else
  state='{}'
 fi
 project_dir=$(echo "$state" | jq -r '.project_dir // "."')
 if [[ -n "${TEST_CMD:-}" ]]; then
  cmd="$TEST_CMD"
 else
  project_info=$(detect_project "$project_dir")
  cmd=$(echo "$project_info" | jq -r '.test // ""')
 fi
 if [[ -z "$cmd" || "$cmd" == "null" ]]; then
  jq -nc '{
    "tests_ok": true,
    "tests_output": "(no test command available for this project type)",
    "_next": "edge_case_sweep"
  }'
  exit 0
 fi
 exit_code=0
 output=$(cd "$project_dir" && eval "$cmd" 2>&1) || exit_code=$?
 if (( exit_code == 0 )); then
  jq -nc \
    --arg out "Ran: $cmd
 $output" \
    '{
      "tests_ok": true,
      "tests_output": $out,
      "_next": "edge_case_sweep"
    }'
 else
  jq -nc \
    --arg out "Ran: $cmd
 Exit code: $exit_code
 $output" \
    '{
      "tests_ok": false,
      "tests_output": $out,
      "_next": "fix_loop_gate"
    }'
 fi
@@ -5116,6 +5116,45 @@ mod tests {
        assert!(paths::skill_file("frontend-ui-ux").exists());
    }
    #[test]
    #[serial]
    fn bundled_graph_agents_parse_and_validate() {
        use crate::graph::GraphParser;
        use crate::graph::validator::GraphValidator;
        let _guard = TestConfigDirGuard::new();
        Agent::install_builtin_agents(false).unwrap();
        Skill::install_builtin_skills(false).unwrap();
        let mut checked = Vec::new();
        for entry in std::fs::read_dir(paths::agents_data_dir()).unwrap() {
            let dir = entry.unwrap().path();
            let graph_path = dir.join("graph.yaml");
            if !graph_path.exists() {
                continue;
            }
            let name = dir.file_name().unwrap().to_string_lossy().to_string();
            let graph = GraphParser::new(&dir)
                .load_from_file(&graph_path)
                .unwrap_or_else(|e| panic!("graph.yaml for '{name}' failed to parse: {e}"));
            let result = GraphValidator::new(&dir).validate(&graph);
            assert!(
                result.errors.is_empty(),
                "graph.yaml for '{name}' failed validation: {:#?}",
                result.errors
            );
            checked.push(name);
        }
        checked.sort();
        for expected in ["coder", "librarian", "step-runner"] {
            assert!(
                checked.iter().any(|n| n == expected),
                "expected bundled graph agent '{expected}' to be checked; found {checked:?}"
            );
        }
    }
    #[test]
    #[serial]
    fn install_functions_force_preserves_user_mcp_json() {