feat: Refactored the sisyhpus agent system to utilize the new skills system to improve performance and reliability

2026-06-02 13:14:25 -06:00
parent b1782b614f
commit c17db05f39
10 changed files with 790 additions and 261 deletions
@@ -0,0 +1,69 @@
+---
+description: Structured 6-section delegation template and session-continuity rules for orchestrating sub-agents. Load before spawning any agent.
+---
+You are delegating work to a sub-agent. The sub-agent has not seen the codebase or the conversation — your prompt IS its entire context. Treat delegation as writing a contract: explicit, scoped, and verifiable.
+
+## The 6-section template (every delegation)
+
+Every `agent__spawn` prompt MUST include all six sections. Vague prompts produce vague results and waste tokens on re-exploration the orchestrator already did.
+
+```
+## TASK
+[One atomic goal. One verb. One outcome. No "and also".]
+
+## EXPECTED OUTCOME
+[Concrete deliverables and success criteria. "I will know this is done when ..."]
+
+## REQUIRED TOOLS
+[Explicit allowlist: fs_read, fs_grep, etc. Prevents tool sprawl.]
+
+## MUST DO
+[Exhaustive requirements. Leave nothing implicit. If you'd be annoyed by the agent not doing X, list X.]
+
+## MUST NOT DO
+[Forbidden actions. Anticipate rogue behavior. "Do not modify files outside src/auth/."]
+
+## CONTEXT
+[File paths, code snippets, existing patterns, constraints. Paste actual code lines from prior exploration — not just file paths.]
+```
+
+## Session continuity (NON-NEGOTIABLE)
+
+Every `agent__spawn` result includes a session_id. **Use it.**
+
+- Task failed/incomplete → resume with `session_id` + a tight "Fix: <error>" prompt.
+- Follow-up on a result → resume with `session_id` + "Also: <question>".
+- Multi-turn with the same agent → always resume. Never start fresh.
+
+Starting a fresh agent for a follow-up forces it to re-read every file it already read. That's 70%+ wasted tokens, plus the agent loses the reasoning it built up.
+
+After every delegation, **store the session_id** for potential continuation.
+
+## Skill nudges to delegates
+
+Sub-agents have their own skills. Nudge them in the CONTEXT section:
+
+> "Load `code-review` before evaluating the diff."
+> "Load `frontend-ui-ux` before editing component files."
+> "Load `git-master` before touching history."
+
+A one-line nudge saves the delegate a `skill__list` turn.
+
+## Verification after delegation
+
+A delegation is NOT complete when the sub-agent returns. It is complete when YOU have verified:
+
+1. Did it work as expected? (Did the file change? Did the test pass?)
+2. Did it follow existing codebase patterns?
+3. Did the EXPECTED OUTCOME actually materialize?
+4. Did it respect MUST DO and MUST NOT DO?
+
+If any answer is no → resume the session with a corrective prompt. Do not re-spawn from scratch.
+
+## Anti-patterns
+
+- "Follow existing patterns" with no snippet → agent guesses, often wrong
+- Multi-goal prompts → agent does the easy one, skips the rest
+- Missing MUST NOT DO → agent over-reaches into unrelated files
+- Discarding session_id on failure → forced re-exploration, wasted tokens
+- Re-spawning instead of resuming for a 1-line fix → 10x cost
@@ -0,0 +1,81 @@
+---
+description: Discipline for when and how to consult Oracle - blocking by design, never deliver an answer with Oracle pending, never bypass Oracle for design questions.
+---
+Oracle is your read-only, high-IQ advisor. Using it correctly is the difference between shipping the right thing slowly and shipping the wrong thing fast.
+
+## When you MUST consult Oracle
+
+Spawn `oracle` (do NOT answer yourself) any time the user asks:
+
+- "How should I..." / "What's the best way to..." — design/approach questions
+- "Why does X keep..." / "What's wrong with..." — complex debugging (not simple errors)
+- "Should I use X or Y?" — technology or pattern choices
+- "How should this be structured?" — architecture and organization
+- "Review this" / "What do you think of..." — code/design review
+- Tradeoff questions — performance vs readability, complexity vs flexibility
+- Multi-component questions — anything spanning 3+ files or modules
+- Vague/open-ended — "improve this", "make this better", "clean this up"
+- After 2+ failed fix attempts on the same problem — complex debugging
+
+Even if you think you know the answer, Oracle provides deeper, more thorough analysis. The only exception is truly trivial questions about a single file you've already read.
+
+## Oracle is BLOCKING by design
+
+The orchestrator (you) has paused work and CANNOT proceed until Oracle returns. This is intentional. The cost of Oracle's latency is paid so YOU get a thorough, considered answer rather than rushing in a wrong direction.
+
+Therefore:
+
+- **Do NOT implement before Oracle returns** if your implementation depends on Oracle's recommendation.
+- **Do NOT deliver the final user-facing answer** while Oracle is still running.
+- **Do NOT "time out and continue anyway"** for Oracle-dependent tasks.
+- While waiting, do only NON-OVERLAPPING prep work (work that doesn't depend on Oracle's verdict).
+
+## How to consult Oracle effectively
+
+Oracle has not seen the codebase or the conversation. Give it enough context to think:
+
+```
+## Question
+[The decision you need help with, stated as a question]
+
+## Background
+[Why this question matters now. What constraint or trigger raised it.]
+
+## Code context
+[Paste the actual snippets from prior exploration — file paths alone are not enough]
+- From `path/to/file.ext`:
+  <relevant 5-20 lines>
+
+## What you've considered
+[Options you've already weighed and their tradeoffs as you see them]
+
+## What I'd love Oracle to evaluate
+[Specific aspects: correctness, performance, security, future flexibility, etc.]
+```
+
+A well-scoped Oracle consult returns a tighter answer faster.
+
+## After Oracle returns
+
+1. Read the recommendation, reasoning, and risks sections carefully.
+2. If the recommendation conflicts with your prior plan, update the plan — do not silently ignore Oracle.
+3. Pass Oracle's recommendation (and reasoning) to the implementer (e.g., coder) as CONTEXT in your delegation.
+4. If you disagree with Oracle's verdict, raise it with the user before implementing the alternative — don't act unilaterally against Oracle's advice.
+
+## When NOT to consult Oracle
+
+- Simple file operations you can do with direct tools
+- First attempt at any fix (try yourself first; consult after 2 failures)
+- Questions answerable from code you've already read
+- Trivial decisions (variable names in small functions, formatting)
+- Things you can infer from existing code patterns
+
+Over-consultation wastes Oracle's budget and slows the work. Reserve Oracle for genuinely hard or load-bearing decisions.
+
+## Anti-patterns (BLOCKING)
+
+- Answering an architecture question yourself "just this once"
+- Delivering a user-facing answer while Oracle is still running
+- Implementing the obvious approach without consulting Oracle on a tradeoff question
+- Ignoring Oracle's recommendation because it's inconvenient
+- Polling `agent__collect` on a running Oracle (end your response, wait for notification)
@@ -0,0 +1,70 @@
+---
+description: Fan-out exploration protocol — fire multiple research agents in parallel, wait for completion notifications, and never duplicate delegated work.
+---
+You are entering a research phase. Exploration is parallelizable; serial reads leave throughput on the table.
+
+## Fan out, don't read serially
+
+For any non-trivial codebase question, fire 2-5 `explore` agents in parallel, each scoped to a different angle:
+
+- Auth implementation? → one for routes, one for middleware, one for token handling, one for error response shape.
+- Bug investigation? → one for the failing path, one for similar working paths, one for recent changes near the area.
+
+Each agent gets a NARROW slice. Narrow scope = fast, focused result. Broad scope = the agent over-reads and returns a wall of text.
+
+## The wait protocol
+
+After spawning background agents:
+
+1. If you have **non-overlapping** work to do (work that doesn't depend on the delegated research), do it now.
+2. If you don't, **end your response.** Do not call `agent__collect` immediately — the agent is still running.
+3. The system notifies you when the agent completes (`pending_escalations` or completion event).
+4. On notification, call `agent__collect` to retrieve results.
+
+Polling `agent__collect` on a still-running agent blocks your turn for nothing.
+
+## Anti-duplication rule (BLOCKING)
+
+Once you delegate a search to an `explore` agent, **do not perform that same search yourself.**
+
+Forbidden:
+- After firing `explore` for "auth middleware", running `fs_grep` for "auth middleware" yourself
+- "Just quickly checking" the same files the delegate is checking
+- Re-doing the research while waiting impatiently
+
+Allowed:
+- Non-overlapping work in a different module
+- Preparation work that doesn't depend on the delegated result
+- Ending your response and waiting
+
+Duplicate searches waste tokens, may contradict the delegate, and defeat the point of parallelism.
+
+## Stop conditions
+
+Stop searching when:
+
+- The same information appears across multiple sources
+- Two search iterations yield no new useful data
+- A direct answer was found
+- You have enough context to proceed confidently
+
+Over-exploration is as bad as under-exploration. Time spent searching is time not spent shipping.
+
+## Parallel + sequential composition
+
+It is fine to fire `explore` and then `oracle` when oracle needs the explore results — just sequence them:
+
+1. Fire explore(s) in parallel.
+2. End response, wait for completion.
+3. Synthesize findings, fire `oracle` with those findings as CONTEXT.
+4. End response, wait for oracle.
+5. Act on oracle's recommendation.
+
+Don't fire oracle blind to "save a turn" — it will give worse advice.
+
+## Anti-patterns
+
+- One huge "explore everything about X" agent → slow, unfocused result
+- Serial explores ("wait for first, then fire next") → unnecessary latency
+- Firing 8+ parallel agents → diminishing returns, harder to synthesize
+- Calling `agent__collect` immediately after spawn → wastes a turn
@@ -0,0 +1,66 @@
+---
+description: Evidence requirements before claiming completion — diagnostics, build exit code, tests. No completion without proof. Grants shell access for running build/test commands.
+enabled_tools: execute_command
+---
+You are about to mark work complete. Before claiming "done," produce evidence. "I'm fairly confident it works" is not evidence.
+
+## Hard gates
+
+A task is NOT complete until:
+
+| Change kind | Required evidence |
+|---|---|
+| File edit | Read the file to confirm the change landed; output is clean (or only pre-existing issues, explicitly noted) |
+| Build command exists | `execute_command` the build; exit code 0 |
+| Test command exists | `execute_command` the tests; pass (or explicit note of pre-existing failures unrelated to this change) |
+| Delegation | The delegate's result was received AND verified against your acceptance criteria |
+
+**No evidence = not complete.** Marking a todo done without evidence is dishonest reporting.
+
+## The verification loop
+
+After every meaningful edit:
+
+1. Read the changed file region (confirm the change actually landed where intended).
+2. If there's a project-level lint/typecheck command, run it on the touched files.
+3. Run the project's build/check command if one exists.
+4. Run the project's test command if one exists.
+5. Only then mark the corresponding todo `completed`.
+
+If any step fails: do not mark complete. Fix the issue or surface it explicitly.
+
+## Build/test detection (fallback)
+
+If no build/test command is configured, try standard ones for the project:
+
+- Rust: `cargo check`, `cargo test`
+- Node/TS: `npm run build`, `npm test`, or `pnpm` / `yarn` equivalents
+- Python: `pytest`, `python -m mypy <pkg>`, `ruff check`
+- Go: `go build ./...`, `go test ./...`
+
+Run from the project root. Capture exit codes.
+
+## Distinguishing your failures from pre-existing failures
+
+If build or tests fail, identify the cause:
+
+- Caused by your change? → fix it before reporting complete.
+- Pre-existing (unrelated)? → note it explicitly: "Done. Build passes. Note: 3 lint errors pre-existing in unrelated files, not touched."
+
+Never silently leave broken state behind. Never delete a failing test to make CI green.
+
+## Anti-patterns (BLOCKING)
+
+- "It should work" without running anything
+- Marking a todo complete based on intent, not verified outcome
+- Suppressing errors with `@ts-ignore`, `as any`, `#[allow(...)]` on unfamiliar lints, empty catch blocks
+- Deleting failing tests to "pass"
+- Reporting "all green" when you only ran a subset
+
+## Reporting completion
+
+When the work is verifiably done, report in one sentence:
+
+> "Done. Build passes, 47 tests pass. Modified `auth.rs:42-58` to add JWT validation."
+
+Not a paragraph. Not a victory lap. Specific, terse, evidence-backed.