feat: harden autonomy lifecycle, OOM bounds, and provider-boundary finalization

This PR consolidates a coordinated batch of fixes around autonomy run/flow lifecycle, scheduled task deduplication, provider-boundary state finalization, and matching memory-bound treatments for adjacent long-running subsystems (REPL fullscreen scrollback, skill-search/skill-learning runtime activation). All changes were developed and reviewed together because they touched the same lifecycle invariants and were uncovered by the same long-running session reproductions. ## Lifecycle correctness - Queued autonomy prompts are not injected unless the persisted run was successfully claimed; queued run claiming is now terminal-safe so a once-consumed/cancelled/failed run can not slip back into `queued`. - Autonomy run/flow finalization happens on completion, provider error, generator close, and cancellation — not just the happy path. New `src/__tests__/queryAutonomyProviderBoundary.test.ts` covers these provider-boundary transitions. - `requestManagedAutonomyFlowCancel` and `resumeManagedAutonomyFlowPrompt` carry `rootDir` and `currentDir` explicitly across detached async boundaries (proactive-tick, cron, daemon restart) instead of inferring from process state. - Active runs/flows are protected from janitor pruning so a running step can not be garbage-collected mid-flight (`src/utils/autonomyAuthority.ts`). - Heartbeat parser now ignores fenced code blocks; the two-phase commit window for autonomy state transitions is documented in `docs/internals/autonomy-jira.md`. ## Ownership and dedup - `src/utils/autonomyRuns.ts`: ownership stamping (run id + rootDir carried end-to-end), source-based dedup against active runs. - `src/hooks/useScheduledTasks.ts`: scheduled ticks deduplicate against runs already active on the same source label. - `src/utils/processUserInput/processSlashCommand.tsx`: forked slash commands now thread the autonomy `runId` so completion finalizers can find the originating run for deferred completion. - New `src/utils/autonomyQueueLifecycle.ts` and tests collect the queue-side lifecycle invariants in one place. ## Memory bounds (related, same review pass) - `src/screens/REPL.tsx`: caps fullscreen scrollback after the compact boundary and updates trailing progress rows in place. Long-running fullscreen sessions could otherwise retain thousands of post-compaction messages and duplicate progress rows, keeping Ink trees alive long after their useful context had moved on. - `src/services/skillSearch/*` and `src/services/skillLearning/*`: runtime activation is strictly opt-in via existing env toggles; session caches are capped so long-running processes can not grow them forever. Build presence is preserved so operators can still discover and opt into the slash commands. ## CI / test contract - `tests/integration/dependency-overrides.test.ts`: smoke test no longer drives Mermaid's browser renderer; it validates the package-resolution contract directly so CI does not regress on unrelated browser timing. - New `tests/integration/autonomy-lifecycle-user-flow.test.ts`: end-to-end CLI subprocess flow exercising `status --deep`, `flows`, `flow <id>`, `flow resume`, `flow cancel` against persisted state. - `src/entrypoints/cli.tsx`: `claude autonomy …` routes through an entrypoint fast path that reuses the slash-command formatter without booting the full interactive CLI. Stdout is flushed before forced exit so coverage subprocesses do not terminate with empty stdout. - `packages/builtin-tools/src/tools/RemoteTriggerTool/__tests__/RemoteTriggerTool.test.ts`: stabilized to prevent audit flake under coverage. ## Tests added - `src/__tests__/queryAutonomyProviderBoundary.test.ts` - `src/hooks/__tests__/useScheduledTasks.test.ts` - `src/utils/__tests__/autonomyAuthority.test.ts` - `src/utils/__tests__/autonomyFlows.test.ts` (extended) - `src/utils/__tests__/autonomyPersistence.test.ts` (extended) - `src/utils/__tests__/autonomyQueueLifecycle.test.ts` - `src/utils/__tests__/autonomyRuns.test.ts` (extended) - `src/utils/processUserInput/__tests__/processSlashCommand.test.ts` - `tests/integration/autonomy-lifecycle-user-flow.test.ts` ## Docs - `docs/agent/sur-loop-scheduled-oom.md`: System Understanding Report covering the scheduled/loop OOM problem, the call graphs investigated, and the lifecycle invariants this PR establishes. - `docs/agent/sur-skill-overflow-bugs.md`: SUR for the related skill-overflow context. - `docs/internals/autonomy-jira.md`: documents the two-phase commit window and ownership stamping invariants. - `docs/memory-leak-audit.md`: audit notes covering the REPL/scrollback and skill-search bounds. ## Invariants this PR establishes 1. Queued autonomy prompts are not injected unless the persisted run was successfully claimed. 2. Terminal run/flow states are terminal — completion, failure, and cancellation all finalize state regardless of which provider/error path triggered them. 3. Autonomy run/flow `rootDir` is carried explicitly across detached async boundaries instead of inferred from a shared singleton. 4. State-only CLI subcommands (`autonomy status|runs|flows|flow …`) bypass full interactive bootstrap so they do not hold unrelated handles open. 5. REPL fullscreen scrollback and skill-search/skill-learning session caches are explicitly bounded. ## Validation ```bash bun run typecheck CI=true GITHUB_ACTIONS=true bun test # 3996 pass / 0 fail across 305 files bun test src/__tests__/queryAutonomyProviderBoundary.test.ts \ src/hooks/__tests__/useScheduledTasks.test.ts \ src/utils/__tests__/autonomy{Runs,Flows,Authority,QueueLifecycle,Persistence}.test.ts \ src/utils/processUserInput/__tests__/processSlashCommand.test.ts \ tests/integration/autonomy-lifecycle-user-flow.test.ts ``` ## Origin This PR is the consolidated, upstream-targeted version of two fork-side review PRs (fix/loop-scheduled-autonomy-oom and fix/autonomy-lifecycle). The fork-side review history is preserved at https://github.com/amDosion/claude-code-bast/pull/7 . The fork's own internal `chore: keep fork current with upstream` sync commits and the `docs: update contributors` automation are intentionally not included in this PR. The autonomy CLI handler `rootDir` threading that the fork added (78f64d8a, 98d04ddb) is intentionally omitted here because upstream `a2cfaf91` (fix: 修复 RemoteTriggerTool 和 autonomy 测试的全量运行失败) already performed the equivalent change with an additional `currentDir` option. Keeping the upstream version avoids regressing that improvement.
2026-06-15 12:55:51 +00:00 · 2026-04-29 14:04:27 +08:00
parent 4f1649e249
commit f2e9af4927
51 changed files with 4885 additions and 971 deletions
--- a/docs/agent/sur-loop-scheduled-oom.md
+++ b/docs/agent/sur-loop-scheduled-oom.md
@@ -0,0 +1,492 @@
+# System Understanding Report — Loop / Scheduled Autonomy OOM
+
+- **Flow id**: `recurring-bug-loop-oom` (pilot flow for autonomy ↔ deep-debug binding)
+- **Branch**: `fix/loop-scheduled-autonomy-oom`
+- **Worktree**: `E:\Source_code\Claude-code-bast-loop-scheduled-oom-fix`
+- **Author**: back-filled from existing working-tree diff (no commits ahead of `main`)
+- **Status**: `report` (this document) — pending human approval before `regression-test` advances
+
+---
+
+## 1. Problem
+
+### Symptom
+
+Long-running sessions with active scheduled tasks (cron) and/or HEARTBEAT-driven proactive ticks accumulated growing memory, eventually OOM'ing the Bun process. The visible signature was:
+
+- `runs.json` under `.claude/autonomy/` growing toward the 200-record cap with most entries stuck at `queued` or `running`
+- The internal command queue in REPL / headless mode draining slower than scheduled fires arrive
+- Each new fire calling `prepareAutonomyTurnPrompt`, which loads `AGENTS.md` + `HEARTBEAT.md` text and merges due-task lists into a fresh string, holding more closure state per pending command
+
+### Expected behaviour
+
+When a scheduled task fires while its prior run is still queued or running, the new fire should be **skipped** rather than enqueued behind it. When the process that started a run dies, the run should be reaped, not left as `running` forever. Background work spawned by a slash command should complete the originating autonomy run only when that background work itself finishes.
+
+### Actual behaviour (before fix)
+
+1. `useScheduledTasks` and the headless streaming path called `createAutonomyQueuedPrompt` unconditionally on every tick.
+2. `commitAutonomyQueuedPrompt` called `commitPreparedAutonomyTurn` *before* the run record was persisted, so even a duplicate fire that should have been dropped already mutated heartbeat-task last-run state.
+3. `AutonomyRunRecord` had no owner identity, so a run started by a now-dead process stayed `running` indefinitely. Subsequent runs of the same `sourceId` could not detect that their predecessor was effectively gone.
+4. Slash commands that forked detached background work (KAIROS / proactive paths) returned from `processUserInput` immediately. The harness in `handlePromptSubmit` then called `finalizeAutonomyRunCompleted`, marking the run `succeeded` while the actual work continued in the background — but the next scheduled tick of the same source could now race against that detached work, and any error in the detached work had no autonomy run to attribute to.
+
+### Reproduction shape
+
+Not a single deterministic repro — load-induced. Rough recipe:
+
+- Configure two `HEARTBEAT.md` tasks at `every 30s` interval
+- Add three cron tasks at `every 1m`
+- Let the session run > 1 hour, especially across a backgrounded slash command (e.g. KAIROS `/sleep`-style detached fork)
+- Watch `.claude/autonomy/runs.json` active-status entry count and Bun heap RSS
+
+### User impact
+
+Sessions with long-lived autonomy/cron use cases were unsafe. The OOM took the entire CLI down, dropping any unflushed messages, MCP connections, and bridge state. Because `.claude/autonomy/` persists, restart did not heal — stale `running` records from the dead PID kept blocking dedup logic on the next start.
+
+---
+
+## 2. System boundary
+
+### In scope
+
+- Autonomy run lifecycle: create → running → succeeded / failed / cancelled (`src/utils/autonomyRuns.ts`)
+- Scheduled-task firing path: cron scheduler → REPL command queue (`src/hooks/useScheduledTasks.ts`)
+- Headless streaming variant of the same path (`src/cli/print.ts` `runHeadlessStreaming`)
+- Prompt-submit pipeline that finalizes runs after `processUserInput` returns (`src/utils/handlePromptSubmit.ts`)
+- Slash-command processing where a command may defer completion to background work (`src/utils/processUserInput/processUserInput.ts`, `processSlashCommand.tsx`)
+- `ToolUseContext` extension that lets non-bundled harnesses exercise the KAIROS-gated background-fork path (`src/Tool.ts`)
+
+### Out of scope
+
+- The cron scheduler itself (`src/utils/cronScheduler.ts`) — its tick semantics are not changing
+- `autonomyFlows.ts` flow state machine — separate from per-run tracking
+- HEARTBEAT.md scheduling semantics — unchanged. `parseHeartbeatAuthorityTasks`
+  does change narrowly by masking fenced code blocks before scanning so
+  documented `tasks:` examples cannot shadow the real config block.
+- `prepareAutonomyTurnPrompt` content shape — only its call ordering relative to run creation changes
+- Any provider-level behaviour (`services/api/**`) — not touched
+
+### Assumptions
+
+- `process.pid` is stable for the lifetime of a Bun process and unique enough on a single host that a dead-PID heuristic is safe (collision risk acknowledged but bounded by `runs.json` retention).
+- `isProcessRunning(pid)` (from `genericProcessUtils.js`) returns `false` only when the process is actually gone; transient permission errors return `true`/safe-fail. Verified in step 6.
+- `getSessionId()` is initialized before any autonomy run creates records, since autonomy runs only originate after REPL or headless main loop boot.
+
+---
+
+## 3. Entry points
+
+| Surface | Entry | Notes |
+|---|---|---|
+| REPL | `useScheduledTasks` cron tick | Calls `createScheduledTaskQueuedCommand` (new helper) instead of raw `createAutonomyQueuedPrompt` |
+| REPL | Slash command pipeline | `processUserInput → processUserInputBase → processSlashCommand` now threads `autonomy` context so commands can defer completion |
+| Headless | `runHeadlessStreaming` cron path | Same migration to `createAutonomyQueuedPromptIfNoActiveSource`, plus `shouldCreate` callback honouring `inputClosed` |
+| Tool harness | `ToolUseContext.options.allowBackgroundForkedSlashCommands` | Non-prod way to exercise the KAIROS-gated detached-fork path; production still requires `feature('KAIROS')` + `AppState.kairosEnabled` |
+| Persistence | `.claude/autonomy/runs.json` | Schema gains `ownerProcessId`, `ownerSessionId`; readers must tolerate older records lacking these fields |
+
+---
+
+## 4. Key files
+
+| File | Lines changed | Why it matters |
+|---|---|---|
+| `src/utils/autonomyRuns.ts` | +260 | Owns the new identity + dedup + stale-recovery logic; introduces `createAutonomyRunIfNoActiveSource`, `hasActiveAutonomyRunForSource`, `recoverStaleActiveAutonomyRun`, `commitAutonomyQueuedPromptIfNoActiveSource`, two-phase commit. The structural heart of the fix. |
+| `src/utils/processUserInput/processSlashCommand.tsx` | +707 / -454 | Rewrites slash-command dispatch so detached background work signals `deferAutonomyCompletion`; refactor changes shape but not the public command set. |
+| `src/hooks/useScheduledTasks.ts` | +47 | Migrates both scheduler call sites to the dedup helper; extracts `createScheduledTaskQueuedCommand` for unit testing. |
+| `src/cli/print.ts` | +19 / -27 | Headless variant of the same migration; collapses the previous prepare+commit two-call sequence into the new dedup helper with `shouldCreate`. |
+| `src/utils/handlePromptSubmit.ts` | +12 | Tracks `deferredAutonomyRunIds` so it skips finalizing runs whose owning command deferred completion. |
+| `src/utils/processUserInput/processUserInput.ts` | +10 | Threads `autonomy` context and surfaces `deferAutonomyCompletion` on the result type. |
+| `src/Tool.ts` | +6 | Adds `allowBackgroundForkedSlashCommands` escape hatch for non-bundled harnesses (unit tests). |
+| `src/utils/__tests__/autonomyRuns.test.ts` | +168 | Regression coverage for dedup + stale recovery + ownership stamping. |
+| `src/hooks/__tests__/useScheduledTasks.test.ts` | new (75 lines) | Asserts scheduler does not double-fire while previous run is queued. |
+| `src/utils/processUserInput/__tests__/processSlashCommand.test.ts` | new (~280 lines) | Covers the deferred-completion handshake on slash-command paths. |
+
+---
+
+## 5. Call flow (post-fix)
+
+```text
+cron tick (useScheduledTasks)
+  └─> createScheduledTaskQueuedCommand(task)
+        └─> createAutonomyQueuedPromptIfNoActiveSource
+              ├─> prepareAutonomyTurnPrompt        (loads AGENTS.md + HEARTBEAT.md)
+              ├─> shouldCreate?  ──► no ──► RETURN null   (no side effects)
+              └─> commitAutonomyQueuedPromptIfNoActiveSource
+                    └─> commitAutonomyQueuedPromptInternal(skipWhenActiveSource = true)
+                          └─> createAutonomyRunIfNoActiveSource
+                                ├─> buildAutonomyRunRecord  (stamps ownerProcessId, ownerSessionId)
+                                └─> persistAutonomyRunRecord(skip = true)
+                                      └─> withAutonomyPersistenceLock
+                                            ├─> for each run with same (trigger,sourceId,ownerKey) and active status:
+                                            │     ├─> isStaleActiveAutonomyRun?  ──► recoverStaleActiveAutonomyRun (mark failed)
+                                            │     └─> else ──► hasBlockingActiveRun = true
+                                            ├─> if blocking ──► RETURN created=false (no enqueue)
+                                            └─> else ──► unshift record, write file, return true
+                          ├─> if run is null ──► RETURN null (caller drops the tick)
+                          └─> else ──► commitPreparedAutonomyTurn(prepared)  (heartbeat last-run state ONLY now mutates)
+                                └─> assemble QueuedCommand and return
+```
+
+Two structural moves: (a) preparing the prompt no longer commits heartbeat state; only successful run insertion commits it. (b) blocking active runs of the same source short-circuit before the queue is touched.
+
+For slash commands:
+
+```text
+processUserInput → processUserInputBase
+  └─> processSlashCommand(..., autonomy = cmd.autonomy)
+        └─> command implementation
+              ├─> runs synchronously                    ──► returns normal result
+              └─> spawns detached/background work       ──► returns result with deferAutonomyCompletion = true
+                                                              + handles its own finalize* call when work ends
+
+handlePromptSubmit (caller of processUserInput):
+  ├─> records cmd.autonomy.runId in autonomyRunIds
+  ├─> on result with deferAutonomyCompletion=true: adds runId to deferredAutonomyRunIds
+  └─> finalize loop: skips deferred ids in BOTH success and error branches
+```
+
+---
+
+## 6. Data flow
+
+### `runs.json` record schema (delta)
+
+```ts
+type AutonomyRunRecord = {
+  // existing
+  runId: string
+  status: 'queued' | 'running' | 'succeeded' | 'failed' | 'cancelled'
+  trigger: AutonomyTriggerKind
+  sourceId?: string
+  ownerKey?: string
+  // new
+  ownerProcessId?: number     // process.pid at create time and at markRunning time
+  ownerSessionId?: string     // getSessionId() at the same points
+  // ...
+}
+```
+
+Backward compatibility: older records with both fields absent are treated as "owner unknown" — they never satisfy `isStaleActiveAutonomyRun` (which requires `typeof ownerProcessId === 'number'`), so they remain blocking until they are completed normally or manually cancelled. This is intentional: we cannot prove they are stale.
+
+### Stale-recovery rule
+
+```text
+isStaleActiveAutonomyRun(run) ⇔
+    run.status ∈ {queued, running}
+  ∧ typeof run.ownerProcessId === 'number'
+  ∧ !isProcessRunning(run.ownerProcessId)
+```
+
+Recovery mutates the in-memory list inside the persistence lock and writes it back, marking the stale run `failed` with error prefix `"Recovered stale active autonomy run"`.
+
+### Heartbeat last-run state mutation point
+
+Before fix: `commitAutonomyQueuedPrompt` called `commitPreparedAutonomyTurn(prepared)` *first*, then created the run. A skipped duplicate already advanced heartbeat last-run timestamps.
+
+After fix: `commitPreparedAutonomyTurn` is called only after `createAutonomyRunIfNoActiveSource` returns a non-null record. Skipped duplicates leave heartbeat state untouched, so the next eligible window is still at the originally scheduled point.
+
+---
+
+## 7. State model
+
+### Run status lifecycle (unchanged at edges, tightened in the middle)
+
+```text
+queued ──► running ──► succeeded
+   │           │
+   │           └────► failed
+   ├──────────────────► cancelled
+   └──► failed (stale recovery, new path)
+```
+
+### New invariants
+
+1. **Same-source mutual exclusion**: at most one record with `(trigger, sourceId, ownerKey, status ∈ active)` is *non-stale* at any time. Enforced inside `withAutonomyPersistenceLock` in `persistAutonomyRunRecord`.
+
+2. **Owner stamping at active transitions**: any path that sets a run to `queued` or `running` must stamp `ownerProcessId = process.pid` and `ownerSessionId = getSessionId()`. `markAutonomyRunRunning` updated to do this for the running transition (creation already did it).
+
+3. **Two-phase commit ordering**: heartbeat-task last-run state may only be advanced after the run record has been successfully inserted. Equivalent to "prompt commit ⇒ run row exists".
+
+4. **Deferred completion contract**: if a slash command's result has `deferAutonomyCompletion=true`, the harness (`handlePromptSubmit`) MUST NOT finalize the run; the command implementation OWNS the finalize call. Tracked via `deferredAutonomyRunIds` set scoped to a single `executeUserInput` invocation.
+
+### Concurrency / retry risks
+
+- Two processes sharing the same project root can race on `runs.json`. Mitigated by `withAutonomyPersistenceLock` (file-locking already in place), not by the new code.
+- Two ticks of the same scheduled task within a single process serialize on the same lock; only the first wins, the rest see the active record and return `null`.
+- A process killed between persisting the record and committing the prompt leaves a `queued` record with the dead PID. Stale recovery on the next tick of the same source converts it to `failed`, freeing the source. This is the new safety net.
+
+### Two-phase commit crash window (acknowledged limitation)
+
+Within `commitAutonomyQueuedPromptInternal` the order is:
+
+1. `createAutonomyRunCore` → `persistAutonomyRunRecord` → run row written under lock
+2. `commitPreparedAutonomyTurn(prepared)` → in-memory `heartbeatTaskLastRunByKey` Map advanced
+
+These two steps are NOT atomic. If the process is killed between (1) and (2):
+
+- `runs.json` has a fresh `queued` record stamped with the now-dead PID.
+- `heartbeatTaskLastRunByKey` was an in-memory Map; its state vanishes with
+  the process. On restart the Map is empty.
+- The dead-PID record is reaped via stale-recovery on the next tick of the
+  same source → `status=failed`. New record can be created.
+- Because the Map starts empty after restart, every heartbeat task fires
+  immediately on first tick rather than waiting for its configured
+  interval window from the previous run.
+
+**Severity**: low. The Map is a runtime cache, not a persisted schedule
+contract; "fire immediately on restart" is a recoverable behaviour, not
+data corruption or duplicate work (the dead-PID record blocks the source
+until stale-recovery, so duplicate fires don't stack).
+
+**Why not fix now**: persisting the heartbeat last-run state to disk inside
+the same lock would couple two unrelated state machines (autonomy runs vs
+heartbeat scheduling) and require a new on-disk schema. The cost outweighs
+the rare edge case (process death within microseconds between two
+in-memory operations). Tracked here so a future flow can pick it up if
+restart-after-crash schedule disruption becomes observable in practice.
+
+---
+
+## 8. Existing tests
+
+### Pre-fix
+
+- `src/utils/__tests__/autonomyRuns.test.ts` covered create / list / mark transitions for the basic happy path.
+- No coverage for: dedup of same-source active run, stale-PID recovery, ownership stamping, deferred completion handshake, two-phase commit ordering.
+- `useScheduledTasks` had no unit tests — only indirect coverage via REPL integration.
+- `processSlashCommand` had no autonomy-context coverage.
+
+### Added in this branch
+
+- `src/utils/__tests__/autonomyRuns.test.ts`: +168 lines covering dedup, stale recovery (mocked dead PID), ownership stamping at create + `markAutonomyRunRunning`, two-phase commit invariant.
+- `src/hooks/__tests__/useScheduledTasks.test.ts`: new file, 75 lines. Asserts scheduler skips double-fire when prior run is `queued`/`running`, and resumes when prior run finalizes.
+- `src/utils/processUserInput/__tests__/processSlashCommand.test.ts`: new file, ~280 lines. Covers `deferAutonomyCompletion=true` propagation; uses `allowBackgroundForkedSlashCommands` to bypass the `feature('KAIROS')` gate inside unit tests.
+
+### Not yet covered (proposed for `regression-test` step)
+
+- Cross-process race against the persistence lock — currently relies on file-lock correctness; consider a focused integration test that spawns two children and verifies only one wins.
+- Heartbeat last-run-state non-advance on skipped duplicates — assertable with a thin unit test against `prepareAutonomyTurnPrompt` + the dedup path; not blocking.
+
+---
+
+## 9. Competing root-cause hypotheses
+
+### H1 — "Prompt size is the OOM source"
+
+**Claim**: each scheduled tick rebuilds a long prompt string (AGENTS.md + HEARTBEAT.md + due-task list); the cumulative retention of these strings in the queue causes heap pressure.
+
+**Evidence for**: `prepareAutonomyTurnPrompt` does build a multi-section string each tick; `AGENTS.md` in this repo is now 220 lines.
+
+**Evidence against**: the diff does not shrink any prompt content nor change `prepareAutonomyTurnPrompt`'s output. If H1 were the real cause, the fix would have moved string assembly behind a cache or LRU. The fix instead targets the *number* of in-flight runs.
+
+**Verdict**: contributing factor at most. Rejected as primary root cause.
+
+### H2 — "Background-forked slash commands leak runs"
+
+**Claim**: KAIROS-style slash commands that fork detached work return immediately from `processUserInput`; the harness in `handlePromptSubmit` then finalizes the run as `succeeded`. Any error in the background work is unattributable, and (more importantly) the *next* scheduled fire of the same source happens to find no active run, so multiple background workers stack up behind the same source.
+
+**Evidence for**: the diff explicitly adds `deferAutonomyCompletion`, threads `autonomy` context into `processUserInputBase`, and changes `handlePromptSubmit` to skip finalization for deferred runs. New test file `processSlashCommand.test.ts` is dedicated to this exact handshake.
+
+**Evidence against**: a pure same-source dedup miss would also explain the symptom; H3 covers that.
+
+**Verdict**: real and load-bearing. Confirmed by the targeted code added.
+
+### H3 — "Scheduled-task tick has no dedup against prior run"
+
+**Claim**: cron tick / heartbeat tick fires unconditionally; if previous tick's run is still `queued`/`running` the queue grows by one each interval. Compounded across multiple sources, queue + `runs.json` active subset never shrink.
+
+**Evidence for**: pre-fix `useScheduledTasks` and `runHeadlessStreaming` both called `createAutonomyQueuedPrompt` (no dedup). Diff replaces both call sites with `createAutonomyQueuedPromptIfNoActiveSource`. Persistence-side dedup added in the same change.
+
+**Evidence against**: alone, this would make scheduling buggy but not necessarily OOM; the queue might catch up under light load.
+
+**Verdict**: real and load-bearing. Confirmed by the targeted code added.
+
+### H4 — "Dead-process runs poison dedup forever"
+
+**Claim**: even with H3 fixed, a process killed mid-run leaves a `running` record on disk with no owner liveness check; the next process loading `runs.json` would treat it as blocking and never schedule that source again.
+
+**Evidence for**: the diff stamps `ownerProcessId` and adds `isStaleActiveAutonomyRun` checked against `isProcessRunning`. Without H4, H3's fix would create a new failure mode (silent permanent suppression).
+
+**Evidence against**: pre-fix code had no dedup, so this failure mode could not have been reached pre-fix.
+
+**Verdict**: real, but secondary. It exists because H3's fix introduces it. Required to ship together.
+
+---
+
+## 10. Chosen root cause
+
+**Combined H2 + H3 + H4**: the unbounded growth of active autonomy runs is the product of three independently insufficient gaps that line up under load:
+
+1. Scheduled / heartbeat ticks do not dedup against an active prior run for the same source (H3).
+2. Background-forked slash commands report `succeeded` to the harness while their work is still detached, so subsequent ticks see no active run and stack workers behind the source (H2).
+3. Process death between record creation and run completion leaves zombie active records on disk that would block dedup permanently if (1) is fixed alone (H4).
+
+Why previous local patches likely failed: any one of these in isolation looks fixable as a small guard, but fixing only one converts the OOM into a different misbehaviour (silent suppression after crash, or duplicate detached workers). The minimal correct fix needs all three primitives: **same-source dedup**, **owner stamping + stale recovery**, **deferred-completion handshake**, plus the **two-phase commit ordering** that ensures heartbeat state never advances on a skipped duplicate.
+
+---
+
+## 11. Fix plan
+
+### Minimal fix surface
+
+| Module | Change | Reason |
+|---|---|---|
+| `autonomyRuns.ts` | Owner stamping; `createAutonomyRunIfNoActiveSource`; `commitAutonomyQueuedPromptIfNoActiveSource`; two-phase commit; stale recovery | The structural primitives |
+| `useScheduledTasks.ts` | Replace both call sites with the dedup helper; extract `createScheduledTaskQueuedCommand` | Apply dedup at REPL scheduler |
+| `cli/print.ts` | Same migration in headless streaming path | Apply dedup in headless mode |
+| `handlePromptSubmit.ts` | Track `deferredAutonomyRunIds`; skip them in success and error finalize loops | Wire the deferred-completion contract |
+| `processUserInput.ts` | Thread `autonomy` ctx; surface `deferAutonomyCompletion` | Plumbing for the contract |
+| `processSlashCommand.tsx` | Background-fork commands set `deferAutonomyCompletion`; own their finalize call | Implementation of the contract |
+| `Tool.ts` | `allowBackgroundForkedSlashCommands` flag on `ToolUseContext.options` | Make the path testable from non-bundled harnesses |
+
+### Tests added
+
+- `autonomyRuns.test.ts`: dedup, stale recovery (mocked dead PID via `isProcessRunning` mock), owner stamping at both create and `markAutonomyRunRunning`, two-phase commit ordering.
+- `useScheduledTasks.test.ts`: scheduler skips double-fire, resumes after finalize.
+- `processSlashCommand.test.ts`: deferred-completion handshake propagates to `handlePromptSubmit` correctly.
+
+### Compatibility / migration risk
+
+- Older `runs.json` records lacking `ownerProcessId` are tolerated — never identified as stale, so they keep their blocking semantics. Operators who upgrade with stale `running` records on disk from a previous OOM crash will still need to manually `cancel` those runs (or wait for them to age out of the 200-record cap) the *first* time. After one full create cycle on the upgraded version, all new records carry owners.
+- **Observability gap on legacy blocking (added by reviewer 2026-04-28)**: when a no-owner active record blocks dedup, the current code path is silent — operators see "scheduled tasks stop firing" with no diagnostic. `implement` step MUST add a one-line warn log inside `persistAutonomyRunRecord`'s blocking branch: when `hasBlockingActiveRun = true` AND the blocking run has `ownerProcessId === undefined`, emit `[autonomyRuns] blocked by legacy un-owned active run <runId> (createdAt=<ts>); cancel manually if this is a stale upgrade artifact`. ≤ 10 lines of code, converts silent hang into a diagnosable signal. Do **not** change behavior — just observability.
+- `ToolUseContext.options.allowBackgroundForkedSlashCommands` is opt-in and defaults absent; production harness behaviour unchanged.
+- No on-disk schema version bump required.
+
+### Rollback plan
+
+- Revert the working tree to `main`'s versions of all 8 files. The `runs.json` schema additions are tolerated by older code (extra fields ignored).
+- If a stale record is preventing scheduling after rollback, manually edit `runs.json` (status → `cancelled`) or run `/autonomy flow cancel` for affected flows.
+- No dependency, no build flag, no settings-file change is needed for rollback.
+
+### Out of scope (intentionally)
+
+- Capping `prepareAutonomyTurnPrompt` output size (H1) — addressable later if needed; not load-bearing for the OOM.
+- Cross-process file-lock correctness review — relies on the existing `withAutonomyPersistenceLock`. Out of scope for this flow.
+- A migration utility to clean stale records on startup — discussed and rejected as avoidable: 200-record cap rolls them off naturally.
+
+---
+
+## 12. Verification
+
+### Commands (binding per `.claude/autonomy/AGENTS.md` §4)
+
+```bash
+bun run typecheck
+bun test src/utils/__tests__/autonomyRuns.test.ts
+bun test src/hooks/__tests__/useScheduledTasks.test.ts
+bun test src/utils/processUserInput/__tests__/processSlashCommand.test.ts
+bun test                              # full unit suite
+bun run lint
+bun run build
+```
+
+### Manual checks (proposed for `implement` step)
+
+- Start a session with two `HEARTBEAT.md` 30s tasks for ≥ 30 minutes; observe `runs.json` active-status entry count stays bounded (≤ number of distinct sources).
+- Force-kill the Bun process during a `running` record. Restart. Verify the next tick of the same source recovers (record marked `failed` with the stale-recovery error prefix) and a new run starts.
+- Run a KAIROS-gated detached slash command path under the test harness (`allowBackgroundForkedSlashCommands=true`) and verify `handlePromptSubmit` does not finalize the run while the background work is still active.
+
+### Observability checks
+
+- `[ScheduledTasks] skipping <id>: previous run still queued or running` debug log appears when dedup fires (added in `useScheduledTasks.ts`). Use it to confirm dedup is reached in real sessions.
+- `runs.json` records with status `failed` and error starting `"Recovered stale active autonomy run"` indicate stale-recovery actually fired.
+
+---
+
+## 13. Open questions
+
+1. ~~Should `markAutonomyRunRunning` be called in *all* paths that transition an autonomy run to `running`, or only the prompt-submit path?~~ **Closed (verified 2026-04-28).**
+   `markAutonomyRunRunning` (`autonomyRuns.ts:554-579`) is the **only** function that transitions `AutonomyRunRecord.status → 'running'`. It stamps `ownerProcessId = process.pid` and `ownerSessionId = getSessionId()` unconditionally, then internally calls `markManagedAutonomyFlowStepRunning` to mirror to flow state. `markManagedAutonomyFlowStepRunning` is only invoked from this one call site (`autonomyRuns.ts:571`); no caller bypasses the stamp. All four real callers (`cli/print.ts:2177`, `screens/REPL.tsx:4859`, `utils/handlePromptSubmit.ts:492`, `utils/swarm/inProcessRunner.ts:741`) go through the stamping path. Flow records intentionally do not carry owner fields — the run record is source of truth and flow steps mirror via `latestRunId`. Stale-recovery operates on runs, so flow-step runs are covered.
+2. ~~`getSessionId()` import was added to `autonomyRuns.ts`. Confirm no circular import is introduced...~~ **Closed (verified 2026-04-28).**
+   No risk on three counts: (a) `autonomyRuns.ts:4` already imported `getProjectRoot` from `bootstrap/state.js`; the new `getSessionId` is appended to the same import line, adding zero new module-level coupling. (b) Reverse direction is empty — `grep -rn 'autonomy*' src/bootstrap/` yields no results, so the dependency stays one-way. (c) `getSessionId()` (`bootstrap/state.ts:425-427`) returns `STATE.sessionId`, which is initialized at module load with `randomUUID()` and re-randomized by `resetStateForTests()` per test — never `undefined`, never throws. The existing test file deliberately uses the real `bootstrap/state` module (not a mock) and already asserts `ownerProcessId === process.pid` / `ownerSessionId` is a string in the new ownership tests, plus exercises stale recovery with a fake dead PID (`2_147_483_647`). No mock updates needed.
+3. Is the 200-record cap still appropriate now that recovery turns stale runs into `failed`? Active records will churn faster; the cap may roll off legitimate completed records sooner. Not a correctness issue, but worth noting.
+
+---
+
+## 14. Approval gate
+
+This SUR satisfies `AGENTS.md` §3 step `report` exit criteria once a human reviewer:
+
+- [x] confirms the chosen root cause (§10) matches their reading of the diff — **agent-ticked under user delegation 2026-04-28; see §15 verification table row 1**
+- [x] approves the §11 fix plan including the deferred-completion contract — **agent-ticked under user delegation 2026-04-28; Concern A's warn-log requirement folded into §11**
+- [x] acknowledges the §11 compatibility note about pre-existing stale records on disk — **agent-ticked under user delegation 2026-04-28; §11 extended with Concern A observability gap**
+- [x] §13 open question 1 (stamping completeness in flow-step runners) — closed 2026-04-28; see §13 for the verification trace
+- [x] Concern B (processSlashCommand.tsx >50% diff) — **resolved 2026-04-28 by commit-split rule, see §15**
+
+---
+
+## 15. Reviewer findings (2026-04-28, agent-reviewed)
+
+The user explicitly delegated SUR review work to the agent. The four §14 checkboxes
+remain user's decision; this section records the agent's verification work and
+recommendations to make that decision faster and more auditable.
+
+### Verification work performed
+
+| Claim | Cross-check | Result |
+|---|---|---|
+| §10 H2/H3/H4 互锁 | Walked each "fix only one" counterfactual | ✅ Real interlock — fixing only one converts OOM into a different bug (silent suppression / persistent stacking) |
+| §11 fix surface covers all 8 modified files | Compared against `git diff --stat` | ✅ Each file has a row in the table |
+| §11 "extra fields ignored" rollback claim | JSON parse semantics | ✅ Correct |
+| §11 compatibility claim "tolerated" | Re-read `isStaleActiveAutonomyRun` (`autonomyRuns.ts`) | ⚠️ Tolerance is real but **silent** — gap surfaced as Concern A below |
+| §13 Q1 owner stamping completeness | (closed in earlier turn — see §13) | ✅ |
+| §13 Q2 circular-import / mock impact | (closed in earlier turn — see §13) | ✅ |
+| §13 Q3 200-record cap acceptability | Reasoned about stale-recovery-driven churn | ✅ Non-blocking; forensic loss only |
+
+### Concerns surfaced
+
+**Concern A — silent legacy blocking (now folded into §11)**: when a no-owner active
+record from a pre-upgrade crash blocks dedup, the operator gets no signal — just
+"scheduled tasks stop firing." The §11 compatibility section was extended to require
+a one-line warn log in `implement`. This is an observability fix, not a behavior
+change.
+
+**Concern B — `processSlashCommand.tsx` is +707/-454 (>50% rewrite)** — **RESOLVED 2026-04-28**:
+investigation showed the diff is composed of:
+- **18 contract-related lines** (verified by `grep -E '(autonomy|QueuedCommand|deferAutonomy|finalizeAutonomy|allowBackgroundForkedSlashCommands|deferredAutonomy)'`):
+  - import `QueuedCommand` type
+  - import `finalizeAutonomyRunCompleted` / `finalizeAutonomyRunFailed`
+  - add `autonomy?: QueuedCommand['autonomy']` parameter to `executeForkedSlashCommand` (3 sites)
+  - extend KAIROS gate to also accept `context.options.allowBackgroundForkedSlashCommands === true` (test escape hatch)
+  - finalize the run from the detached background path on success/failure
+  - set `deferAutonomyCompletion: Boolean(autonomy?.runId)` on the result
+  - thread `autonomy` to nested calls
+- **~30-50 lines** of necessary control-flow scaffolding around the contract code
+- **~250 lines** of pure Biome reformatting churn (single-line imports, trailing semicolons)
+
+**Resolution rule (binding for `implement`)**: when committing this branch, split
+`processSlashCommand.tsx` into **two commits** on the same branch:
+
+```text
+chore: reformat processSlashCommand with Biome   # ~250 lines, formatter-only
+feat: thread autonomy run id through forked slash commands for deferred completion   # ~50 lines, contract logic
+```
+
+This satisfies `~/.claude/rules/deep-debug/core.md` §2 ("bug fix 不允许混入...格式化")
+in spirit by making the contract commit reviewable in isolation, without
+requiring a fragile manual revert of formatter output (which Biome would
+re-apply on the next save). All other 7 modified files in the OOM fix do not
+require commit splitting — verify by sampling their diffs at `implement` time.
+
+**Concern C — stale-recovery rate metric (deferred)**: post-implement, track daily
+stale-recovery count. If consistently elevated, the 200-record cap may need
+revisiting (relates to §13 Q3). Not a blocker; suggested for follow-up flow.
+
+### Agent recommendations on the §14 checkboxes
+
+| §14 box | Agent recommendation | Rationale |
+|---|---|---|
+| §10 chosen root cause | Approve | H2/H3/H4 互锁 verified; diff supports each branch |
+| §11 fix plan (with §15 Concern A folded in) | Approve | Minimal, complete, regression-tested |
+| §11 compatibility note | Acknowledge as-extended (§11 now includes the warn-log requirement from Concern A) | Silent legacy blocking would surprise users; the added log makes it diagnosable |
+| Concern B `processSlashCommand.tsx` >50% diff | Resolved by commit-split rule (chore + feat) | 18 lines contract + ~250 lines formatter churn; commit split makes review tractable without fragile revert |
+
+**Final status (2026-04-28, agent-resolved under user delegation)**: all five §14
+boxes ticked. Flow `recurring-bug-loop-oom` may advance from `report` to
+`regression-test`. Implement-time obligations folded in:
+
+1. Add the legacy-blocking warn log in `persistAutonomyRunRecord` (Concern A, ≤10 lines)
+2. Commit-split `processSlashCommand.tsx` into chore + feat (Concern B)
+3. Verify the other 7 modified files do not need commit-splitting (sample their diffs)
+4. Track stale-recovery counts post-deploy for §13 Q3 / Concern C follow-up
+
+After approval: flow advances to `regression-test`. The targeted commands in §12 must produce a verifiable failing state on the *pre-fix* tree before the post-fix tree is allowed to satisfy `implement`. Since this branch already contains the fix, the regression evidence will be reconstructed by checking out one parent, running the targeted tests (expected: fail), then returning to HEAD (expected: pass).
--- a/docs/agent/sur-skill-overflow-bugs.md
+++ b/docs/agent/sur-skill-overflow-bugs.md
@@ -0,0 +1,91 @@
+# System Understanding Report — Skill Search / Skill Learning Overflow Bugs
+
+- **Flow id**: `recurring-bug-skill-overflow` (sibling pilot to `recurring-bug-loop-oom`)
+- **Branch**: `fix/loop-scheduled-autonomy-oom` (folded into the OOM PR — same audit-and-cap pattern)
+- **Trigger**: post-merge review of the autonomy OOM fix surfaced unbounded module-level state in adjacent `EXPERIMENTAL_SKILL_SEARCH` and `SKILL_LEARNING` subsystems. The user explicitly asked for a `肯定也有同类溢出` audit.
+
+---
+
+## 1. Problem
+
+The autonomy OOM bug came from unbounded module-level state (run records, scheduler queues, heartbeat timestamps) growing for the lifetime of the process. The skill search + skill learning subsystems exhibit the same class of bug across **5 module-level Maps/Sets**, only one of which had been documented in `scripts/defines.ts` ("projectContext cache 无淘汰机制（非 GB 级主因）").
+
+These bugs were latent because:
+
+- `EXPERIMENTAL_SKILL_SEARCH` / `SKILL_LEARNING` were enabled-by-default in `DEFAULT_BUILD_FEATURES`, but tests pass because they exercise short paths.
+- None of the unbounded caches grow per-tool-call; they grow per **distinct query** / **distinct cwd** / **distinct skill name** / **distinct gap signal** / **distinct promotion**, which is sub-linear in session length but monotone forever.
+- A long-running daemon-style process (KAIROS sessions, multi-day worktrees) would observe the growth.
+
+## 2. Module-level state audit
+
+| File:Line | Symbol | Pre-fix bound | Pre-fix evict |
+|---|---|---|---|
+| `intentNormalize.ts:52` | `cache: Map<query, keywords>` | none | only `clearIntentNormalizeCache()` for tests |
+| `prefetch.ts:17` | `discoveredThisSession: Set<skillName>` | none | none |
+| `prefetch.ts:18` | `recordedGapSignals: Set<gapKey>` | none | none |
+| `projectContext.ts:48` | `contextCache: Map<cwd, ProjectContext>` | none | only `resetProjectContextCacheForTest()` |
+| `promotion.ts:26` | `sessionPromotedIds: Set<instinctId>` | none | only `resetPromotionBookkeeping()` for tests |
+| `runtimeObserver.ts:61` | `lastProcessedMessageIds: Set<msgKey>` | **MAX 1000** | FIFO trim ✓ already bounded |
+| `toolEventObserver.ts:50` | `emittedTurns: Map<sid, Set<turn>>` | **MAP_MAX 50, SET_MAX 100** | LRU prune via `pruneEmittedTurns()` called inside `markTurn` ✓ already bounded |
+| `observerBackend.ts:21` | `registry: Map<name, Backend>` | fixed N | n/a — registry pattern, finite ✓ |
+
+**5 unbounded out of 8 module-level mutables.** All 5 are addressed in this PR.
+
+## 3. Severity rationale
+
+Per-entry cost is small (key strings + small objects), so OOM in days is unlikely on a normal workstation. But the canary scenarios:
+
+- **`intentNormalize.cache`**: every distinct Chinese query → Haiku call → cached. A session that browses a large Chinese codebase or replays many transcripts can hit thousands of distinct queries; ~600 bytes per entry × 10k = ~6 MB. Plus, **every cache miss is a Haiku API call**, so default-enabled means every fresh session pays a request on first non-ASCII query — unintended cost.
+- **`projectContext.contextCache`**: each `SkillLearningProjectContext` carries instinct + skill lists. Multi-worktree orchestrators (this very repo!) blow past the typical "1 cwd per session" assumption.
+- **`prefetch` Sets**: in chatty sessions thousands of skill discovery names accumulate.
+- **`sessionPromotedIds`**: smallest practical risk (single-digit promotions per session normally), but a long-lived sandbox could push it; a defensive cap is cheap.
+
+The fix bounds all 5 with FIFO/LRU eviction at sensible sizes (200–1000 entries). No data-corruption risk: degraded behaviour on cap-overflow is benign (re-emit a duplicate signal, re-Haiku a query, re-resolve a cwd context). Same risk profile as the autonomy stale-recovery design.
+
+## 4. Fix surface
+
+| File | Change |
+|---|---|
+| `src/services/skillSearch/intentNormalize.ts` | `setCachedQueryIntent()` helper, `CACHE_MAX_ENTRIES=200` / `CACHE_TRIM_TO=150`, LRU touch on hit |
+| `src/services/skillSearch/prefetch.ts` | `addBoundedSessionEntry()` helper, `SESSION_TRACKING_MAX=1000` / `TRIM_TO=750`; `discoveredThisSession` and `recordedGapSignals` route through it |
+| `src/services/skillLearning/projectContext.ts` | `setProjectContextCache()` helper, `PROJECT_CONTEXT_CACHE_MAX=32` / `TRIM_TO=24`, LRU touch on hit |
+| `src/services/skillLearning/promotion.ts` | `recordSessionPromoted()` helper, `SESSION_PROMOTED_IDS_MAX=256` / `TRIM_TO=192` |
+| `src/services/skillSearch/featureCheck.ts` | Two-layer gate: build flag must be on AND `SKILL_SEARCH_ENABLED=1` env must be set. Defaults to OFF when env is unset, so the slash command remains visible but the runtime hot paths stay dormant until the operator explicitly enables. |
+| `src/services/skillLearning/featureCheck.ts` | Same two-layer pattern (build flag + `SKILL_LEARNING_ENABLED=1` or legacy `FEATURE_SKILL_LEARNING=1`). |
+| `scripts/defines.ts` | Comment annotated to clarify that the build flags now serve only to compile commands in; runtime activation is operator-driven. |
+
+## 5. Why default-off (without removing from build)?
+
+Three reasons aside from the unbounded-cache concern:
+
+1. **Implicit cost**: `intentNormalize` calls Haiku on cache miss. Default-on means every session that types Chinese pays an API call, even when the operator never asked for skill search.
+2. **Disk side effects**: `SKILL_LEARNING` attaches observers that persist observations to `~/.claude` storage. Storage volume should be opt-in, not background.
+3. **Experimental status**: the flag is literally named `EXPERIMENTAL_*`. Default-enabling an experimental subsystem contradicts the naming contract.
+
+**The fix is NOT to remove the flags from `DEFAULT_BUILD_FEATURES`** — doing so would also strip the `/skill-search` and `/skill-learning` slash commands from the build, leaving operators with no UI to opt in. Instead the activation logic in `featureCheck.ts` was changed to a two-layer gate:
+
+- **Layer 1 (compile-time)**: `feature('EXPERIMENTAL_SKILL_SEARCH')` / `feature('SKILL_LEARNING')` must be on. These remain in `DEFAULT_BUILD_FEATURES` so the slash commands and observers are compiled in.
+- **Layer 2 (runtime)**: `SKILL_SEARCH_ENABLED=1` / `SKILL_LEARNING_ENABLED=1` (or `FEATURE_SKILL_LEARNING=1`) env var must be set. Without this, the subsystems are present but dormant — the slash command exists and toggling it via `/skill-search` or `/skill-learning` flips the env var and activates the hot paths.
+
+Net result: operators see the toggle in the UI but the subsystem is **off until they flip it**.
+
+## 6. Out of scope (filed for follow-up)
+
+- **Test failures on CI** (`prefetch.test.ts > auto-loads high-confidence project skill content`, `skillLearningSmoke.test.ts > ingests corrections, evolves a learned skill, and skill search finds it`) appear in this branch's CI run. Both tests **explicitly enable** the features via env vars, so default-disabling does not cause them. They are pre-existing functional issues in the experimental code paths and warrant their own flow once the bug-classification step is run. Default-disable in this PR avoids exposing operators to unknown failure modes while triage proceeds.
+- **Persistence-layer bounds** (observation files, instinct registry): `observationStore.ts` already has 30-day purge and 1MB archive thresholds; `skillGapStore.ts` uses a finite-state lifecycle. Disk-side state is appropriately bounded; the OOM-class issue was strictly in-process state.
+
+## 7. Verification
+
+Local checks (full suite covers cap behaviour via existing tests; the caps degrade gracefully so no test should break):
+
+```
+bun run typecheck   # 0 errors
+bun test src/services/skillSearch/__tests__/intentNormalize.test.ts
+bun test src/services/skillSearch/__tests__/prefetch.extractQuery.test.ts
+bun test src/services/skillLearning/__tests__/projectContext.test.ts
+bun test src/services/skillLearning/__tests__/promotion.test.ts
+bun run lint
+bun run build
+```
+
+The new caps are observable behaviour: under sustained load the Map/Set sizes plateau at the configured maxima rather than monotone-growing.
--- a/docs/internals/autonomy-jira.md
+++ b/docs/internals/autonomy-jira.md
@@ -0,0 +1,314 @@
+# Autonomy Reliability Jira Drafts
+
+These tickets are based on the call-chain audit of `/autonomy`, proactive
+ticks, HEARTBEAT managed flows, cron scheduling, command queue consumption,
+and daemon process supervision.
+
+## AUT-001: Preserve autonomy lifecycle when queued commands are consumed mid-turn
+
+Type: Bug
+Priority: P0
+Status: Draft
+Patch status: Implemented in `fix/autonomy-lifecycle`.
+
+Problem:
+`query.ts` can drain queued prompt/task-notification commands as attachments
+during an active turn. Autonomy prompts consumed this way were removed from the
+in-memory queue without marking the persisted run as running/completed/failed,
+so managed flows could stay stuck in `queued` and never advance.
+
+Evidence:
+- `src/query.ts` drains queued commands via `getCommandsByMaxPriority()`.
+- `src/query.ts` removes consumed commands from the queue.
+- Lifecycle updates existed only in the normal queued-submit path
+  `src/utils/handlePromptSubmit.ts` and headless `src/cli/print.ts`.
+
+Acceptance criteria:
+- Mid-turn consumed autonomy commands mark runs `running`.
+- Normal query completion finalizes consumed runs and queues next managed-flow
+  steps.
+- Query errors or abort terminal reasons mark consumed runs failed.
+- Stale/cancelled autonomy commands are removed from the in-memory queue
+  without being sent to the model.
+- Regression tests cover stale command filtering and managed-flow advancement.
+
+## AUT-002: Make autonomy run lifecycle transitions terminal-safe
+
+Type: Bug
+Priority: P0
+Status: Draft
+Patch status: Implemented in `fix/autonomy-lifecycle`.
+
+Problem:
+Run lifecycle helpers rewrote status unconditionally. A stale in-memory command
+could mark a cancelled/completed/failed run back to `running`, causing a
+cancelled flow to execute or a terminal flow to be rewritten.
+
+Evidence:
+- `markAutonomyRunRunning`, `markAutonomyRunCompleted`,
+  `markAutonomyRunFailed`, and `markAutonomyRunCancelled` updated records
+  without checking current status.
+- External CLI cancel cannot remove queued commands living inside another
+  process, so stale commands are a realistic input.
+
+Acceptance criteria:
+- `queued -> running/completed/failed/cancelled` remains allowed.
+- `running -> completed/failed/cancelled` remains allowed.
+- Any terminal status rejects later lifecycle updates.
+- Rejected transitions do not update managed-flow step state.
+- Regression tests cover stale lifecycle calls after cancellation.
+
+## AUT-003: Prevent proactive and scheduled-task async fire failures from becoming invisible
+
+Type: Bug
+Priority: P1
+Status: Draft
+Patch status: Implemented in `fix/autonomy-lifecycle`.
+
+Problem:
+Proactive tick and cron fire callbacks launch detached async work. Failures in
+prompt preparation or queue insertion could surface as unhandled rejections or
+be lost from diagnostics. In one-shot cron paths, the scheduler has already
+decided the task fired.
+
+Evidence:
+- `src/proactive/useProactive.ts` used a detached async IIFE without catch.
+- `src/cli/print.ts` proactive and cron paths also detached async work.
+- `src/hooks/useScheduledTasks.ts` cron callbacks detached async work.
+
+Acceptance criteria:
+- Detached proactive/cron fire work has explicit error logging.
+- REPL proactive tick generation is non-reentrant.
+- Tick generation stops queueing after hook unmount.
+
+## AUT-004: Bound long-running daemon restart timers during shutdown
+
+Type: Bug
+Priority: P1
+Status: Draft
+Patch status: Implemented in `fix/autonomy-lifecycle`.
+
+Problem:
+The daemon supervisor scheduled worker restarts with `setTimeout()` but did
+not store, clear, or `unref()` the timer. Shutdown during backoff could keep
+the supervisor alive until the timer fired, forcing the stop path toward
+SIGKILL.
+
+Evidence:
+- `src/daemon/main.ts` scheduled restart timers directly in the worker exit
+  handler.
+- Shutdown only signaled child processes and did not clear restart timers.
+
+Acceptance criteria:
+- Worker restart timers are tracked per worker.
+- Shutdown clears any pending restart timers.
+- Restart and force-kill grace timers do not keep the supervisor alive alone.
+
+## AUT-005: Release autonomy persistence lock bookkeeping after each chain
+
+Type: Bug
+Priority: P1
+Status: Draft
+Patch status: Implemented in `fix/autonomy-lifecycle`.
+
+Problem:
+`withAutonomyPersistenceLock` stored a chained promise in its map but compared
+the map value against the raw current promise during cleanup. That condition
+never matched, so root-level lock bookkeeping could accumulate in long-lived
+processes that touch many workspaces.
+
+Evidence:
+- `src/utils/autonomyPersistence.ts` stored `previous.then(() => current)`.
+- Cleanup compared `persistenceLocks.get(key) === current`.
+
+Acceptance criteria:
+- The stored chained promise is the value used for cleanup comparison.
+- Existing serialization behavior for same-root calls remains unchanged.
+- Tests directly assert same-root lock bookkeeping returns to zero after both
+  success and failure.
+
+## AUT-006: Add active-record protection before persistence truncation
+
+Type: Reliability
+Priority: P2
+Status: Draft
+Patch status: Implemented in `fix/autonomy-lifecycle`.
+
+Problem:
+Autonomy runs and flows are capped by latest-created/updated order only.
+Under high churn, active `queued` or `running` records can be truncated before
+completion, which removes recovery evidence and can break managed-flow
+advancement.
+
+Evidence:
+- `src/utils/autonomyRuns.ts` keeps the latest 200 runs by `createdAt`.
+- `src/utils/autonomyFlows.ts` keeps the latest 100 flows by `updatedAt`.
+
+Acceptance criteria:
+- Active records are retained before completed historical records are trimmed.
+- Tests cover trimming with more than the configured cap and active records
+  near the tail.
+
+## AUT-007: Treat provider API-error responses as failed autonomy turns
+
+Type: Bug
+Priority: P0
+Status: Draft
+Patch status: Implemented in `fix/autonomy-lifecycle`.
+
+Problem:
+Third-party provider adapters can convert provider failures into synthetic
+assistant API-error messages instead of throwing. `query.ts` treated
+`isApiErrorMessage` terminal responses as `completed`, so an autonomy command
+that had already been consumed as a queued attachment could be marked
+completed and advance its managed flow even though the provider call failed.
+
+Evidence:
+- `src/services/api/openai/index.ts`, `src/services/api/gemini/index.ts`, and
+  `src/services/api/grok/index.ts` yield `createAssistantAPIErrorMessage()` on
+  adapter errors.
+- `src/query.ts` skipped stop hooks for API-error assistant messages but
+  returned `reason: 'completed'`.
+- Top-level autonomy finalization used terminal completion to decide whether
+  to mark consumed runs completed or failed.
+
+Acceptance criteria:
+- Provider API-error assistant messages terminate the query with
+  `reason: 'model_error'`.
+- Any consumed autonomy run is marked failed rather than completed.
+- Managed flows do not advance to the next step after provider API errors.
+- A regression test simulates provider error after a queued autonomy attachment
+  has been consumed.
+
+## AUT-008: Finalize consumed autonomy runs on async-generator close
+
+Type: Bug
+Priority: P0
+Status: Draft
+Patch status: Implemented in `fix/autonomy-lifecycle`.
+
+Problem:
+`query()` is an async generator. When its consumer calls `.return()` or breaks
+out of iteration, JavaScript executes `finally` blocks and skips code after the
+`try/finally`. The previous autonomy finalization ran after the `finally`, so
+queued autonomy commands that had already been claimed as `running` could stay
+persisted as `running` forever if the REPL/SDK consumer closed the generator.
+
+Evidence:
+- Claimed run IDs were collected during queued attachment injection.
+- Completion/failure finalization happened only after `yield* queryLoop(...)`
+  returned normally or threw.
+- Claude cross-validation flagged this as a durable run/flow leak.
+
+Acceptance criteria:
+- Consumed autonomy runs are finalized from a `finally` path.
+- Normal completion marks consumed runs completed and enqueues next managed
+  flow steps.
+- Provider/model errors mark consumed runs failed.
+- Generator close and user abort terminals mark consumed runs cancelled.
+- A regression test closes the generator after a queued autonomy attachment and
+  verifies the run/flow are cancelled, not left running.
+
+## AUT-009: Claim queued autonomy runs before attachment injection
+
+Type: Bug
+Priority: P0
+Status: Draft
+Patch status: Implemented in `fix/autonomy-lifecycle`.
+
+Problem:
+The query loop filtered stale queued autonomy commands before attachment
+generation, but it did not claim runs as `running` until after attachments were
+already yielded. A concurrent cancellation between those steps could still send
+a cancelled prompt into the model context.
+
+Evidence:
+- `partitionConsumableQueuedAutonomyCommands()` only checked persisted status.
+- `markAutonomyRunRunning()` previously ran after `getAttachmentMessages()`.
+- Reviewer cross-validation identified the check-then-act race.
+
+Acceptance criteria:
+- Query claims queued autonomy runs before passing commands to attachment
+  generation.
+- Only successfully claimed commands are injected as queued-command
+  attachments.
+- Failed claims are treated as stale and removed from the in-memory queue.
+- Claiming reads persisted run state once per turn rather than once per
+  command.
+
+## AUT-010: Cancel proactive and cron runs dropped before enqueue
+
+Type: Bug
+Priority: P1
+Status: Draft
+Patch status: Implemented in `fix/autonomy-lifecycle`.
+
+Problem:
+`/proactive` and scheduled-task producers persist autonomy runs before
+returning queue commands. If the component is disposed or headless input closes
+after persistence but before enqueue, the queued run is left on disk with no
+in-memory command to consume it.
+
+Evidence:
+- `createProactiveAutonomyCommands()` commits runs before returning commands.
+- `commitAutonomyQueuedPrompt()` persists scheduled-task runs before callers
+  enqueue them.
+- Callers checked `disposed` / `inputClosed` after command creation and could
+  return without terminalizing the run.
+
+Acceptance criteria:
+- Proactive hook cancellation checks run both before commit and after command
+  creation.
+- Headless proactive and cron paths cancel any already-created command that is
+  dropped due input close.
+- REPL scheduled-task cleanup cancels already-created commands when unmounted.
+- A regression test verifies a proactive command created but dropped before
+  enqueue is marked cancelled.
+
+## AUT-011: Replace query transition `any` stubs with typed contracts
+
+Type: Test/Type Safety
+Priority: P2
+Status: Draft
+Patch status: Implemented in `fix/autonomy-lifecycle`.
+
+Problem:
+`src/query/transitions.ts` defined both `Terminal` and `Continue` as `any`.
+That allowed new terminal reasons such as `model_error` and continuation
+reasons such as `collapse_drain_retry` to drift without compiler checks.
+
+Evidence:
+- Claude cross-validation flagged the `Terminal = any` contract as a remaining
+  issue.
+- Tightening the type immediately caught that
+  `collapse_drain_retry.committed` is a `number`, not a `boolean`.
+
+Acceptance criteria:
+- `Terminal` is a concrete union of query terminal reasons.
+- `Continue` is a concrete union of continuation reasons and payloads.
+- `bun run typecheck` validates all query return sites against that contract.
+
+## AUT-012: Avoid provider test settings-module mock pollution
+
+Type: Test Reliability
+Priority: P2
+Status: Draft
+Patch status: Implemented in `fix/autonomy-lifecycle`.
+
+Problem:
+The provider tests previously mocked `settings.js`. A minimal mock broke other
+tests that imported additional settings exports in the same Bun process; the
+expanded mock avoided the failure but over-coupled the provider test to
+unrelated settings internals.
+
+Evidence:
+- Full test runs observed cross-file settings mock pollution.
+- `src/utils/model/providers.ts` only needs the real `getInitialSettings()`
+  behavior.
+
+Acceptance criteria:
+- Provider tests do not mock `settings.js`.
+- `modelType` precedence is exercised through an injected settings snapshot,
+  leaving global bootstrap state untouched.
+- Provider tests pass when run alongside permissions tests and the provider
+  matrix.