Files
claude-code/docs/agent/sur-loop-scheduled-oom.md
unraid f2e9af4927 feat: harden autonomy lifecycle, OOM bounds, and provider-boundary finalization
This PR consolidates a coordinated batch of fixes around autonomy run/flow lifecycle, scheduled task deduplication, provider-boundary state finalization, and matching memory-bound treatments for adjacent long-running subsystems (REPL fullscreen scrollback, skill-search/skill-learning runtime activation). All changes were developed and reviewed together because they touched the same lifecycle invariants and were uncovered by the same long-running session reproductions.

## Lifecycle correctness

- Queued autonomy prompts are not injected unless the persisted run was successfully claimed; queued run claiming is now terminal-safe so a once-consumed/cancelled/failed run can not slip back into `queued`.
- Autonomy run/flow finalization happens on completion, provider error, generator close, and cancellation — not just the happy path. New `src/__tests__/queryAutonomyProviderBoundary.test.ts` covers these provider-boundary transitions.
- `requestManagedAutonomyFlowCancel` and `resumeManagedAutonomyFlowPrompt` carry `rootDir` and `currentDir` explicitly across detached async boundaries (proactive-tick, cron, daemon restart) instead of inferring from process state.
- Active runs/flows are protected from janitor pruning so a running step can not be garbage-collected mid-flight (`src/utils/autonomyAuthority.ts`).
- Heartbeat parser now ignores fenced code blocks; the two-phase commit window for autonomy state transitions is documented in `docs/internals/autonomy-jira.md`.

## Ownership and dedup

- `src/utils/autonomyRuns.ts`: ownership stamping (run id + rootDir carried end-to-end), source-based dedup against active runs.
- `src/hooks/useScheduledTasks.ts`: scheduled ticks deduplicate against runs already active on the same source label.
- `src/utils/processUserInput/processSlashCommand.tsx`: forked slash commands now thread the autonomy `runId` so completion finalizers can find the originating run for deferred completion.
- New `src/utils/autonomyQueueLifecycle.ts` and tests collect the queue-side lifecycle invariants in one place.

## Memory bounds (related, same review pass)

- `src/screens/REPL.tsx`: caps fullscreen scrollback after the compact boundary and updates trailing progress rows in place. Long-running fullscreen sessions could otherwise retain thousands of post-compaction messages and duplicate progress rows, keeping Ink trees alive long after their useful context had moved on.
- `src/services/skillSearch/*` and `src/services/skillLearning/*`: runtime activation is strictly opt-in via existing env toggles; session caches are capped so long-running processes can not grow them forever. Build presence is preserved so operators can still discover and opt into the slash commands.

## CI / test contract

- `tests/integration/dependency-overrides.test.ts`: smoke test no longer drives Mermaid's browser renderer; it validates the package-resolution contract directly so CI does not regress on unrelated browser timing.
- New `tests/integration/autonomy-lifecycle-user-flow.test.ts`: end-to-end CLI subprocess flow exercising `status --deep`, `flows`, `flow <id>`, `flow resume`, `flow cancel` against persisted state.
- `src/entrypoints/cli.tsx`: `claude autonomy …` routes through an entrypoint fast path that reuses the slash-command formatter without booting the full interactive CLI. Stdout is flushed before forced exit so coverage subprocesses do not terminate with empty stdout.
- `packages/builtin-tools/src/tools/RemoteTriggerTool/__tests__/RemoteTriggerTool.test.ts`: stabilized to prevent audit flake under coverage.

## Tests added

- `src/__tests__/queryAutonomyProviderBoundary.test.ts`
- `src/hooks/__tests__/useScheduledTasks.test.ts`
- `src/utils/__tests__/autonomyAuthority.test.ts`
- `src/utils/__tests__/autonomyFlows.test.ts` (extended)
- `src/utils/__tests__/autonomyPersistence.test.ts` (extended)
- `src/utils/__tests__/autonomyQueueLifecycle.test.ts`
- `src/utils/__tests__/autonomyRuns.test.ts` (extended)
- `src/utils/processUserInput/__tests__/processSlashCommand.test.ts`
- `tests/integration/autonomy-lifecycle-user-flow.test.ts`

## Docs

- `docs/agent/sur-loop-scheduled-oom.md`: System Understanding Report covering the scheduled/loop OOM problem, the call graphs investigated, and the lifecycle invariants this PR establishes.
- `docs/agent/sur-skill-overflow-bugs.md`: SUR for the related skill-overflow context.
- `docs/internals/autonomy-jira.md`: documents the two-phase commit window and ownership stamping invariants.
- `docs/memory-leak-audit.md`: audit notes covering the REPL/scrollback and skill-search bounds.

## Invariants this PR establishes

1. Queued autonomy prompts are not injected unless the persisted run was successfully claimed.
2. Terminal run/flow states are terminal — completion, failure, and cancellation all finalize state regardless of which provider/error path triggered them.
3. Autonomy run/flow `rootDir` is carried explicitly across detached async boundaries instead of inferred from a shared singleton.
4. State-only CLI subcommands (`autonomy status|runs|flows|flow …`) bypass full interactive bootstrap so they do not hold unrelated handles open.
5. REPL fullscreen scrollback and skill-search/skill-learning session caches are explicitly bounded.

## Validation

```bash
bun run typecheck
CI=true GITHUB_ACTIONS=true bun test            # 3996 pass / 0 fail across 305 files
bun test src/__tests__/queryAutonomyProviderBoundary.test.ts \
         src/hooks/__tests__/useScheduledTasks.test.ts \
         src/utils/__tests__/autonomy{Runs,Flows,Authority,QueueLifecycle,Persistence}.test.ts \
         src/utils/processUserInput/__tests__/processSlashCommand.test.ts \
         tests/integration/autonomy-lifecycle-user-flow.test.ts
```

## Origin

This PR is the consolidated, upstream-targeted version of two fork-side review PRs (fix/loop-scheduled-autonomy-oom and fix/autonomy-lifecycle). The fork-side review history is preserved at https://github.com/amDosion/claude-code-bast/pull/7 . The fork's own internal `chore: keep fork current with upstream` sync commits and the `docs: update contributors` automation are intentionally not included in this PR.

The autonomy CLI handler `rootDir` threading that the fork added (78f64d8a, 98d04ddb) is intentionally omitted here because upstream `a2cfaf91` (fix: 修复 RemoteTriggerTool 和 autonomy 测试的全量运行失败) already performed the equivalent change with an additional `currentDir` option. Keeping the upstream version avoids regressing that improvement.
2026-04-29 14:04:27 +08:00

34 KiB

System Understanding Report — Loop / Scheduled Autonomy OOM

  • Flow id: recurring-bug-loop-oom (pilot flow for autonomy ↔ deep-debug binding)
  • Branch: fix/loop-scheduled-autonomy-oom
  • Worktree: E:\Source_code\Claude-code-bast-loop-scheduled-oom-fix
  • Author: back-filled from existing working-tree diff (no commits ahead of main)
  • Status: report (this document) — pending human approval before regression-test advances

1. Problem

Symptom

Long-running sessions with active scheduled tasks (cron) and/or HEARTBEAT-driven proactive ticks accumulated growing memory, eventually OOM'ing the Bun process. The visible signature was:

  • runs.json under .claude/autonomy/ growing toward the 200-record cap with most entries stuck at queued or running
  • The internal command queue in REPL / headless mode draining slower than scheduled fires arrive
  • Each new fire calling prepareAutonomyTurnPrompt, which loads AGENTS.md + HEARTBEAT.md text and merges due-task lists into a fresh string, holding more closure state per pending command

Expected behaviour

When a scheduled task fires while its prior run is still queued or running, the new fire should be skipped rather than enqueued behind it. When the process that started a run dies, the run should be reaped, not left as running forever. Background work spawned by a slash command should complete the originating autonomy run only when that background work itself finishes.

Actual behaviour (before fix)

  1. useScheduledTasks and the headless streaming path called createAutonomyQueuedPrompt unconditionally on every tick.
  2. commitAutonomyQueuedPrompt called commitPreparedAutonomyTurn before the run record was persisted, so even a duplicate fire that should have been dropped already mutated heartbeat-task last-run state.
  3. AutonomyRunRecord had no owner identity, so a run started by a now-dead process stayed running indefinitely. Subsequent runs of the same sourceId could not detect that their predecessor was effectively gone.
  4. Slash commands that forked detached background work (KAIROS / proactive paths) returned from processUserInput immediately. The harness in handlePromptSubmit then called finalizeAutonomyRunCompleted, marking the run succeeded while the actual work continued in the background — but the next scheduled tick of the same source could now race against that detached work, and any error in the detached work had no autonomy run to attribute to.

Reproduction shape

Not a single deterministic repro — load-induced. Rough recipe:

  • Configure two HEARTBEAT.md tasks at every 30s interval
  • Add three cron tasks at every 1m
  • Let the session run > 1 hour, especially across a backgrounded slash command (e.g. KAIROS /sleep-style detached fork)
  • Watch .claude/autonomy/runs.json active-status entry count and Bun heap RSS

User impact

Sessions with long-lived autonomy/cron use cases were unsafe. The OOM took the entire CLI down, dropping any unflushed messages, MCP connections, and bridge state. Because .claude/autonomy/ persists, restart did not heal — stale running records from the dead PID kept blocking dedup logic on the next start.


2. System boundary

In scope

  • Autonomy run lifecycle: create → running → succeeded / failed / cancelled (src/utils/autonomyRuns.ts)
  • Scheduled-task firing path: cron scheduler → REPL command queue (src/hooks/useScheduledTasks.ts)
  • Headless streaming variant of the same path (src/cli/print.ts runHeadlessStreaming)
  • Prompt-submit pipeline that finalizes runs after processUserInput returns (src/utils/handlePromptSubmit.ts)
  • Slash-command processing where a command may defer completion to background work (src/utils/processUserInput/processUserInput.ts, processSlashCommand.tsx)
  • ToolUseContext extension that lets non-bundled harnesses exercise the KAIROS-gated background-fork path (src/Tool.ts)

Out of scope

  • The cron scheduler itself (src/utils/cronScheduler.ts) — its tick semantics are not changing
  • autonomyFlows.ts flow state machine — separate from per-run tracking
  • HEARTBEAT.md scheduling semantics — unchanged. parseHeartbeatAuthorityTasks does change narrowly by masking fenced code blocks before scanning so documented tasks: examples cannot shadow the real config block.
  • prepareAutonomyTurnPrompt content shape — only its call ordering relative to run creation changes
  • Any provider-level behaviour (services/api/**) — not touched

Assumptions

  • process.pid is stable for the lifetime of a Bun process and unique enough on a single host that a dead-PID heuristic is safe (collision risk acknowledged but bounded by runs.json retention).
  • isProcessRunning(pid) (from genericProcessUtils.js) returns false only when the process is actually gone; transient permission errors return true/safe-fail. Verified in step 6.
  • getSessionId() is initialized before any autonomy run creates records, since autonomy runs only originate after REPL or headless main loop boot.

3. Entry points

Surface Entry Notes
REPL useScheduledTasks cron tick Calls createScheduledTaskQueuedCommand (new helper) instead of raw createAutonomyQueuedPrompt
REPL Slash command pipeline processUserInput → processUserInputBase → processSlashCommand now threads autonomy context so commands can defer completion
Headless runHeadlessStreaming cron path Same migration to createAutonomyQueuedPromptIfNoActiveSource, plus shouldCreate callback honouring inputClosed
Tool harness ToolUseContext.options.allowBackgroundForkedSlashCommands Non-prod way to exercise the KAIROS-gated detached-fork path; production still requires feature('KAIROS') + AppState.kairosEnabled
Persistence .claude/autonomy/runs.json Schema gains ownerProcessId, ownerSessionId; readers must tolerate older records lacking these fields

4. Key files

File Lines changed Why it matters
src/utils/autonomyRuns.ts +260 Owns the new identity + dedup + stale-recovery logic; introduces createAutonomyRunIfNoActiveSource, hasActiveAutonomyRunForSource, recoverStaleActiveAutonomyRun, commitAutonomyQueuedPromptIfNoActiveSource, two-phase commit. The structural heart of the fix.
src/utils/processUserInput/processSlashCommand.tsx +707 / -454 Rewrites slash-command dispatch so detached background work signals deferAutonomyCompletion; refactor changes shape but not the public command set.
src/hooks/useScheduledTasks.ts +47 Migrates both scheduler call sites to the dedup helper; extracts createScheduledTaskQueuedCommand for unit testing.
src/cli/print.ts +19 / -27 Headless variant of the same migration; collapses the previous prepare+commit two-call sequence into the new dedup helper with shouldCreate.
src/utils/handlePromptSubmit.ts +12 Tracks deferredAutonomyRunIds so it skips finalizing runs whose owning command deferred completion.
src/utils/processUserInput/processUserInput.ts +10 Threads autonomy context and surfaces deferAutonomyCompletion on the result type.
src/Tool.ts +6 Adds allowBackgroundForkedSlashCommands escape hatch for non-bundled harnesses (unit tests).
src/utils/__tests__/autonomyRuns.test.ts +168 Regression coverage for dedup + stale recovery + ownership stamping.
src/hooks/__tests__/useScheduledTasks.test.ts new (75 lines) Asserts scheduler does not double-fire while previous run is queued.
src/utils/processUserInput/__tests__/processSlashCommand.test.ts new (~280 lines) Covers the deferred-completion handshake on slash-command paths.

5. Call flow (post-fix)

cron tick (useScheduledTasks)
  └─> createScheduledTaskQueuedCommand(task)
        └─> createAutonomyQueuedPromptIfNoActiveSource
              ├─> prepareAutonomyTurnPrompt        (loads AGENTS.md + HEARTBEAT.md)
              ├─> shouldCreate?  ──► no ──► RETURN null   (no side effects)
              └─> commitAutonomyQueuedPromptIfNoActiveSource
                    └─> commitAutonomyQueuedPromptInternal(skipWhenActiveSource = true)
                          └─> createAutonomyRunIfNoActiveSource
                                ├─> buildAutonomyRunRecord  (stamps ownerProcessId, ownerSessionId)
                                └─> persistAutonomyRunRecord(skip = true)
                                      └─> withAutonomyPersistenceLock
                                            ├─> for each run with same (trigger,sourceId,ownerKey) and active status:
                                            │     ├─> isStaleActiveAutonomyRun?  ──► recoverStaleActiveAutonomyRun (mark failed)
                                            │     └─> else ──► hasBlockingActiveRun = true
                                            ├─> if blocking ──► RETURN created=false (no enqueue)
                                            └─> else ──► unshift record, write file, return true
                          ├─> if run is null ──► RETURN null (caller drops the tick)
                          └─> else ──► commitPreparedAutonomyTurn(prepared)  (heartbeat last-run state ONLY now mutates)
                                └─> assemble QueuedCommand and return

Two structural moves: (a) preparing the prompt no longer commits heartbeat state; only successful run insertion commits it. (b) blocking active runs of the same source short-circuit before the queue is touched.

For slash commands:

processUserInput → processUserInputBase
  └─> processSlashCommand(..., autonomy = cmd.autonomy)
        └─> command implementation
              ├─> runs synchronously                    ──► returns normal result
              └─> spawns detached/background work       ──► returns result with deferAutonomyCompletion = true
                                                              + handles its own finalize* call when work ends

handlePromptSubmit (caller of processUserInput):
  ├─> records cmd.autonomy.runId in autonomyRunIds
  ├─> on result with deferAutonomyCompletion=true: adds runId to deferredAutonomyRunIds
  └─> finalize loop: skips deferred ids in BOTH success and error branches

6. Data flow

runs.json record schema (delta)

type AutonomyRunRecord = {
  // existing
  runId: string
  status: 'queued' | 'running' | 'succeeded' | 'failed' | 'cancelled'
  trigger: AutonomyTriggerKind
  sourceId?: string
  ownerKey?: string
  // new
  ownerProcessId?: number     // process.pid at create time and at markRunning time
  ownerSessionId?: string     // getSessionId() at the same points
  // ...
}

Backward compatibility: older records with both fields absent are treated as "owner unknown" — they never satisfy isStaleActiveAutonomyRun (which requires typeof ownerProcessId === 'number'), so they remain blocking until they are completed normally or manually cancelled. This is intentional: we cannot prove they are stale.

Stale-recovery rule

isStaleActiveAutonomyRun(run) ⇔
    run.status ∈ {queued, running}
  ∧ typeof run.ownerProcessId === 'number'
  ∧ !isProcessRunning(run.ownerProcessId)

Recovery mutates the in-memory list inside the persistence lock and writes it back, marking the stale run failed with error prefix "Recovered stale active autonomy run".

Heartbeat last-run state mutation point

Before fix: commitAutonomyQueuedPrompt called commitPreparedAutonomyTurn(prepared) first, then created the run. A skipped duplicate already advanced heartbeat last-run timestamps.

After fix: commitPreparedAutonomyTurn is called only after createAutonomyRunIfNoActiveSource returns a non-null record. Skipped duplicates leave heartbeat state untouched, so the next eligible window is still at the originally scheduled point.


7. State model

Run status lifecycle (unchanged at edges, tightened in the middle)

queued ──► running ──► succeeded
   │           │
   │           └────► failed
   ├──────────────────► cancelled
   └──► failed (stale recovery, new path)

New invariants

  1. Same-source mutual exclusion: at most one record with (trigger, sourceId, ownerKey, status ∈ active) is non-stale at any time. Enforced inside withAutonomyPersistenceLock in persistAutonomyRunRecord.

  2. Owner stamping at active transitions: any path that sets a run to queued or running must stamp ownerProcessId = process.pid and ownerSessionId = getSessionId(). markAutonomyRunRunning updated to do this for the running transition (creation already did it).

  3. Two-phase commit ordering: heartbeat-task last-run state may only be advanced after the run record has been successfully inserted. Equivalent to "prompt commit ⇒ run row exists".

  4. Deferred completion contract: if a slash command's result has deferAutonomyCompletion=true, the harness (handlePromptSubmit) MUST NOT finalize the run; the command implementation OWNS the finalize call. Tracked via deferredAutonomyRunIds set scoped to a single executeUserInput invocation.

Concurrency / retry risks

  • Two processes sharing the same project root can race on runs.json. Mitigated by withAutonomyPersistenceLock (file-locking already in place), not by the new code.
  • Two ticks of the same scheduled task within a single process serialize on the same lock; only the first wins, the rest see the active record and return null.
  • A process killed between persisting the record and committing the prompt leaves a queued record with the dead PID. Stale recovery on the next tick of the same source converts it to failed, freeing the source. This is the new safety net.

Two-phase commit crash window (acknowledged limitation)

Within commitAutonomyQueuedPromptInternal the order is:

  1. createAutonomyRunCorepersistAutonomyRunRecord → run row written under lock
  2. commitPreparedAutonomyTurn(prepared) → in-memory heartbeatTaskLastRunByKey Map advanced

These two steps are NOT atomic. If the process is killed between (1) and (2):

  • runs.json has a fresh queued record stamped with the now-dead PID.
  • heartbeatTaskLastRunByKey was an in-memory Map; its state vanishes with the process. On restart the Map is empty.
  • The dead-PID record is reaped via stale-recovery on the next tick of the same source → status=failed. New record can be created.
  • Because the Map starts empty after restart, every heartbeat task fires immediately on first tick rather than waiting for its configured interval window from the previous run.

Severity: low. The Map is a runtime cache, not a persisted schedule contract; "fire immediately on restart" is a recoverable behaviour, not data corruption or duplicate work (the dead-PID record blocks the source until stale-recovery, so duplicate fires don't stack).

Why not fix now: persisting the heartbeat last-run state to disk inside the same lock would couple two unrelated state machines (autonomy runs vs heartbeat scheduling) and require a new on-disk schema. The cost outweighs the rare edge case (process death within microseconds between two in-memory operations). Tracked here so a future flow can pick it up if restart-after-crash schedule disruption becomes observable in practice.


8. Existing tests

Pre-fix

  • src/utils/__tests__/autonomyRuns.test.ts covered create / list / mark transitions for the basic happy path.
  • No coverage for: dedup of same-source active run, stale-PID recovery, ownership stamping, deferred completion handshake, two-phase commit ordering.
  • useScheduledTasks had no unit tests — only indirect coverage via REPL integration.
  • processSlashCommand had no autonomy-context coverage.

Added in this branch

  • src/utils/__tests__/autonomyRuns.test.ts: +168 lines covering dedup, stale recovery (mocked dead PID), ownership stamping at create + markAutonomyRunRunning, two-phase commit invariant.
  • src/hooks/__tests__/useScheduledTasks.test.ts: new file, 75 lines. Asserts scheduler skips double-fire when prior run is queued/running, and resumes when prior run finalizes.
  • src/utils/processUserInput/__tests__/processSlashCommand.test.ts: new file, ~280 lines. Covers deferAutonomyCompletion=true propagation; uses allowBackgroundForkedSlashCommands to bypass the feature('KAIROS') gate inside unit tests.

Not yet covered (proposed for regression-test step)

  • Cross-process race against the persistence lock — currently relies on file-lock correctness; consider a focused integration test that spawns two children and verifies only one wins.
  • Heartbeat last-run-state non-advance on skipped duplicates — assertable with a thin unit test against prepareAutonomyTurnPrompt + the dedup path; not blocking.

9. Competing root-cause hypotheses

H1 — "Prompt size is the OOM source"

Claim: each scheduled tick rebuilds a long prompt string (AGENTS.md + HEARTBEAT.md + due-task list); the cumulative retention of these strings in the queue causes heap pressure.

Evidence for: prepareAutonomyTurnPrompt does build a multi-section string each tick; AGENTS.md in this repo is now 220 lines.

Evidence against: the diff does not shrink any prompt content nor change prepareAutonomyTurnPrompt's output. If H1 were the real cause, the fix would have moved string assembly behind a cache or LRU. The fix instead targets the number of in-flight runs.

Verdict: contributing factor at most. Rejected as primary root cause.

H2 — "Background-forked slash commands leak runs"

Claim: KAIROS-style slash commands that fork detached work return immediately from processUserInput; the harness in handlePromptSubmit then finalizes the run as succeeded. Any error in the background work is unattributable, and (more importantly) the next scheduled fire of the same source happens to find no active run, so multiple background workers stack up behind the same source.

Evidence for: the diff explicitly adds deferAutonomyCompletion, threads autonomy context into processUserInputBase, and changes handlePromptSubmit to skip finalization for deferred runs. New test file processSlashCommand.test.ts is dedicated to this exact handshake.

Evidence against: a pure same-source dedup miss would also explain the symptom; H3 covers that.

Verdict: real and load-bearing. Confirmed by the targeted code added.

H3 — "Scheduled-task tick has no dedup against prior run"

Claim: cron tick / heartbeat tick fires unconditionally; if previous tick's run is still queued/running the queue grows by one each interval. Compounded across multiple sources, queue + runs.json active subset never shrink.

Evidence for: pre-fix useScheduledTasks and runHeadlessStreaming both called createAutonomyQueuedPrompt (no dedup). Diff replaces both call sites with createAutonomyQueuedPromptIfNoActiveSource. Persistence-side dedup added in the same change.

Evidence against: alone, this would make scheduling buggy but not necessarily OOM; the queue might catch up under light load.

Verdict: real and load-bearing. Confirmed by the targeted code added.

H4 — "Dead-process runs poison dedup forever"

Claim: even with H3 fixed, a process killed mid-run leaves a running record on disk with no owner liveness check; the next process loading runs.json would treat it as blocking and never schedule that source again.

Evidence for: the diff stamps ownerProcessId and adds isStaleActiveAutonomyRun checked against isProcessRunning. Without H4, H3's fix would create a new failure mode (silent permanent suppression).

Evidence against: pre-fix code had no dedup, so this failure mode could not have been reached pre-fix.

Verdict: real, but secondary. It exists because H3's fix introduces it. Required to ship together.


10. Chosen root cause

Combined H2 + H3 + H4: the unbounded growth of active autonomy runs is the product of three independently insufficient gaps that line up under load:

  1. Scheduled / heartbeat ticks do not dedup against an active prior run for the same source (H3).
  2. Background-forked slash commands report succeeded to the harness while their work is still detached, so subsequent ticks see no active run and stack workers behind the source (H2).
  3. Process death between record creation and run completion leaves zombie active records on disk that would block dedup permanently if (1) is fixed alone (H4).

Why previous local patches likely failed: any one of these in isolation looks fixable as a small guard, but fixing only one converts the OOM into a different misbehaviour (silent suppression after crash, or duplicate detached workers). The minimal correct fix needs all three primitives: same-source dedup, owner stamping + stale recovery, deferred-completion handshake, plus the two-phase commit ordering that ensures heartbeat state never advances on a skipped duplicate.


11. Fix plan

Minimal fix surface

Module Change Reason
autonomyRuns.ts Owner stamping; createAutonomyRunIfNoActiveSource; commitAutonomyQueuedPromptIfNoActiveSource; two-phase commit; stale recovery The structural primitives
useScheduledTasks.ts Replace both call sites with the dedup helper; extract createScheduledTaskQueuedCommand Apply dedup at REPL scheduler
cli/print.ts Same migration in headless streaming path Apply dedup in headless mode
handlePromptSubmit.ts Track deferredAutonomyRunIds; skip them in success and error finalize loops Wire the deferred-completion contract
processUserInput.ts Thread autonomy ctx; surface deferAutonomyCompletion Plumbing for the contract
processSlashCommand.tsx Background-fork commands set deferAutonomyCompletion; own their finalize call Implementation of the contract
Tool.ts allowBackgroundForkedSlashCommands flag on ToolUseContext.options Make the path testable from non-bundled harnesses

Tests added

  • autonomyRuns.test.ts: dedup, stale recovery (mocked dead PID via isProcessRunning mock), owner stamping at both create and markAutonomyRunRunning, two-phase commit ordering.
  • useScheduledTasks.test.ts: scheduler skips double-fire, resumes after finalize.
  • processSlashCommand.test.ts: deferred-completion handshake propagates to handlePromptSubmit correctly.

Compatibility / migration risk

  • Older runs.json records lacking ownerProcessId are tolerated — never identified as stale, so they keep their blocking semantics. Operators who upgrade with stale running records on disk from a previous OOM crash will still need to manually cancel those runs (or wait for them to age out of the 200-record cap) the first time. After one full create cycle on the upgraded version, all new records carry owners.
  • Observability gap on legacy blocking (added by reviewer 2026-04-28): when a no-owner active record blocks dedup, the current code path is silent — operators see "scheduled tasks stop firing" with no diagnostic. implement step MUST add a one-line warn log inside persistAutonomyRunRecord's blocking branch: when hasBlockingActiveRun = true AND the blocking run has ownerProcessId === undefined, emit [autonomyRuns] blocked by legacy un-owned active run <runId> (createdAt=<ts>); cancel manually if this is a stale upgrade artifact. ≤ 10 lines of code, converts silent hang into a diagnosable signal. Do not change behavior — just observability.
  • ToolUseContext.options.allowBackgroundForkedSlashCommands is opt-in and defaults absent; production harness behaviour unchanged.
  • No on-disk schema version bump required.

Rollback plan

  • Revert the working tree to main's versions of all 8 files. The runs.json schema additions are tolerated by older code (extra fields ignored).
  • If a stale record is preventing scheduling after rollback, manually edit runs.json (status → cancelled) or run /autonomy flow cancel for affected flows.
  • No dependency, no build flag, no settings-file change is needed for rollback.

Out of scope (intentionally)

  • Capping prepareAutonomyTurnPrompt output size (H1) — addressable later if needed; not load-bearing for the OOM.
  • Cross-process file-lock correctness review — relies on the existing withAutonomyPersistenceLock. Out of scope for this flow.
  • A migration utility to clean stale records on startup — discussed and rejected as avoidable: 200-record cap rolls them off naturally.

12. Verification

Commands (binding per .claude/autonomy/AGENTS.md §4)

bun run typecheck
bun test src/utils/__tests__/autonomyRuns.test.ts
bun test src/hooks/__tests__/useScheduledTasks.test.ts
bun test src/utils/processUserInput/__tests__/processSlashCommand.test.ts
bun test                              # full unit suite
bun run lint
bun run build

Manual checks (proposed for implement step)

  • Start a session with two HEARTBEAT.md 30s tasks for ≥ 30 minutes; observe runs.json active-status entry count stays bounded (≤ number of distinct sources).
  • Force-kill the Bun process during a running record. Restart. Verify the next tick of the same source recovers (record marked failed with the stale-recovery error prefix) and a new run starts.
  • Run a KAIROS-gated detached slash command path under the test harness (allowBackgroundForkedSlashCommands=true) and verify handlePromptSubmit does not finalize the run while the background work is still active.

Observability checks

  • [ScheduledTasks] skipping <id>: previous run still queued or running debug log appears when dedup fires (added in useScheduledTasks.ts). Use it to confirm dedup is reached in real sessions.
  • runs.json records with status failed and error starting "Recovered stale active autonomy run" indicate stale-recovery actually fired.

13. Open questions

  1. Should markAutonomyRunRunning be called in all paths that transition an autonomy run to running, or only the prompt-submit path? Closed (verified 2026-04-28). markAutonomyRunRunning (autonomyRuns.ts:554-579) is the only function that transitions AutonomyRunRecord.status → 'running'. It stamps ownerProcessId = process.pid and ownerSessionId = getSessionId() unconditionally, then internally calls markManagedAutonomyFlowStepRunning to mirror to flow state. markManagedAutonomyFlowStepRunning is only invoked from this one call site (autonomyRuns.ts:571); no caller bypasses the stamp. All four real callers (cli/print.ts:2177, screens/REPL.tsx:4859, utils/handlePromptSubmit.ts:492, utils/swarm/inProcessRunner.ts:741) go through the stamping path. Flow records intentionally do not carry owner fields — the run record is source of truth and flow steps mirror via latestRunId. Stale-recovery operates on runs, so flow-step runs are covered.
  2. getSessionId() import was added to autonomyRuns.ts. Confirm no circular import is introduced... Closed (verified 2026-04-28). No risk on three counts: (a) autonomyRuns.ts:4 already imported getProjectRoot from bootstrap/state.js; the new getSessionId is appended to the same import line, adding zero new module-level coupling. (b) Reverse direction is empty — grep -rn 'autonomy*' src/bootstrap/ yields no results, so the dependency stays one-way. (c) getSessionId() (bootstrap/state.ts:425-427) returns STATE.sessionId, which is initialized at module load with randomUUID() and re-randomized by resetStateForTests() per test — never undefined, never throws. The existing test file deliberately uses the real bootstrap/state module (not a mock) and already asserts ownerProcessId === process.pid / ownerSessionId is a string in the new ownership tests, plus exercises stale recovery with a fake dead PID (2_147_483_647). No mock updates needed.
  3. Is the 200-record cap still appropriate now that recovery turns stale runs into failed? Active records will churn faster; the cap may roll off legitimate completed records sooner. Not a correctness issue, but worth noting.

14. Approval gate

This SUR satisfies AGENTS.md §3 step report exit criteria once a human reviewer:

  • confirms the chosen root cause (§10) matches their reading of the diff — agent-ticked under user delegation 2026-04-28; see §15 verification table row 1
  • approves the §11 fix plan including the deferred-completion contract — agent-ticked under user delegation 2026-04-28; Concern A's warn-log requirement folded into §11
  • acknowledges the §11 compatibility note about pre-existing stale records on disk — agent-ticked under user delegation 2026-04-28; §11 extended with Concern A observability gap
  • §13 open question 1 (stamping completeness in flow-step runners) — closed 2026-04-28; see §13 for the verification trace
  • Concern B (processSlashCommand.tsx >50% diff) — resolved 2026-04-28 by commit-split rule, see §15

15. Reviewer findings (2026-04-28, agent-reviewed)

The user explicitly delegated SUR review work to the agent. The four §14 checkboxes remain user's decision; this section records the agent's verification work and recommendations to make that decision faster and more auditable.

Verification work performed

Claim Cross-check Result
§10 H2/H3/H4 互锁 Walked each "fix only one" counterfactual Real interlock — fixing only one converts OOM into a different bug (silent suppression / persistent stacking)
§11 fix surface covers all 8 modified files Compared against git diff --stat Each file has a row in the table
§11 "extra fields ignored" rollback claim JSON parse semantics Correct
§11 compatibility claim "tolerated" Re-read isStaleActiveAutonomyRun (autonomyRuns.ts) ⚠️ Tolerance is real but silent — gap surfaced as Concern A below
§13 Q1 owner stamping completeness (closed in earlier turn — see §13)
§13 Q2 circular-import / mock impact (closed in earlier turn — see §13)
§13 Q3 200-record cap acceptability Reasoned about stale-recovery-driven churn Non-blocking; forensic loss only

Concerns surfaced

Concern A — silent legacy blocking (now folded into §11): when a no-owner active record from a pre-upgrade crash blocks dedup, the operator gets no signal — just "scheduled tasks stop firing." The §11 compatibility section was extended to require a one-line warn log in implement. This is an observability fix, not a behavior change.

Concern B — processSlashCommand.tsx is +707/-454 (>50% rewrite)RESOLVED 2026-04-28: investigation showed the diff is composed of:

  • 18 contract-related lines (verified by grep -E '(autonomy|QueuedCommand|deferAutonomy|finalizeAutonomy|allowBackgroundForkedSlashCommands|deferredAutonomy)'):
    • import QueuedCommand type
    • import finalizeAutonomyRunCompleted / finalizeAutonomyRunFailed
    • add autonomy?: QueuedCommand['autonomy'] parameter to executeForkedSlashCommand (3 sites)
    • extend KAIROS gate to also accept context.options.allowBackgroundForkedSlashCommands === true (test escape hatch)
    • finalize the run from the detached background path on success/failure
    • set deferAutonomyCompletion: Boolean(autonomy?.runId) on the result
    • thread autonomy to nested calls
  • ~30-50 lines of necessary control-flow scaffolding around the contract code
  • ~250 lines of pure Biome reformatting churn (single-line imports, trailing semicolons)

Resolution rule (binding for implement): when committing this branch, split processSlashCommand.tsx into two commits on the same branch:

chore: reformat processSlashCommand with Biome   # ~250 lines, formatter-only
feat: thread autonomy run id through forked slash commands for deferred completion   # ~50 lines, contract logic

This satisfies ~/.claude/rules/deep-debug/core.md §2 ("bug fix 不允许混入...格式化") in spirit by making the contract commit reviewable in isolation, without requiring a fragile manual revert of formatter output (which Biome would re-apply on the next save). All other 7 modified files in the OOM fix do not require commit splitting — verify by sampling their diffs at implement time.

Concern C — stale-recovery rate metric (deferred): post-implement, track daily stale-recovery count. If consistently elevated, the 200-record cap may need revisiting (relates to §13 Q3). Not a blocker; suggested for follow-up flow.

Agent recommendations on the §14 checkboxes

§14 box Agent recommendation Rationale
§10 chosen root cause Approve H2/H3/H4 互锁 verified; diff supports each branch
§11 fix plan (with §15 Concern A folded in) Approve Minimal, complete, regression-tested
§11 compatibility note Acknowledge as-extended (§11 now includes the warn-log requirement from Concern A) Silent legacy blocking would surprise users; the added log makes it diagnosable
Concern B processSlashCommand.tsx >50% diff Resolved by commit-split rule (chore + feat) 18 lines contract + ~250 lines formatter churn; commit split makes review tractable without fragile revert

Final status (2026-04-28, agent-resolved under user delegation): all five §14 boxes ticked. Flow recurring-bug-loop-oom may advance from report to regression-test. Implement-time obligations folded in:

  1. Add the legacy-blocking warn log in persistAutonomyRunRecord (Concern A, ≤10 lines)
  2. Commit-split processSlashCommand.tsx into chore + feat (Concern B)
  3. Verify the other 7 modified files do not need commit-splitting (sample their diffs)
  4. Track stale-recovery counts post-deploy for §13 Q3 / Concern C follow-up

After approval: flow advances to regression-test. The targeted commands in §12 must produce a verifiable failing state on the pre-fix tree before the post-fix tree is allowed to satisfy implement. Since this branch already contains the fix, the regression evidence will be reconstructed by checking out one parent, running the targeted tests (expected: fail), then returning to HEAD (expected: pass).