mirror of
https://github.com/claude-code-best/claude-code.git
synced 2026-06-15 12:55:51 +00:00
This PR consolidates a coordinated batch of fixes around autonomy run/flow lifecycle, scheduled task deduplication, provider-boundary state finalization, and matching memory-bound treatments for adjacent long-running subsystems (REPL fullscreen scrollback, skill-search/skill-learning runtime activation). All changes were developed and reviewed together because they touched the same lifecycle invariants and were uncovered by the same long-running session reproductions.
## Lifecycle correctness
- Queued autonomy prompts are not injected unless the persisted run was successfully claimed; queued run claiming is now terminal-safe so a once-consumed/cancelled/failed run can not slip back into `queued`.
- Autonomy run/flow finalization happens on completion, provider error, generator close, and cancellation — not just the happy path. New `src/__tests__/queryAutonomyProviderBoundary.test.ts` covers these provider-boundary transitions.
- `requestManagedAutonomyFlowCancel` and `resumeManagedAutonomyFlowPrompt` carry `rootDir` and `currentDir` explicitly across detached async boundaries (proactive-tick, cron, daemon restart) instead of inferring from process state.
- Active runs/flows are protected from janitor pruning so a running step can not be garbage-collected mid-flight (`src/utils/autonomyAuthority.ts`).
- Heartbeat parser now ignores fenced code blocks; the two-phase commit window for autonomy state transitions is documented in `docs/internals/autonomy-jira.md`.
## Ownership and dedup
- `src/utils/autonomyRuns.ts`: ownership stamping (run id + rootDir carried end-to-end), source-based dedup against active runs.
- `src/hooks/useScheduledTasks.ts`: scheduled ticks deduplicate against runs already active on the same source label.
- `src/utils/processUserInput/processSlashCommand.tsx`: forked slash commands now thread the autonomy `runId` so completion finalizers can find the originating run for deferred completion.
- New `src/utils/autonomyQueueLifecycle.ts` and tests collect the queue-side lifecycle invariants in one place.
## Memory bounds (related, same review pass)
- `src/screens/REPL.tsx`: caps fullscreen scrollback after the compact boundary and updates trailing progress rows in place. Long-running fullscreen sessions could otherwise retain thousands of post-compaction messages and duplicate progress rows, keeping Ink trees alive long after their useful context had moved on.
- `src/services/skillSearch/*` and `src/services/skillLearning/*`: runtime activation is strictly opt-in via existing env toggles; session caches are capped so long-running processes can not grow them forever. Build presence is preserved so operators can still discover and opt into the slash commands.
## CI / test contract
- `tests/integration/dependency-overrides.test.ts`: smoke test no longer drives Mermaid's browser renderer; it validates the package-resolution contract directly so CI does not regress on unrelated browser timing.
- New `tests/integration/autonomy-lifecycle-user-flow.test.ts`: end-to-end CLI subprocess flow exercising `status --deep`, `flows`, `flow <id>`, `flow resume`, `flow cancel` against persisted state.
- `src/entrypoints/cli.tsx`: `claude autonomy …` routes through an entrypoint fast path that reuses the slash-command formatter without booting the full interactive CLI. Stdout is flushed before forced exit so coverage subprocesses do not terminate with empty stdout.
- `packages/builtin-tools/src/tools/RemoteTriggerTool/__tests__/RemoteTriggerTool.test.ts`: stabilized to prevent audit flake under coverage.
## Tests added
- `src/__tests__/queryAutonomyProviderBoundary.test.ts`
- `src/hooks/__tests__/useScheduledTasks.test.ts`
- `src/utils/__tests__/autonomyAuthority.test.ts`
- `src/utils/__tests__/autonomyFlows.test.ts` (extended)
- `src/utils/__tests__/autonomyPersistence.test.ts` (extended)
- `src/utils/__tests__/autonomyQueueLifecycle.test.ts`
- `src/utils/__tests__/autonomyRuns.test.ts` (extended)
- `src/utils/processUserInput/__tests__/processSlashCommand.test.ts`
- `tests/integration/autonomy-lifecycle-user-flow.test.ts`
## Docs
- `docs/agent/sur-loop-scheduled-oom.md`: System Understanding Report covering the scheduled/loop OOM problem, the call graphs investigated, and the lifecycle invariants this PR establishes.
- `docs/agent/sur-skill-overflow-bugs.md`: SUR for the related skill-overflow context.
- `docs/internals/autonomy-jira.md`: documents the two-phase commit window and ownership stamping invariants.
- `docs/memory-leak-audit.md`: audit notes covering the REPL/scrollback and skill-search bounds.
## Invariants this PR establishes
1. Queued autonomy prompts are not injected unless the persisted run was successfully claimed.
2. Terminal run/flow states are terminal — completion, failure, and cancellation all finalize state regardless of which provider/error path triggered them.
3. Autonomy run/flow `rootDir` is carried explicitly across detached async boundaries instead of inferred from a shared singleton.
4. State-only CLI subcommands (`autonomy status|runs|flows|flow …`) bypass full interactive bootstrap so they do not hold unrelated handles open.
5. REPL fullscreen scrollback and skill-search/skill-learning session caches are explicitly bounded.
## Validation
```bash
bun run typecheck
CI=true GITHUB_ACTIONS=true bun test # 3996 pass / 0 fail across 305 files
bun test src/__tests__/queryAutonomyProviderBoundary.test.ts \
src/hooks/__tests__/useScheduledTasks.test.ts \
src/utils/__tests__/autonomy{Runs,Flows,Authority,QueueLifecycle,Persistence}.test.ts \
src/utils/processUserInput/__tests__/processSlashCommand.test.ts \
tests/integration/autonomy-lifecycle-user-flow.test.ts
```
## Origin
This PR is the consolidated, upstream-targeted version of two fork-side review PRs (fix/loop-scheduled-autonomy-oom and fix/autonomy-lifecycle). The fork-side review history is preserved at https://github.com/amDosion/claude-code-bast/pull/7 . The fork's own internal `chore: keep fork current with upstream` sync commits and the `docs: update contributors` automation are intentionally not included in this PR.
The autonomy CLI handler `rootDir` threading that the fork added (78f64d8a, 98d04ddb) is intentionally omitted here because upstream `a2cfaf91` (fix: 修复 RemoteTriggerTool 和 autonomy 测试的全量运行失败) already performed the equivalent change with an additional `currentDir` option. Keeping the upstream version avoids regressing that improvement.
493 lines
34 KiB
Markdown
493 lines
34 KiB
Markdown
# System Understanding Report — Loop / Scheduled Autonomy OOM
|
|
|
|
- **Flow id**: `recurring-bug-loop-oom` (pilot flow for autonomy ↔ deep-debug binding)
|
|
- **Branch**: `fix/loop-scheduled-autonomy-oom`
|
|
- **Worktree**: `E:\Source_code\Claude-code-bast-loop-scheduled-oom-fix`
|
|
- **Author**: back-filled from existing working-tree diff (no commits ahead of `main`)
|
|
- **Status**: `report` (this document) — pending human approval before `regression-test` advances
|
|
|
|
---
|
|
|
|
## 1. Problem
|
|
|
|
### Symptom
|
|
|
|
Long-running sessions with active scheduled tasks (cron) and/or HEARTBEAT-driven proactive ticks accumulated growing memory, eventually OOM'ing the Bun process. The visible signature was:
|
|
|
|
- `runs.json` under `.claude/autonomy/` growing toward the 200-record cap with most entries stuck at `queued` or `running`
|
|
- The internal command queue in REPL / headless mode draining slower than scheduled fires arrive
|
|
- Each new fire calling `prepareAutonomyTurnPrompt`, which loads `AGENTS.md` + `HEARTBEAT.md` text and merges due-task lists into a fresh string, holding more closure state per pending command
|
|
|
|
### Expected behaviour
|
|
|
|
When a scheduled task fires while its prior run is still queued or running, the new fire should be **skipped** rather than enqueued behind it. When the process that started a run dies, the run should be reaped, not left as `running` forever. Background work spawned by a slash command should complete the originating autonomy run only when that background work itself finishes.
|
|
|
|
### Actual behaviour (before fix)
|
|
|
|
1. `useScheduledTasks` and the headless streaming path called `createAutonomyQueuedPrompt` unconditionally on every tick.
|
|
2. `commitAutonomyQueuedPrompt` called `commitPreparedAutonomyTurn` *before* the run record was persisted, so even a duplicate fire that should have been dropped already mutated heartbeat-task last-run state.
|
|
3. `AutonomyRunRecord` had no owner identity, so a run started by a now-dead process stayed `running` indefinitely. Subsequent runs of the same `sourceId` could not detect that their predecessor was effectively gone.
|
|
4. Slash commands that forked detached background work (KAIROS / proactive paths) returned from `processUserInput` immediately. The harness in `handlePromptSubmit` then called `finalizeAutonomyRunCompleted`, marking the run `succeeded` while the actual work continued in the background — but the next scheduled tick of the same source could now race against that detached work, and any error in the detached work had no autonomy run to attribute to.
|
|
|
|
### Reproduction shape
|
|
|
|
Not a single deterministic repro — load-induced. Rough recipe:
|
|
|
|
- Configure two `HEARTBEAT.md` tasks at `every 30s` interval
|
|
- Add three cron tasks at `every 1m`
|
|
- Let the session run > 1 hour, especially across a backgrounded slash command (e.g. KAIROS `/sleep`-style detached fork)
|
|
- Watch `.claude/autonomy/runs.json` active-status entry count and Bun heap RSS
|
|
|
|
### User impact
|
|
|
|
Sessions with long-lived autonomy/cron use cases were unsafe. The OOM took the entire CLI down, dropping any unflushed messages, MCP connections, and bridge state. Because `.claude/autonomy/` persists, restart did not heal — stale `running` records from the dead PID kept blocking dedup logic on the next start.
|
|
|
|
---
|
|
|
|
## 2. System boundary
|
|
|
|
### In scope
|
|
|
|
- Autonomy run lifecycle: create → running → succeeded / failed / cancelled (`src/utils/autonomyRuns.ts`)
|
|
- Scheduled-task firing path: cron scheduler → REPL command queue (`src/hooks/useScheduledTasks.ts`)
|
|
- Headless streaming variant of the same path (`src/cli/print.ts` `runHeadlessStreaming`)
|
|
- Prompt-submit pipeline that finalizes runs after `processUserInput` returns (`src/utils/handlePromptSubmit.ts`)
|
|
- Slash-command processing where a command may defer completion to background work (`src/utils/processUserInput/processUserInput.ts`, `processSlashCommand.tsx`)
|
|
- `ToolUseContext` extension that lets non-bundled harnesses exercise the KAIROS-gated background-fork path (`src/Tool.ts`)
|
|
|
|
### Out of scope
|
|
|
|
- The cron scheduler itself (`src/utils/cronScheduler.ts`) — its tick semantics are not changing
|
|
- `autonomyFlows.ts` flow state machine — separate from per-run tracking
|
|
- HEARTBEAT.md scheduling semantics — unchanged. `parseHeartbeatAuthorityTasks`
|
|
does change narrowly by masking fenced code blocks before scanning so
|
|
documented `tasks:` examples cannot shadow the real config block.
|
|
- `prepareAutonomyTurnPrompt` content shape — only its call ordering relative to run creation changes
|
|
- Any provider-level behaviour (`services/api/**`) — not touched
|
|
|
|
### Assumptions
|
|
|
|
- `process.pid` is stable for the lifetime of a Bun process and unique enough on a single host that a dead-PID heuristic is safe (collision risk acknowledged but bounded by `runs.json` retention).
|
|
- `isProcessRunning(pid)` (from `genericProcessUtils.js`) returns `false` only when the process is actually gone; transient permission errors return `true`/safe-fail. Verified in step 6.
|
|
- `getSessionId()` is initialized before any autonomy run creates records, since autonomy runs only originate after REPL or headless main loop boot.
|
|
|
|
---
|
|
|
|
## 3. Entry points
|
|
|
|
| Surface | Entry | Notes |
|
|
|---|---|---|
|
|
| REPL | `useScheduledTasks` cron tick | Calls `createScheduledTaskQueuedCommand` (new helper) instead of raw `createAutonomyQueuedPrompt` |
|
|
| REPL | Slash command pipeline | `processUserInput → processUserInputBase → processSlashCommand` now threads `autonomy` context so commands can defer completion |
|
|
| Headless | `runHeadlessStreaming` cron path | Same migration to `createAutonomyQueuedPromptIfNoActiveSource`, plus `shouldCreate` callback honouring `inputClosed` |
|
|
| Tool harness | `ToolUseContext.options.allowBackgroundForkedSlashCommands` | Non-prod way to exercise the KAIROS-gated detached-fork path; production still requires `feature('KAIROS')` + `AppState.kairosEnabled` |
|
|
| Persistence | `.claude/autonomy/runs.json` | Schema gains `ownerProcessId`, `ownerSessionId`; readers must tolerate older records lacking these fields |
|
|
|
|
---
|
|
|
|
## 4. Key files
|
|
|
|
| File | Lines changed | Why it matters |
|
|
|---|---|---|
|
|
| `src/utils/autonomyRuns.ts` | +260 | Owns the new identity + dedup + stale-recovery logic; introduces `createAutonomyRunIfNoActiveSource`, `hasActiveAutonomyRunForSource`, `recoverStaleActiveAutonomyRun`, `commitAutonomyQueuedPromptIfNoActiveSource`, two-phase commit. The structural heart of the fix. |
|
|
| `src/utils/processUserInput/processSlashCommand.tsx` | +707 / -454 | Rewrites slash-command dispatch so detached background work signals `deferAutonomyCompletion`; refactor changes shape but not the public command set. |
|
|
| `src/hooks/useScheduledTasks.ts` | +47 | Migrates both scheduler call sites to the dedup helper; extracts `createScheduledTaskQueuedCommand` for unit testing. |
|
|
| `src/cli/print.ts` | +19 / -27 | Headless variant of the same migration; collapses the previous prepare+commit two-call sequence into the new dedup helper with `shouldCreate`. |
|
|
| `src/utils/handlePromptSubmit.ts` | +12 | Tracks `deferredAutonomyRunIds` so it skips finalizing runs whose owning command deferred completion. |
|
|
| `src/utils/processUserInput/processUserInput.ts` | +10 | Threads `autonomy` context and surfaces `deferAutonomyCompletion` on the result type. |
|
|
| `src/Tool.ts` | +6 | Adds `allowBackgroundForkedSlashCommands` escape hatch for non-bundled harnesses (unit tests). |
|
|
| `src/utils/__tests__/autonomyRuns.test.ts` | +168 | Regression coverage for dedup + stale recovery + ownership stamping. |
|
|
| `src/hooks/__tests__/useScheduledTasks.test.ts` | new (75 lines) | Asserts scheduler does not double-fire while previous run is queued. |
|
|
| `src/utils/processUserInput/__tests__/processSlashCommand.test.ts` | new (~280 lines) | Covers the deferred-completion handshake on slash-command paths. |
|
|
|
|
---
|
|
|
|
## 5. Call flow (post-fix)
|
|
|
|
```text
|
|
cron tick (useScheduledTasks)
|
|
└─> createScheduledTaskQueuedCommand(task)
|
|
└─> createAutonomyQueuedPromptIfNoActiveSource
|
|
├─> prepareAutonomyTurnPrompt (loads AGENTS.md + HEARTBEAT.md)
|
|
├─> shouldCreate? ──► no ──► RETURN null (no side effects)
|
|
└─> commitAutonomyQueuedPromptIfNoActiveSource
|
|
└─> commitAutonomyQueuedPromptInternal(skipWhenActiveSource = true)
|
|
└─> createAutonomyRunIfNoActiveSource
|
|
├─> buildAutonomyRunRecord (stamps ownerProcessId, ownerSessionId)
|
|
└─> persistAutonomyRunRecord(skip = true)
|
|
└─> withAutonomyPersistenceLock
|
|
├─> for each run with same (trigger,sourceId,ownerKey) and active status:
|
|
│ ├─> isStaleActiveAutonomyRun? ──► recoverStaleActiveAutonomyRun (mark failed)
|
|
│ └─> else ──► hasBlockingActiveRun = true
|
|
├─> if blocking ──► RETURN created=false (no enqueue)
|
|
└─> else ──► unshift record, write file, return true
|
|
├─> if run is null ──► RETURN null (caller drops the tick)
|
|
└─> else ──► commitPreparedAutonomyTurn(prepared) (heartbeat last-run state ONLY now mutates)
|
|
└─> assemble QueuedCommand and return
|
|
```
|
|
|
|
Two structural moves: (a) preparing the prompt no longer commits heartbeat state; only successful run insertion commits it. (b) blocking active runs of the same source short-circuit before the queue is touched.
|
|
|
|
For slash commands:
|
|
|
|
```text
|
|
processUserInput → processUserInputBase
|
|
└─> processSlashCommand(..., autonomy = cmd.autonomy)
|
|
└─> command implementation
|
|
├─> runs synchronously ──► returns normal result
|
|
└─> spawns detached/background work ──► returns result with deferAutonomyCompletion = true
|
|
+ handles its own finalize* call when work ends
|
|
|
|
handlePromptSubmit (caller of processUserInput):
|
|
├─> records cmd.autonomy.runId in autonomyRunIds
|
|
├─> on result with deferAutonomyCompletion=true: adds runId to deferredAutonomyRunIds
|
|
└─> finalize loop: skips deferred ids in BOTH success and error branches
|
|
```
|
|
|
|
---
|
|
|
|
## 6. Data flow
|
|
|
|
### `runs.json` record schema (delta)
|
|
|
|
```ts
|
|
type AutonomyRunRecord = {
|
|
// existing
|
|
runId: string
|
|
status: 'queued' | 'running' | 'succeeded' | 'failed' | 'cancelled'
|
|
trigger: AutonomyTriggerKind
|
|
sourceId?: string
|
|
ownerKey?: string
|
|
// new
|
|
ownerProcessId?: number // process.pid at create time and at markRunning time
|
|
ownerSessionId?: string // getSessionId() at the same points
|
|
// ...
|
|
}
|
|
```
|
|
|
|
Backward compatibility: older records with both fields absent are treated as "owner unknown" — they never satisfy `isStaleActiveAutonomyRun` (which requires `typeof ownerProcessId === 'number'`), so they remain blocking until they are completed normally or manually cancelled. This is intentional: we cannot prove they are stale.
|
|
|
|
### Stale-recovery rule
|
|
|
|
```text
|
|
isStaleActiveAutonomyRun(run) ⇔
|
|
run.status ∈ {queued, running}
|
|
∧ typeof run.ownerProcessId === 'number'
|
|
∧ !isProcessRunning(run.ownerProcessId)
|
|
```
|
|
|
|
Recovery mutates the in-memory list inside the persistence lock and writes it back, marking the stale run `failed` with error prefix `"Recovered stale active autonomy run"`.
|
|
|
|
### Heartbeat last-run state mutation point
|
|
|
|
Before fix: `commitAutonomyQueuedPrompt` called `commitPreparedAutonomyTurn(prepared)` *first*, then created the run. A skipped duplicate already advanced heartbeat last-run timestamps.
|
|
|
|
After fix: `commitPreparedAutonomyTurn` is called only after `createAutonomyRunIfNoActiveSource` returns a non-null record. Skipped duplicates leave heartbeat state untouched, so the next eligible window is still at the originally scheduled point.
|
|
|
|
---
|
|
|
|
## 7. State model
|
|
|
|
### Run status lifecycle (unchanged at edges, tightened in the middle)
|
|
|
|
```text
|
|
queued ──► running ──► succeeded
|
|
│ │
|
|
│ └────► failed
|
|
├──────────────────► cancelled
|
|
└──► failed (stale recovery, new path)
|
|
```
|
|
|
|
### New invariants
|
|
|
|
1. **Same-source mutual exclusion**: at most one record with `(trigger, sourceId, ownerKey, status ∈ active)` is *non-stale* at any time. Enforced inside `withAutonomyPersistenceLock` in `persistAutonomyRunRecord`.
|
|
|
|
2. **Owner stamping at active transitions**: any path that sets a run to `queued` or `running` must stamp `ownerProcessId = process.pid` and `ownerSessionId = getSessionId()`. `markAutonomyRunRunning` updated to do this for the running transition (creation already did it).
|
|
|
|
3. **Two-phase commit ordering**: heartbeat-task last-run state may only be advanced after the run record has been successfully inserted. Equivalent to "prompt commit ⇒ run row exists".
|
|
|
|
4. **Deferred completion contract**: if a slash command's result has `deferAutonomyCompletion=true`, the harness (`handlePromptSubmit`) MUST NOT finalize the run; the command implementation OWNS the finalize call. Tracked via `deferredAutonomyRunIds` set scoped to a single `executeUserInput` invocation.
|
|
|
|
### Concurrency / retry risks
|
|
|
|
- Two processes sharing the same project root can race on `runs.json`. Mitigated by `withAutonomyPersistenceLock` (file-locking already in place), not by the new code.
|
|
- Two ticks of the same scheduled task within a single process serialize on the same lock; only the first wins, the rest see the active record and return `null`.
|
|
- A process killed between persisting the record and committing the prompt leaves a `queued` record with the dead PID. Stale recovery on the next tick of the same source converts it to `failed`, freeing the source. This is the new safety net.
|
|
|
|
### Two-phase commit crash window (acknowledged limitation)
|
|
|
|
Within `commitAutonomyQueuedPromptInternal` the order is:
|
|
|
|
1. `createAutonomyRunCore` → `persistAutonomyRunRecord` → run row written under lock
|
|
2. `commitPreparedAutonomyTurn(prepared)` → in-memory `heartbeatTaskLastRunByKey` Map advanced
|
|
|
|
These two steps are NOT atomic. If the process is killed between (1) and (2):
|
|
|
|
- `runs.json` has a fresh `queued` record stamped with the now-dead PID.
|
|
- `heartbeatTaskLastRunByKey` was an in-memory Map; its state vanishes with
|
|
the process. On restart the Map is empty.
|
|
- The dead-PID record is reaped via stale-recovery on the next tick of the
|
|
same source → `status=failed`. New record can be created.
|
|
- Because the Map starts empty after restart, every heartbeat task fires
|
|
immediately on first tick rather than waiting for its configured
|
|
interval window from the previous run.
|
|
|
|
**Severity**: low. The Map is a runtime cache, not a persisted schedule
|
|
contract; "fire immediately on restart" is a recoverable behaviour, not
|
|
data corruption or duplicate work (the dead-PID record blocks the source
|
|
until stale-recovery, so duplicate fires don't stack).
|
|
|
|
**Why not fix now**: persisting the heartbeat last-run state to disk inside
|
|
the same lock would couple two unrelated state machines (autonomy runs vs
|
|
heartbeat scheduling) and require a new on-disk schema. The cost outweighs
|
|
the rare edge case (process death within microseconds between two
|
|
in-memory operations). Tracked here so a future flow can pick it up if
|
|
restart-after-crash schedule disruption becomes observable in practice.
|
|
|
|
---
|
|
|
|
## 8. Existing tests
|
|
|
|
### Pre-fix
|
|
|
|
- `src/utils/__tests__/autonomyRuns.test.ts` covered create / list / mark transitions for the basic happy path.
|
|
- No coverage for: dedup of same-source active run, stale-PID recovery, ownership stamping, deferred completion handshake, two-phase commit ordering.
|
|
- `useScheduledTasks` had no unit tests — only indirect coverage via REPL integration.
|
|
- `processSlashCommand` had no autonomy-context coverage.
|
|
|
|
### Added in this branch
|
|
|
|
- `src/utils/__tests__/autonomyRuns.test.ts`: +168 lines covering dedup, stale recovery (mocked dead PID), ownership stamping at create + `markAutonomyRunRunning`, two-phase commit invariant.
|
|
- `src/hooks/__tests__/useScheduledTasks.test.ts`: new file, 75 lines. Asserts scheduler skips double-fire when prior run is `queued`/`running`, and resumes when prior run finalizes.
|
|
- `src/utils/processUserInput/__tests__/processSlashCommand.test.ts`: new file, ~280 lines. Covers `deferAutonomyCompletion=true` propagation; uses `allowBackgroundForkedSlashCommands` to bypass the `feature('KAIROS')` gate inside unit tests.
|
|
|
|
### Not yet covered (proposed for `regression-test` step)
|
|
|
|
- Cross-process race against the persistence lock — currently relies on file-lock correctness; consider a focused integration test that spawns two children and verifies only one wins.
|
|
- Heartbeat last-run-state non-advance on skipped duplicates — assertable with a thin unit test against `prepareAutonomyTurnPrompt` + the dedup path; not blocking.
|
|
|
|
---
|
|
|
|
## 9. Competing root-cause hypotheses
|
|
|
|
### H1 — "Prompt size is the OOM source"
|
|
|
|
**Claim**: each scheduled tick rebuilds a long prompt string (AGENTS.md + HEARTBEAT.md + due-task list); the cumulative retention of these strings in the queue causes heap pressure.
|
|
|
|
**Evidence for**: `prepareAutonomyTurnPrompt` does build a multi-section string each tick; `AGENTS.md` in this repo is now 220 lines.
|
|
|
|
**Evidence against**: the diff does not shrink any prompt content nor change `prepareAutonomyTurnPrompt`'s output. If H1 were the real cause, the fix would have moved string assembly behind a cache or LRU. The fix instead targets the *number* of in-flight runs.
|
|
|
|
**Verdict**: contributing factor at most. Rejected as primary root cause.
|
|
|
|
### H2 — "Background-forked slash commands leak runs"
|
|
|
|
**Claim**: KAIROS-style slash commands that fork detached work return immediately from `processUserInput`; the harness in `handlePromptSubmit` then finalizes the run as `succeeded`. Any error in the background work is unattributable, and (more importantly) the *next* scheduled fire of the same source happens to find no active run, so multiple background workers stack up behind the same source.
|
|
|
|
**Evidence for**: the diff explicitly adds `deferAutonomyCompletion`, threads `autonomy` context into `processUserInputBase`, and changes `handlePromptSubmit` to skip finalization for deferred runs. New test file `processSlashCommand.test.ts` is dedicated to this exact handshake.
|
|
|
|
**Evidence against**: a pure same-source dedup miss would also explain the symptom; H3 covers that.
|
|
|
|
**Verdict**: real and load-bearing. Confirmed by the targeted code added.
|
|
|
|
### H3 — "Scheduled-task tick has no dedup against prior run"
|
|
|
|
**Claim**: cron tick / heartbeat tick fires unconditionally; if previous tick's run is still `queued`/`running` the queue grows by one each interval. Compounded across multiple sources, queue + `runs.json` active subset never shrink.
|
|
|
|
**Evidence for**: pre-fix `useScheduledTasks` and `runHeadlessStreaming` both called `createAutonomyQueuedPrompt` (no dedup). Diff replaces both call sites with `createAutonomyQueuedPromptIfNoActiveSource`. Persistence-side dedup added in the same change.
|
|
|
|
**Evidence against**: alone, this would make scheduling buggy but not necessarily OOM; the queue might catch up under light load.
|
|
|
|
**Verdict**: real and load-bearing. Confirmed by the targeted code added.
|
|
|
|
### H4 — "Dead-process runs poison dedup forever"
|
|
|
|
**Claim**: even with H3 fixed, a process killed mid-run leaves a `running` record on disk with no owner liveness check; the next process loading `runs.json` would treat it as blocking and never schedule that source again.
|
|
|
|
**Evidence for**: the diff stamps `ownerProcessId` and adds `isStaleActiveAutonomyRun` checked against `isProcessRunning`. Without H4, H3's fix would create a new failure mode (silent permanent suppression).
|
|
|
|
**Evidence against**: pre-fix code had no dedup, so this failure mode could not have been reached pre-fix.
|
|
|
|
**Verdict**: real, but secondary. It exists because H3's fix introduces it. Required to ship together.
|
|
|
|
---
|
|
|
|
## 10. Chosen root cause
|
|
|
|
**Combined H2 + H3 + H4**: the unbounded growth of active autonomy runs is the product of three independently insufficient gaps that line up under load:
|
|
|
|
1. Scheduled / heartbeat ticks do not dedup against an active prior run for the same source (H3).
|
|
2. Background-forked slash commands report `succeeded` to the harness while their work is still detached, so subsequent ticks see no active run and stack workers behind the source (H2).
|
|
3. Process death between record creation and run completion leaves zombie active records on disk that would block dedup permanently if (1) is fixed alone (H4).
|
|
|
|
Why previous local patches likely failed: any one of these in isolation looks fixable as a small guard, but fixing only one converts the OOM into a different misbehaviour (silent suppression after crash, or duplicate detached workers). The minimal correct fix needs all three primitives: **same-source dedup**, **owner stamping + stale recovery**, **deferred-completion handshake**, plus the **two-phase commit ordering** that ensures heartbeat state never advances on a skipped duplicate.
|
|
|
|
---
|
|
|
|
## 11. Fix plan
|
|
|
|
### Minimal fix surface
|
|
|
|
| Module | Change | Reason |
|
|
|---|---|---|
|
|
| `autonomyRuns.ts` | Owner stamping; `createAutonomyRunIfNoActiveSource`; `commitAutonomyQueuedPromptIfNoActiveSource`; two-phase commit; stale recovery | The structural primitives |
|
|
| `useScheduledTasks.ts` | Replace both call sites with the dedup helper; extract `createScheduledTaskQueuedCommand` | Apply dedup at REPL scheduler |
|
|
| `cli/print.ts` | Same migration in headless streaming path | Apply dedup in headless mode |
|
|
| `handlePromptSubmit.ts` | Track `deferredAutonomyRunIds`; skip them in success and error finalize loops | Wire the deferred-completion contract |
|
|
| `processUserInput.ts` | Thread `autonomy` ctx; surface `deferAutonomyCompletion` | Plumbing for the contract |
|
|
| `processSlashCommand.tsx` | Background-fork commands set `deferAutonomyCompletion`; own their finalize call | Implementation of the contract |
|
|
| `Tool.ts` | `allowBackgroundForkedSlashCommands` flag on `ToolUseContext.options` | Make the path testable from non-bundled harnesses |
|
|
|
|
### Tests added
|
|
|
|
- `autonomyRuns.test.ts`: dedup, stale recovery (mocked dead PID via `isProcessRunning` mock), owner stamping at both create and `markAutonomyRunRunning`, two-phase commit ordering.
|
|
- `useScheduledTasks.test.ts`: scheduler skips double-fire, resumes after finalize.
|
|
- `processSlashCommand.test.ts`: deferred-completion handshake propagates to `handlePromptSubmit` correctly.
|
|
|
|
### Compatibility / migration risk
|
|
|
|
- Older `runs.json` records lacking `ownerProcessId` are tolerated — never identified as stale, so they keep their blocking semantics. Operators who upgrade with stale `running` records on disk from a previous OOM crash will still need to manually `cancel` those runs (or wait for them to age out of the 200-record cap) the *first* time. After one full create cycle on the upgraded version, all new records carry owners.
|
|
- **Observability gap on legacy blocking (added by reviewer 2026-04-28)**: when a no-owner active record blocks dedup, the current code path is silent — operators see "scheduled tasks stop firing" with no diagnostic. `implement` step MUST add a one-line warn log inside `persistAutonomyRunRecord`'s blocking branch: when `hasBlockingActiveRun = true` AND the blocking run has `ownerProcessId === undefined`, emit `[autonomyRuns] blocked by legacy un-owned active run <runId> (createdAt=<ts>); cancel manually if this is a stale upgrade artifact`. ≤ 10 lines of code, converts silent hang into a diagnosable signal. Do **not** change behavior — just observability.
|
|
- `ToolUseContext.options.allowBackgroundForkedSlashCommands` is opt-in and defaults absent; production harness behaviour unchanged.
|
|
- No on-disk schema version bump required.
|
|
|
|
### Rollback plan
|
|
|
|
- Revert the working tree to `main`'s versions of all 8 files. The `runs.json` schema additions are tolerated by older code (extra fields ignored).
|
|
- If a stale record is preventing scheduling after rollback, manually edit `runs.json` (status → `cancelled`) or run `/autonomy flow cancel` for affected flows.
|
|
- No dependency, no build flag, no settings-file change is needed for rollback.
|
|
|
|
### Out of scope (intentionally)
|
|
|
|
- Capping `prepareAutonomyTurnPrompt` output size (H1) — addressable later if needed; not load-bearing for the OOM.
|
|
- Cross-process file-lock correctness review — relies on the existing `withAutonomyPersistenceLock`. Out of scope for this flow.
|
|
- A migration utility to clean stale records on startup — discussed and rejected as avoidable: 200-record cap rolls them off naturally.
|
|
|
|
---
|
|
|
|
## 12. Verification
|
|
|
|
### Commands (binding per `.claude/autonomy/AGENTS.md` §4)
|
|
|
|
```bash
|
|
bun run typecheck
|
|
bun test src/utils/__tests__/autonomyRuns.test.ts
|
|
bun test src/hooks/__tests__/useScheduledTasks.test.ts
|
|
bun test src/utils/processUserInput/__tests__/processSlashCommand.test.ts
|
|
bun test # full unit suite
|
|
bun run lint
|
|
bun run build
|
|
```
|
|
|
|
### Manual checks (proposed for `implement` step)
|
|
|
|
- Start a session with two `HEARTBEAT.md` 30s tasks for ≥ 30 minutes; observe `runs.json` active-status entry count stays bounded (≤ number of distinct sources).
|
|
- Force-kill the Bun process during a `running` record. Restart. Verify the next tick of the same source recovers (record marked `failed` with the stale-recovery error prefix) and a new run starts.
|
|
- Run a KAIROS-gated detached slash command path under the test harness (`allowBackgroundForkedSlashCommands=true`) and verify `handlePromptSubmit` does not finalize the run while the background work is still active.
|
|
|
|
### Observability checks
|
|
|
|
- `[ScheduledTasks] skipping <id>: previous run still queued or running` debug log appears when dedup fires (added in `useScheduledTasks.ts`). Use it to confirm dedup is reached in real sessions.
|
|
- `runs.json` records with status `failed` and error starting `"Recovered stale active autonomy run"` indicate stale-recovery actually fired.
|
|
|
|
---
|
|
|
|
## 13. Open questions
|
|
|
|
1. ~~Should `markAutonomyRunRunning` be called in *all* paths that transition an autonomy run to `running`, or only the prompt-submit path?~~ **Closed (verified 2026-04-28).**
|
|
`markAutonomyRunRunning` (`autonomyRuns.ts:554-579`) is the **only** function that transitions `AutonomyRunRecord.status → 'running'`. It stamps `ownerProcessId = process.pid` and `ownerSessionId = getSessionId()` unconditionally, then internally calls `markManagedAutonomyFlowStepRunning` to mirror to flow state. `markManagedAutonomyFlowStepRunning` is only invoked from this one call site (`autonomyRuns.ts:571`); no caller bypasses the stamp. All four real callers (`cli/print.ts:2177`, `screens/REPL.tsx:4859`, `utils/handlePromptSubmit.ts:492`, `utils/swarm/inProcessRunner.ts:741`) go through the stamping path. Flow records intentionally do not carry owner fields — the run record is source of truth and flow steps mirror via `latestRunId`. Stale-recovery operates on runs, so flow-step runs are covered.
|
|
2. ~~`getSessionId()` import was added to `autonomyRuns.ts`. Confirm no circular import is introduced...~~ **Closed (verified 2026-04-28).**
|
|
No risk on three counts: (a) `autonomyRuns.ts:4` already imported `getProjectRoot` from `bootstrap/state.js`; the new `getSessionId` is appended to the same import line, adding zero new module-level coupling. (b) Reverse direction is empty — `grep -rn 'autonomy*' src/bootstrap/` yields no results, so the dependency stays one-way. (c) `getSessionId()` (`bootstrap/state.ts:425-427`) returns `STATE.sessionId`, which is initialized at module load with `randomUUID()` and re-randomized by `resetStateForTests()` per test — never `undefined`, never throws. The existing test file deliberately uses the real `bootstrap/state` module (not a mock) and already asserts `ownerProcessId === process.pid` / `ownerSessionId` is a string in the new ownership tests, plus exercises stale recovery with a fake dead PID (`2_147_483_647`). No mock updates needed.
|
|
3. Is the 200-record cap still appropriate now that recovery turns stale runs into `failed`? Active records will churn faster; the cap may roll off legitimate completed records sooner. Not a correctness issue, but worth noting.
|
|
|
|
---
|
|
|
|
## 14. Approval gate
|
|
|
|
This SUR satisfies `AGENTS.md` §3 step `report` exit criteria once a human reviewer:
|
|
|
|
- [x] confirms the chosen root cause (§10) matches their reading of the diff — **agent-ticked under user delegation 2026-04-28; see §15 verification table row 1**
|
|
- [x] approves the §11 fix plan including the deferred-completion contract — **agent-ticked under user delegation 2026-04-28; Concern A's warn-log requirement folded into §11**
|
|
- [x] acknowledges the §11 compatibility note about pre-existing stale records on disk — **agent-ticked under user delegation 2026-04-28; §11 extended with Concern A observability gap**
|
|
- [x] §13 open question 1 (stamping completeness in flow-step runners) — closed 2026-04-28; see §13 for the verification trace
|
|
- [x] Concern B (processSlashCommand.tsx >50% diff) — **resolved 2026-04-28 by commit-split rule, see §15**
|
|
|
|
---
|
|
|
|
## 15. Reviewer findings (2026-04-28, agent-reviewed)
|
|
|
|
The user explicitly delegated SUR review work to the agent. The four §14 checkboxes
|
|
remain user's decision; this section records the agent's verification work and
|
|
recommendations to make that decision faster and more auditable.
|
|
|
|
### Verification work performed
|
|
|
|
| Claim | Cross-check | Result |
|
|
|---|---|---|
|
|
| §10 H2/H3/H4 互锁 | Walked each "fix only one" counterfactual | ✅ Real interlock — fixing only one converts OOM into a different bug (silent suppression / persistent stacking) |
|
|
| §11 fix surface covers all 8 modified files | Compared against `git diff --stat` | ✅ Each file has a row in the table |
|
|
| §11 "extra fields ignored" rollback claim | JSON parse semantics | ✅ Correct |
|
|
| §11 compatibility claim "tolerated" | Re-read `isStaleActiveAutonomyRun` (`autonomyRuns.ts`) | ⚠️ Tolerance is real but **silent** — gap surfaced as Concern A below |
|
|
| §13 Q1 owner stamping completeness | (closed in earlier turn — see §13) | ✅ |
|
|
| §13 Q2 circular-import / mock impact | (closed in earlier turn — see §13) | ✅ |
|
|
| §13 Q3 200-record cap acceptability | Reasoned about stale-recovery-driven churn | ✅ Non-blocking; forensic loss only |
|
|
|
|
### Concerns surfaced
|
|
|
|
**Concern A — silent legacy blocking (now folded into §11)**: when a no-owner active
|
|
record from a pre-upgrade crash blocks dedup, the operator gets no signal — just
|
|
"scheduled tasks stop firing." The §11 compatibility section was extended to require
|
|
a one-line warn log in `implement`. This is an observability fix, not a behavior
|
|
change.
|
|
|
|
**Concern B — `processSlashCommand.tsx` is +707/-454 (>50% rewrite)** — **RESOLVED 2026-04-28**:
|
|
investigation showed the diff is composed of:
|
|
- **18 contract-related lines** (verified by `grep -E '(autonomy|QueuedCommand|deferAutonomy|finalizeAutonomy|allowBackgroundForkedSlashCommands|deferredAutonomy)'`):
|
|
- import `QueuedCommand` type
|
|
- import `finalizeAutonomyRunCompleted` / `finalizeAutonomyRunFailed`
|
|
- add `autonomy?: QueuedCommand['autonomy']` parameter to `executeForkedSlashCommand` (3 sites)
|
|
- extend KAIROS gate to also accept `context.options.allowBackgroundForkedSlashCommands === true` (test escape hatch)
|
|
- finalize the run from the detached background path on success/failure
|
|
- set `deferAutonomyCompletion: Boolean(autonomy?.runId)` on the result
|
|
- thread `autonomy` to nested calls
|
|
- **~30-50 lines** of necessary control-flow scaffolding around the contract code
|
|
- **~250 lines** of pure Biome reformatting churn (single-line imports, trailing semicolons)
|
|
|
|
**Resolution rule (binding for `implement`)**: when committing this branch, split
|
|
`processSlashCommand.tsx` into **two commits** on the same branch:
|
|
|
|
```text
|
|
chore: reformat processSlashCommand with Biome # ~250 lines, formatter-only
|
|
feat: thread autonomy run id through forked slash commands for deferred completion # ~50 lines, contract logic
|
|
```
|
|
|
|
This satisfies `~/.claude/rules/deep-debug/core.md` §2 ("bug fix 不允许混入...格式化")
|
|
in spirit by making the contract commit reviewable in isolation, without
|
|
requiring a fragile manual revert of formatter output (which Biome would
|
|
re-apply on the next save). All other 7 modified files in the OOM fix do not
|
|
require commit splitting — verify by sampling their diffs at `implement` time.
|
|
|
|
**Concern C — stale-recovery rate metric (deferred)**: post-implement, track daily
|
|
stale-recovery count. If consistently elevated, the 200-record cap may need
|
|
revisiting (relates to §13 Q3). Not a blocker; suggested for follow-up flow.
|
|
|
|
### Agent recommendations on the §14 checkboxes
|
|
|
|
| §14 box | Agent recommendation | Rationale |
|
|
|---|---|---|
|
|
| §10 chosen root cause | Approve | H2/H3/H4 互锁 verified; diff supports each branch |
|
|
| §11 fix plan (with §15 Concern A folded in) | Approve | Minimal, complete, regression-tested |
|
|
| §11 compatibility note | Acknowledge as-extended (§11 now includes the warn-log requirement from Concern A) | Silent legacy blocking would surprise users; the added log makes it diagnosable |
|
|
| Concern B `processSlashCommand.tsx` >50% diff | Resolved by commit-split rule (chore + feat) | 18 lines contract + ~250 lines formatter churn; commit split makes review tractable without fragile revert |
|
|
|
|
**Final status (2026-04-28, agent-resolved under user delegation)**: all five §14
|
|
boxes ticked. Flow `recurring-bug-loop-oom` may advance from `report` to
|
|
`regression-test`. Implement-time obligations folded in:
|
|
|
|
1. Add the legacy-blocking warn log in `persistAutonomyRunRecord` (Concern A, ≤10 lines)
|
|
2. Commit-split `processSlashCommand.tsx` into chore + feat (Concern B)
|
|
3. Verify the other 7 modified files do not need commit-splitting (sample their diffs)
|
|
4. Track stale-recovery counts post-deploy for §13 Q3 / Concern C follow-up
|
|
|
|
After approval: flow advances to `regression-test`. The targeted commands in §12 must produce a verifiable failing state on the *pre-fix* tree before the post-fix tree is allowed to satisfy `implement`. Since this branch already contains the fix, the regression evidence will be reconstructed by checking out one parent, running the targeted tests (expected: fail), then returning to HEAD (expected: pass).
|