
fix(running-in-ci): trim CI-poll loops to fit Bash tool's 10-min cap#695

Merged
max-sixty merged 2 commits into main from fix/issue-694
Jun 20, 2026

@tend-agent

Collaborator

Problem

The bundled running-in-ci skill's foreground CI-poll recipe (and the structurally-identical gh run rerun --failed rollup loop) iterated for i in $(seq 1 15); do sleep 60; ...; done — a ≥15-minute wall clock if CI didn't finish early. The Claude Code Bash tool's max configurable timeout is 600000 ms (10 min); past that the harness auto-backgrounds the call. The skill's own policy says background-completion notifications are not reliably delivered to a CI session, so the gated follow-up (dismiss approval on failure, post failure analysis) can't fire — and the bot's sleep workarounds get blocked by the harness's anti-polling guards, ending the session before the dismissal runs.

#694 documents six occurrences in a single 24h window on max-sixty/worktrunk (six of ~20 token-bearing tend-review runs — ~30%). All six PRs merged cleanly so the loss-of-dismissal path was not exercised, but the failure mode is structural and would fail open on a future red-CI run.

This is the same root-cause class as #674 (triage's full-suite gate exceeding the 10-min cap) but in a different code path — the gated CI dismissal path on review.

Solution

Trim both loops from seq 1 15 to seq 1 9. With 60s sleeps per iteration plus the at-most-once 30s grace re-check, the worst case sits comfortably inside the 600000 ms Bash cap so the loop runs to completion and the gated follow-up fires.
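The budget arithmetic above can be checked with a small shell sketch. The helper name `worst_case_seconds` and the variable `CAP_SECONDS` are illustrative and do not come from the skill; the calculation ignores per-iteration `gh`/`jq` overhead, and applying the 30s grace to the old 15-iteration loop is an assumption made only for comparison.

```shell
#!/usr/bin/env bash
# Hypothetical helper: worst-case wall clock of a poll loop, in seconds.
# Args: iterations, sleep seconds per iteration, one-off grace seconds.
worst_case_seconds() {
  local iterations=$1 sleep_s=$2 grace_s=$3
  echo $(( iterations * sleep_s + grace_s ))
}

CAP_SECONDS=600  # 600000 ms Bash tool cap, as stated in the PR description

# Compare the old (15) and trimmed (9) iteration counts against the cap.
# Assumption: the 30s grace re-check applies to both variants.
for n in 15 9; do
  total=$(worst_case_seconds "$n" 60 30)
  if (( total < CAP_SECONDS )); then
    echo "seq 1 $n: ${total}s fits under ${CAP_SECONDS}s"
  else
    echo "seq 1 $n: ${total}s exceeds ${CAP_SECONDS}s"
  fi
done
```

Under these assumptions the 15-iteration loop comes to 930s and the 9-iteration loop to 570s.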

Also added a one-line note that callers must invoke Bash with timeout: 600000 — the default 2-min Bash timeout would kill even the trimmed loop early.

Updated the inline references that cited the old 15-minute figure ("Polling for it deadlocks until the 15-min cap breaks it" → "loop cap", "CI still running after 15 minutes" → "9 minutes", "up to ~15 minutes" → "up to ~9 minutes").

Testing

Skill text fix — no automated test. Verified by reading the recipe end-to-end: 9 iterations × 60s + one 30s grace + small gh/jq overhead per iteration stays under 600s. The reporter's evidence (six session log traces showing the auto-background sequence) is the reproduction.
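To put a rough number on the "small gh/jq overhead" in that check, the sketch below divides the remaining headroom evenly across iterations. The helper name `overhead_budget_ms` is hypothetical, and the even split is a simplifying assumption (the grace re-check has its own overhead, which is not modelled here).

```shell
#!/usr/bin/env bash
# Hypothetical helper: milliseconds of per-iteration overhead the loop can
# absorb before reaching the cap. Integer arithmetic, rounded down.
# Args: cap in ms, iterations, sleep seconds per iteration, grace seconds.
overhead_budget_ms() {
  local cap_ms=$1 iterations=$2 sleep_s=$3 grace_s=$4
  local used_ms=$(( (iterations * sleep_s + grace_s) * 1000 ))
  echo $(( (cap_ms - used_ms) / iterations ))
}

# 600000 ms cap, 9 iterations of 60s, one 30s grace re-check.
overhead_budget_ms 600000 9 60 30   # prints 3333
```

With these inputs each iteration has roughly 3.3s of slack for its `gh`/`jq` calls before the whole loop would cross 600s.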


Closes #694 — automated triage

The foreground CI-poll recipe and the rerun-jobs poll both ran
`for i in $(seq 1 15); do sleep 60; ...; done` — a minimum of 15
minutes if CI didn't finish early. The Claude Code Bash tool's
max configurable timeout is 600000 ms (10 min), past which the
harness auto-backgrounds the call and blocks foreground `sleep`
waits. Once that happens the gated follow-up (dismiss approval
on failure, post failure analysis) can't fire in-session —
background-completion notifications are not reliably delivered
to a CI session, per the same skill's own policy.

Trim both loops to `seq 1 9` (about 9 min worst case) so the recipe
fits inside the harness cap, and add a note that callers must
invoke Bash with `timeout: 600000` (the default 2-min timeout
would kill the loop early).

Closes #694

Co-Authored-By: Claude <noreply@anthropic.com>

@tend-agent tend-agent left a comment

Collaborator Author

Self-review of the trim. Math checks out (9 × 60s + at most one 30s grace ≈ 570s, comfortably under the 600s Bash cap) and the inline 15-min/15 minutes references are all updated. One gap: the new timeout: 600000 caller note was only added next to the CI Monitoring loop, not the structurally-identical gh run rerun --failed rollup loop below — see inline. A reader who jumps straight to the rerun recipe still hits the 2-min default Bash timeout.

Comment thread plugins/tend-ci-runner/skills/running-in-ci/SKILL.md
Self-review caught that the timeout: 600000 caller note was only
added next to the CI Monitoring loop, not the structurally-identical
gh run rerun --failed rollup loop below. A reader following just the
rerun recipe would still hit the default 2-min Bash timeout.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@tend-agent

Collaborator Author

Two more occurrences from a different repo, tonight on PRQL/prql. Both are interactive-harness (max-sixty/tend/interactive@0.1.4) reviews of open Dependabot PRs where the bot followed the bundled pending() recipe verbatim with no run_in_background flag (foreground intent, timeout: 1100000–1200000). Both sessions ended around the 12-minute mark with the PTY tail showing "1 shell still running": the harness moved the Bash call to its background tasks dir (/tmp/claude-1002/.../tasks/<id>.output) and the Stop hook fired.

| Run | PR | Wall | Final assistant text | Outcome |
| --- | --- | --- | --- | --- |
| 27643779196 | PRQL/prql#6013 (js-yaml 4.1.1 → 4.2.0) | 11m 52s | "CI poll is running in the background. I'll continue when it completes." (then ScheduleWakeup(3600s), then end_turn) | No review |
| 27644108261 | PRQL/prql#6014 (Cargo patch group, 8 deps) | 12m 27s | "I'll wait for the CI poll to complete." (no wakeup, plain end_turn) | No review |

A third PR in the same Dependabot batch, PRQL/prql#6011 (insta-cmd 0.6.0 → 0.7.0), got a clean empty-body APPROVED review from the same session shape — its CI finished in ~6 min, well under the 10-min cap, so the loop returned and the verdict shipped.

Two notes on prioritization:

  1. The failure shape on these two is pre-APPROVE, not post-APPROVE. The issue body's documented failure was "if CI flips red after the auto-background, the dismiss-on-failure follow-up will silently not happen" — a fail-open hypothetical. What happened on PRQL/prql#6013 / #6014 is more visible: when the bot decides "CI is in flight, wait then approve" (per running-in-ci's "Read Context" / pre-APPROVE peek), the same auto-background cuts the wait short and no review of any kind is posted. The PRs sit open with no bot verdict, which the maintainer overlay treats as not-blocking-but-missing-the-bot's-pre-merge-check.

  2. The ScheduleWakeup fallback (#6013) didn't recover. This confirms the issue body's "background-completion notifications aren't reliably delivered" point on the interactive harness specifically — even with the wakeup explicitly scheduled, the session terminated and never resumed before the 3600s delay would have fired.

The seq 1 9 trim in this PR addresses both shapes by landing the loop's exit inside one Bash call, so the session can ship its verdict regardless of CI duration up to ~9 min. The pre-existing self-review note about the gh run rerun --failed rollup loop's timeout needing to be set on the caller still holds.
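The shape of a poll that exits inside one Bash call can be sketched as follows. This is not the skill's actual recipe: `poll_ci`, `check_cmd`, and `sleep_cmd` are hypothetical names introduced so the loop can be exercised without real waits, and the at-most-once 30s grace re-check is omitted for brevity.

```shell
#!/usr/bin/env bash
# Sketch of a bounded foreground poll (illustrative, not the bundled recipe).
# check_cmd: hypothetical command that exits 0 while CI is still pending
#            and non-zero once CI has settled.
# sleep_cmd: injection point for the wait; defaults to the real `sleep`.
# max_iters: iteration cap; 9 matches the trimmed loop in this PR.
poll_ci() {
  local check_cmd=$1 sleep_cmd=${2:-sleep} max_iters=${3:-9}
  local i
  for i in $(seq 1 "$max_iters"); do
    "$sleep_cmd" 60
    if ! "$check_cmd"; then
      echo "settled after $i iteration(s)"
      return 0
    fi
  done
  # Loop cap reached: return control to the caller instead of waiting longer.
  echo "still pending after $max_iters iteration(s)"
  return 1
}
```

Either way the function returns after at most `max_iters` waits, so the caller can post its verdict or report that CI is still running from the same Bash call, provided that call was made with `timeout: 600000`.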

Recording these two as additional structural-confirmation evidence in the PRQL/prql review-reviewers gist — cumulative for this shape is now 3 on PRQL/prql alone, plus the 6 in the issue body on max-sixty/worktrunk.

@max-sixty max-sixty merged commit 6b03677 into main Jun 20, 2026
7 checks passed
@max-sixty max-sixty deleted the fix/issue-694 branch June 20, 2026 04:38
@max-sixty max-sixty mentioned this pull request Jun 23, 2026
max-sixty added a commit that referenced this pull request Jun 23, 2026
#721 (base workflow-regen worktree on an open PR, not branch-ref
existence) merged to main after this release branch was cut; it's a
Fixed-scope change to the bundled nightly skill that adopters run, so
it belongs in the 0.1.7 notes alongside #695.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
max-sixty added a commit that referenced this pull request Jun 23, 2026
Bumps generator version to 0.1.7, syncs the lockfile, and adds the
`## 0.1.7` CHANGELOG section (published verbatim as the GitHub Release
notes on tag push).

Changes since 0.1.6: pin actions/checkout to v7 with review opting into
the fork-PR checkout guard (#725); bump claude-code to 2.1.185 (#719);
running-in-ci surfaces blocking scope rules instead of routing around
them (#717) and caps CI-poll loops to the Bash 10-min limit (#695);
nightly workflow-regen bases its worktree on an open PR rather than
branch-ref existence (#721); de-duplicate composite-action step bodies
under shared/steps/ with harness-named action paths (#712); correct the
codex effort list (#710); review-reviewers and worker-deploy doc fixes
(#707, #711).

Development

Successfully merging this pull request may close these issues.

running-in-ci: foreground CI poll's 15-min loop exceeds the Bash tool's 10-min cap, gated dismissal can't fire
