Support workflow terminate rollback#14465
Conversation
🦋 Changeset detectedLatest commit: 6c313a6 The changes in this PR will be included in the next version bump. This PR includes changesets to release 7 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
4614a06 to
f539470
Compare
|
✅ All changesets look good |
4ccbf75 to
01fd44d
Compare
@cloudflare/autoconfig
create-cloudflare
@cloudflare/deploy-helpers
@cloudflare/kv-asset-handler
miniflare
@cloudflare/pages-shared
@cloudflare/unenv-preset
@cloudflare/vite-plugin
@cloudflare/vitest-pool-workers
@cloudflare/workers-auth
@cloudflare/workers-editor-shared
@cloudflare/workers-utils
wrangler
commit: |
f9b1b1a to
9ed9edf
Compare
|
@vaishnav-mk Bonk workflow was cancelled. View workflow run · To retry, trigger Bonk again. |
50bf663 to
2c0a39a
Compare
4bafb21 to
ef038d9
Compare
|
Codeowners approval required for this PR:
Show detailed file reviewers
|
ef038d9 to
b1be7f1
Compare
|
/bonk review |
There was a problem hiding this comment.
Review: Support workflow terminate rollback
Overall this is a well-structured change. The replay-to-rebuild-registry approach correctly addresses the "empty registry after DO restart/pause" limitation that Devin originally flagged, and the persisted hasRollback metadata is a clean way to know which cached steps to replay. Tests cover the live-registry, replayed-registry, and paused-registry cases. A few things worth addressing before merge.
Blocking / needs discussion
1. The remote (production) API path also sends rollback: true — is this intended/supported?
packages/wrangler/src/workflows/utils.ts:131 adds rollback: true to the body of the remote PATCH /accounts/.../instances/:id/status call. The PR description scopes this to "local Workflows tooling," and the engine implementation lives entirely in @cloudflare/workflows-shared (local). If the production Workflows REST API doesn't yet accept a rollback field, this either silently no-ops or could be rejected — yet the CLI unconditionally prints 🥷 The instance "..." was terminated successfully (terminate.ts:65). If the remote API isn't ready, consider gating --rollback to --local only (and erroring otherwise), so users don't get a false success message for a rollback that never ran. If the remote API is ready, the PR description should say so.
Non-blocking
2. No user-facing signal when rollback runs vs. is skipped.
This was Devin's original concern. The replay path mitigates the empty-registry case, but if getEligibleRollbackSteps() returns nothing (no eligible steps, or malformed metadata), rollbacks are silently skipped while the CLI still reports plain success. Consider surfacing in the success log whether rollbacks executed (e.g. count of rollback steps run), so the behavior is observable.
3. replayRollbackRegistry re-runs the user workflow — confirm side-effect safety for uncached steps.
engine.ts:983 calls USER_WORKFLOW.run(...) in replay phase. Cached step.do calls correctly short-circuit to their stored value (context.ts:438-455) and sleep/sleepUntil/waitForEvent return early in replay phase, so a paused/sleeping instance halts replay safely. But if an instance is terminated while a new, uncached step.do would be the next thing the user code reaches before any sleep/wait, that closure would actually execute with real side effects during replay. In practice terminate targets running/paused instances, so this is unlikely — but a comment documenting this assumption (replay relies on all rollback-eligible steps already being cached and execution halting at the next suspension point) would help future maintainers.
4. Swallowed errors in replayRollbackRegistry and userTriggeredTerminate.
engine.ts:987 catches and discards all replay errors, and engine.ts:1015 logs rollback failures only to console.error. That matches the production "best-effort" rollback model, but a malformed user workflow throwing during replay will silently produce zero rollbacks. At minimum, consider a debug-level log so local users can diagnose "why didn't my rollback run?"
Minor / nits
engine.ts:316-319insetRollbackPhase/rollbackPhaseuse a string-literal union ("replay" | "rollback" | undefined) repeated in several signatures. A small exported type alias would reduce duplication.resources/workflows.ts:923still castshandle as unknown as WorkflowHandleto callterminate(options). Same TODO as thewrapped-binding.worker.tscast — worth a single shared note/tracking issue referencing the@cloudflare/workers-typesterminate-options gap so both casts get removed together.getEligibleRollbackStepsreads allstatesrows twice and builds two sets; fine for local volumes, just noting it scans the full log table per terminate.
Tests / changeset
- Changeset (
minorfor workflows-shared/wrangler/miniflare) and wording look correct. - Wrangler tests now assert the exact request body for both local and remote
--rollback, which is good. If point (1) results in gating--rollbackto local-only, the remote test should be updated accordingly.
|
I reviewed PR #14465 and posted a review. Summary of what I found: Strengths
Main concern (blocking/discussion)
Non-blocking Nits: repeated phase string-union (extract a type), the remaining |
| private async replayRollbackRegistry( | ||
| metadata: InstanceMetadata | ||
| ): Promise<void> { | ||
| if (this.rollbackRegistry.size > 0) { | ||
| return; | ||
| } | ||
|
|
||
| const eligible = this.getEligibleRollbackSteps(); | ||
| if (eligible.length === 0) { | ||
| return; | ||
| } | ||
|
|
||
| this.rollbackEligibleCacheKeys = new Set(eligible); | ||
| const stubStep = this.createRollbackContext(); | ||
| this.setRollbackPhase("replay"); | ||
| try { | ||
| await this.env.USER_WORKFLOW.run( | ||
| metadata.event, | ||
| stubStep as unknown as WorkflowStep | ||
| ); | ||
| } catch { | ||
| // Match the production engine: replay may stop on normal workflow control | ||
| // flow; rollback execution uses whatever handlers replay registered. | ||
| } finally { | ||
| this.setRollbackPhase(undefined); | ||
| this.rollbackEligibleCacheKeys = undefined; | ||
| } |
There was a problem hiding this comment.
🚩 Replay of failed steps won't re-register their rollback handlers
During replayRollbackRegistry, if a step previously FAILED (retries exhausted or NonRetryableError), its error is stored in DO storage. When replay hits that step in do() (packages/workflows-shared/src/context.ts:457-463), it throws the cached error immediately WITHOUT calling #registerRollback. This means after a DO restart, terminating with { rollback: true } won't execute rollbacks for steps that originally failed (even though they had rollback handlers registered during the original execution). Steps that SUCCEEDED will replay correctly from cache and re-register their rollbacks. This limitation is acknowledged by the comment at packages/workflows-shared/src/engine.ts:988-989 ("replay may stop on normal workflow control flow") and may be acceptable for the local-dev use case, but it's a behavioral difference from in-memory rollback (where the registry is already populated).
Was this helpful? React with 👍 or 👎 to provide feedback.
Adds terminate rollback support across Workflows local tooling and the remote Wrangler terminate command.
This wires
rollback: truethrough the terminate path:@cloudflare/workflows-sharedbinding and local engineworkflows instances terminate --rollbackworkflows instances terminate --local --rollbackThe production Workflows API already accepts terminate rollback; this PR adds the SDK/Wrangler client surfaces and fixes local rollback recovery so rollback can run after local engine restart/eviction.
Rollback is only valid for terminate and is only serialized when explicitly set to
true.A picture of a cute animal (not mandatory, but encouraged)