Forward termination to steps when a user cancels a resumable run#33925
Forward termination to steps when a user cancels a resumable run#33925DRACULA1729 wants to merge 1 commit into
Conversation
With run monitoring set up and maxResumeRunAttempts > 0, the step-delegating executor treated every termination signal as a pending resume and skipped forwarding it to in-flight steps. A user-initiated cancellation got swallowed the same way a pod crash would, so the running step never stopped. run_will_resume now returns False once a run is CANCELING or CANCELED. The daemon only ever resumes runs that are still in STARTED, so those two statuses were never actually resumable in the first place. Fixes dagster-io#33923.
Greptile SummaryThis PR fixes a bug where cancelling a resumable run from the UI left in-flight steps running indefinitely. The root cause was that
Confidence Score: 5/5Safe to merge — the change is a narrow, targeted guard in run_will_resume with no side effects on existing crash-recovery behaviour. The fix correctly identifies the two statuses (CANCELING and CANCELED) that the monitoring daemon never actually resumes, and the new early-return precisely mirrors the status check already performed elsewhere in the execute_run finally block. The additional get_run_by_id call is well within acceptable overhead, the existing module-level import pattern is followed, and the two new unit tests exercise both affected branches end-to-end using the real instance fixture. No files require special attention.
|
| Filename | Overview |
|---|---|
| python_modules/dagster/dagster/_core/instance/methods/run_launcher_methods.py | Adds CANCELING/CANCELED status guard to run_will_resume so user-initiated cancellations are not treated as crash-recovery scenarios, correctly forwarding SIGTERM to in-flight steps. |
| python_modules/dagster/dagster_tests/daemon_tests/test_monitoring_daemon.py | Adds two focused unit tests covering the CANCELING and CANCELED status branches of the new guard in run_will_resume; both tests correctly use the existing instance fixture with max_resume_run_attempts=3. |
Sequence Diagram
sequenceDiagram
participant UI as User (UI)
participant Daemon as Monitoring Daemon
participant Executor as Step-Delegating Executor
participant Steps as In-flight Steps
UI->>Daemon: Cancel run request
Daemon->>Daemon: Set run status → CANCELING
Daemon->>Executor: Send SIGTERM to pod
Executor->>Executor: check_for_interrupts() → True
Executor->>Executor: run_will_resume(run_id)
Note over Executor: get_run_by_id → status=CANCELING
Executor-->>Executor: return False (was: True before fix)
alt Before fix (status check missing)
Executor->>Executor: Log not forwarding, run will be resumed
Steps->>Steps: Continue running indefinitely
else "After fix status == CANCELING → False"
Executor->>Steps: Forward SIGTERM (terminate_step)
Steps->>Steps: Terminate cleanly
end
Reviews (1): Last reviewed commit: "Forward termination to steps when a user..." | Re-trigger Greptile
|
@gibsondan mind taking a look when you get a chance? It's in the run-monitoring resume path — |
Summary & Motivation
Fixes #33923.
When run monitoring is enabled with
maxResumeRunAttempts > 0, cancelling a run from the UI left the running step going. The logs showed:The step-delegating executor decides whether to forward a termination signal by calling
run_will_resume, but that method only looked at whether monitoring was enabled and whether resume attempts were left — never the run status. So a user cancel was indistinguishable from a pod crash, and the signal got swallowed.A cancel puts the run into
CANCELING(orCANCELEDon force-terminate) before the daemon sends SIGTERM, andmonitor_started_runonly ever resumes runs still inSTARTED. So those statuses were never actually resumable.run_will_resumenow returnsFalsefor them, which is the same status check_resume_from_failure/theexecute_runfinally block already does before falling back torun_will_resume.Test Plan
Added two unit tests in
test_monitoring_daemon.py:Full
test_monitoring_daemon.pypasses (10 tests);ruffclean.