Forward termination to steps when a user cancels a resumable run #33925

Open
DRACULA1729 wants to merge 1 commit into dagster-io:master from DRACULA1729:fix/forward-termination-on-user-cancel-with-resume

@DRACULA1729

Contributor

Summary & Motivation

Fixes #33923.

When run monitoring is enabled with maxResumeRunAttempts > 0, cancelling a run from the UI left the running step going. The logs showed:

Executor received termination signal, not forwarding to steps because run will be resumed

The step-delegating executor decides whether to forward a termination signal by calling run_will_resume, but that method only looked at whether monitoring was enabled and whether resume attempts were left — never the run status. So a user cancel was indistinguishable from a pod crash, and the signal got swallowed.
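
The forwarding decision described above can be sketched roughly as follows. This is a simplified stand-in and not the executor's actual code: `handle_termination` and the `terminate_step` callback are hypothetical names chosen for illustration, and the only detail taken from the report is the "not forwarding" log message.

```python
from typing import Callable, Iterable, List


def handle_termination(
    run_will_resume: bool,
    active_steps: Iterable[str],
    terminate_step: Callable[[str], None],
) -> List[str]:
    """Hypothetical stand-in for the executor's interrupt handling.

    Returns the keys of the steps that were asked to terminate.
    """
    if run_will_resume:
        # A resume is expected, so in-flight steps are left running.
        print(
            "Executor received termination signal, not forwarding to steps "
            "because run will be resumed"
        )
        return []

    # No resume expected: forward the termination to every in-flight step.
    terminated = []
    for step_key in active_steps:
        terminate_step(step_key)
        terminated.append(step_key)
    return terminated
```

With `run_will_resume=True` no step is touched, which is the branch a user cancel was wrongly falling into.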

A cancel puts the run into CANCELING (or CANCELED on force-terminate) before the daemon sends SIGTERM, and monitor_started_run only ever resumes runs still in STARTED. So those statuses were never actually resumable. run_will_resume now returns False for them, which is the same status check _resume_from_failure/the execute_run finally block already does before falling back to run_will_resume.
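
A minimal sketch of the guard, assuming a pure function in place of the real instance method. The `RunStatus` enum here is a cut-down stand-in, `run_will_resume_sketch` and its parameters are hypothetical, and both the order of the checks and the way attempts are counted are assumptions rather than the actual diff.

```python
from enum import Enum


class RunStatus(Enum):
    # Cut-down stand-in for the run status enum; only the values used here.
    STARTED = "STARTED"
    CANCELING = "CANCELING"
    CANCELED = "CANCELED"


def run_will_resume_sketch(
    status: RunStatus,
    monitoring_enabled: bool,
    max_resume_attempts: int,
    attempts_used: int,
) -> bool:
    """Illustrative version of the resumability check with the status guard."""
    if not monitoring_enabled:
        return False

    # New guard: a run that is being cancelled (or already is) never resumes.
    if status in (RunStatus.CANCELING, RunStatus.CANCELED):
        return False

    # Previous behaviour: resumable while attempts remain (assumed counting).
    return attempts_used < max_resume_attempts


# Mirrors the cases listed in the test plan.
assert run_will_resume_sketch(RunStatus.STARTED, True, 3, 0)
assert not run_will_resume_sketch(RunStatus.CANCELING, True, 3, 0)
assert not run_will_resume_sketch(RunStatus.CANCELED, True, 3, 0)
```

Keeping the guard inside the resumability check, rather than in the executor, means every caller gets the same answer for a cancelled run.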

Test Plan

Added two unit tests in test_monitoring_daemon.py:

  • a STARTED run with attempts remaining is resumable, but stops being resumable once it transitions to CANCELING
  • a CANCELED run is not resumable

Full test_monitoring_daemon.py passes (10 tests); ruff clean.

With run monitoring set up and maxResumeRunAttempts > 0, the step-delegating
executor treated every termination signal as a pending resume and skipped
forwarding it to in-flight steps. A user-initiated cancellation got swallowed
the same way a pod crash would, so the running step never stopped.

run_will_resume now returns False once a run is CANCELING or CANCELED. The
daemon only ever resumes runs that are still in STARTED, so those two statuses
were never actually resumable in the first place.

Fixes dagster-io#33923.
@greptile-apps

greptile-apps Bot commented Jun 12, 2026

Contributor

Greptile Summary

This PR fixes a bug where cancelling a resumable run from the UI left in-flight steps running indefinitely. The root cause was that run_will_resume only checked monitoring enablement and remaining attempt count, making a user-initiated cancel indistinguishable from a pod crash, so the step-delegating executor suppressed SIGTERM forwarding.

  • run_will_resume now fetches the run record and returns False for CANCELING and CANCELED statuses, since the monitoring daemon only ever resumes runs in STARTED status — these two states can never be picked up for resumption.
  • Two unit tests are added to test_monitoring_daemon.py verifying that a STARTED run with attempts remaining is resumable, that it stops being resumable once it transitions to CANCELING, and that a CANCELED run is not resumable.

Confidence Score: 5/5

Safe to merge — the change is a narrow, targeted guard in run_will_resume with no side effects on existing crash-recovery behaviour.

The fix correctly identifies the two statuses (CANCELING and CANCELED) that the monitoring daemon never actually resumes, and the new early-return precisely mirrors the status check already performed elsewhere in the execute_run finally block. The additional get_run_by_id call is well within acceptable overhead, the existing module-level import pattern is followed, and the two new unit tests exercise both affected branches end-to-end using the real instance fixture.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| python_modules/dagster/dagster/_core/instance/methods/run_launcher_methods.py | Adds CANCELING/CANCELED status guard to run_will_resume so user-initiated cancellations are not treated as crash-recovery scenarios, correctly forwarding SIGTERM to in-flight steps. |
| python_modules/dagster/dagster_tests/daemon_tests/test_monitoring_daemon.py | Adds two focused unit tests covering the CANCELING and CANCELED status branches of the new guard in run_will_resume; both tests correctly use the existing instance fixture with max_resume_run_attempts=3. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant UI as User (UI)
    participant Daemon as Monitoring Daemon
    participant Executor as Step-Delegating Executor
    participant Steps as In-flight Steps

    UI->>Daemon: Cancel run request
    Daemon->>Daemon: Set run status → CANCELING
    Daemon->>Executor: Send SIGTERM to pod

    Executor->>Executor: check_for_interrupts() → True
    Executor->>Executor: run_will_resume(run_id)
    Note over Executor: get_run_by_id → status=CANCELING
    Executor-->>Executor: return False (was: True before fix)

    alt Before fix (status check missing)
        Executor->>Executor: Log not forwarding, run will be resumed
        Steps->>Steps: Continue running indefinitely
    else "After fix status == CANCELING → False"
        Executor->>Steps: Forward SIGTERM (terminate_step)
        Steps->>Steps: Terminate cleanly
    end
```

Reviews (1): Last reviewed commit: "Forward termination to steps when a user..."

@DRACULA1729

Contributor Author

@gibsondan mind taking a look when you get a chance? It's in the run-monitoring resume path — run_will_resume wasn't checking run status, so a user cancel got treated like a crash and the step kept running. Small fix, tests included.

Development

Successfully merging this pull request may close these issues.

Steps not terminating for user initiated job cancellation when maxResumeRunAttempts > 0

1 participant