
Filter worker heartbeats by per-topology timeout #8800

Open
mwkang wants to merge 1 commit into apache:master from mwkang:8799-report-hb-topology-timeout


mwkang commented Jun 22, 2026

Contributor

What is the purpose of the change

ReportWorkerHeartbeats filtered stale heartbeats using only the global supervisor.worker.timeout.secs, but the kill authority, Slot.getHbTimeoutMs(), uses max(topology.worker.timeout.secs, supervisor.worker.timeout.secs). As a result, a slow-but-alive worker of a topology with a longer timeout was dropped from the report one round before Slot would treat it as dead.

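This change computes the effective timeout per topology to match Slot: max(global, min(override, worker.max.timeout.secs)), where global is supervisor.worker.timeout.secs and override is topology.worker.timeout.secs. It falls back to the global timeout when the topology conf can't be read.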
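
For readers who want the arithmetic spelled out, here is a minimal sketch of the formula above. It is not the code in this patch: the class name `EffectiveHbTimeout`, the method `effectiveTimeoutSecs`, and the sample values are hypothetical, and a `null` override stands in for "no per-topology value available".

```java
/**
 * Illustrative sketch only. The class and method names are hypothetical
 * and are not the identifiers used in the patch.
 */
public class EffectiveHbTimeout {

    /**
     * Returns max(global, min(override, max)).
     *
     * @param globalSecs   stands in for supervisor.worker.timeout.secs
     * @param overrideSecs stands in for topology.worker.timeout.secs, or null if unavailable
     * @param maxSecs      stands in for worker.max.timeout.secs
     */
    public static int effectiveTimeoutSecs(int globalSecs, Integer overrideSecs, int maxSecs) {
        if (overrideSecs == null) {
            // No per-topology value: use the global timeout unchanged.
            return globalSecs;
        }
        // The cap applies to the override only, so it can never pull the
        // result below the global timeout.
        int cappedOverride = Math.min(overrideSecs, maxSecs);
        return Math.max(globalSecs, cappedOverride);
    }

    public static void main(String[] args) {
        // Sample values chosen for illustration, not taken from any config.
        System.out.println(effectiveTimeoutSecs(30, 120, 600));  // longer override wins
        System.out.println(effectiveTimeoutSecs(30, 900, 600));  // override capped at max
        System.out.println(effectiveTimeoutSecs(30, 10, 600));   // global acts as a floor
        System.out.println(effectiveTimeoutSecs(30, null, 600)); // fallback to global
    }
}
```

Applying the cap inside the `min` rather than around the whole expression is what keeps a global timeout larger than the cap intact.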

How was this patch tested

Added unit tests for the per-topology override, the worker.max.timeout.secs cap, and the conf-read-failure fallback. storm-server build and checkstyle pass.

Closes #8799

ReportWorkerHeartbeats dropped stale local heartbeats using only the
global supervisor.worker.timeout.secs. The worker-kill authority,
Slot.getHbTimeoutMs(), instead uses
max(topology.worker.timeout.secs, supervisor.worker.timeout.secs).
A slow-but-alive worker of a topology that raised its own timeout was
therefore excluded from the report one round before Slot would treat
it as dead.

Compute the effective timeout per topology to mirror Slot: read the
localized topology conf and use max(global, min(override, max)). The
topology override is bounded by worker.max.timeout.secs: Nimbus
already clamps it at submission, and the cap is re-applied here
defensively, to the override component only, so the global floor is
never reduced. When the topology conf cannot be read (the orphaned
worker dirs this filter targets often outlive their conf), fall back
to the global timeout. The per-topology result is cached within a
reporting round.

Add regression tests for a longer per-topology override, the
worker.max.timeout.secs cap, and the conf-read-failure fallback.
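
The per-round caching and the conf-read fallback described in the commit message can be sketched roughly as follows. This is an assumption-laden illustration, not the patch itself: `RoundTimeoutCache`, `ConfReader`, and `timeoutSecsFor` are hypothetical names, and the reader interface is a stand-in for however the localized topology conf is actually loaded. One cache instance is assumed to live for a single reporting round.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; every name in this class is hypothetical. */
public class RoundTimeoutCache {

    /** Hypothetical stand-in for reading the override from a localized topology conf. */
    public interface ConfReader {
        /** Returns the per-topology override in seconds, or null if it is unset. */
        Integer readOverrideSecs(String topologyId) throws IOException;
    }

    private final Map<String, Integer> cache = new HashMap<>();
    private final int globalSecs;
    private final int maxSecs;
    private final ConfReader reader;

    public RoundTimeoutCache(int globalSecs, int maxSecs, ConfReader reader) {
        this.globalSecs = globalSecs;
        this.maxSecs = maxSecs;
        this.reader = reader;
    }

    /** Computes the timeout once per topology id and reuses it afterwards. */
    public int timeoutSecsFor(String topologyId) {
        return cache.computeIfAbsent(topologyId, id -> {
            try {
                Integer override = reader.readOverrideSecs(id);
                if (override == null) {
                    return globalSecs;
                }
                // Cap the override only, then apply the global floor.
                return Math.max(globalSecs, Math.min(override, maxSecs));
            } catch (IOException e) {
                // Conf unreadable (for example an orphaned worker dir):
                // fall back to the global timeout.
                return globalSecs;
            }
        });
    }
}
```

In this sketch the fallback value is cached as well, so an unreadable conf is attempted only once per round; whether the patch behaves the same way is not stated in the commit message.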

Development

Successfully merging this pull request may close these issues.

ReportWorkerHeartbeats stale-heartbeat filter ignores per-topology topology.worker.timeout.secs
