Skip to content

Fix double max.free.per.topology cap in EvenScheduler idle rebalance#8798

Open
mwkang wants to merge 1 commit into
apache:masterfrom
mwkang:8797-even-rebalance-double-cap
Open

Fix double max.free.per.topology cap in EvenScheduler idle rebalance#8798
mwkang wants to merge 1 commit into
apache:masterfrom
mwkang:8797-even-rebalance-double-cap

Conversation

@mwkang

@mwkang mwkang commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

What this PR does

DefaultScheduler.defaultSchedule runs the idle-supervisor redistribute pass twice per scheduling round — once over the full topology set (DefaultScheduler.java:75), then again inside the per-leftover EvenScheduler.scheduleTopologiesEvenly call (EvenScheduler.java:336).
So the per-topology nimbus.even.rebalance.max.free.per.topology cap is applied twice, and an under-assigned topology can relocate up to 2 × cap workers in a single round.

The fix adds a redistributeOntoIdle toggle to scheduleTopologiesEvenly:

  • the public 2-arg entry point keeps redistribute on, so using EvenScheduler directly is unchanged;
  • DefaultScheduler runs the single full-set redistribute itself and delegates per leftover topology with the toggle off.

The cap now binds once per round, while the full-set round-robin fairness that lets multiple topologies share idle slots is preserved (simply dropping the full-set call would break it).

Follow-up to #8778. The feature is off by default, so there is no behavior change unless nimbus.even.rebalance.idle.supervisor.enabled=true.

Closes #8797.

How it was tested

Added defaultSchedulerAppliesMaxFreeCapOncePerRound: the DefaultScheduler path with two idle supervisors, max.free.per.topology=1, and an under-assigned topology — it relocates two workers before the fix and exactly one after. Existing TestEvenSchedulerIdleSupervisor and the other scheduler tests still pass.

DefaultScheduler.defaultSchedule ran the idle-supervisor redistribute
twice per scheduling round: once over the full topology set at the top
(the round-robin fairness pass), then again inside the per-leftover
EvenScheduler.scheduleTopologiesEvenly call, which also redistributes on
entry. The full-set pass consumes one idle supervisor, leaving any other
idle supervisor for the per-topology pass to fill again, so the
per-topology max.free.per.topology cap was applied twice. With the
feature enabled, a positive cap and two or more idle supervisors, an
under-assigned topology could relocate up to 2 * cap existing workers in
a single round -- throttle-exceeding churn, not a crash or loss. Scoped
to the DefaultScheduler (and IsolationScheduler leftover) path;
EvenScheduler used directly schedules each topology once and is
unaffected.

The redistribute was added at both entry points by apache#8778
without a re-entry guard. Add a redistributeOntoIdle toggle to
scheduleTopologiesEvenly: the public two-arg entry point keeps
redistribute=true (EvenScheduler's own single full-set pass is
unchanged), while DefaultScheduler runs the single full-set redistribute
itself and delegates per leftover topology with the toggle off. The cap
then binds once per round and the full-set round-robin fairness across
topologies is preserved -- simply dropping the first call would break
that fairness.

The redistribute pass is gated off by default, so there is no behavior
change for existing clusters.

Add a regression test exercising the DefaultScheduler path with two idle
supervisors, max.free.per.topology=1 and an under-assigned topology:
before the fix it relocates two workers (one onto each idle supervisor),
after the fix exactly one.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

EvenScheduler idle rebalance: max.free.per.topology cap applied twice per round in DefaultScheduler path (follow-up to #8778)

1 participant