feat: Introduce lag emergency option for cost-based autoscaler by Fly-Style · Pull Request #19655 · apache/druid

Fly-Style · 2026-07-03T12:12:18Z

Summary

Adds a two-tier "critical lag" fast path to the cost-based Kafka/Kinesis supervisor autoscaler, so an operator-configured SLA threshold can trigger a faster scale-up reaction than the normal cost-minimization loop provides.

CostBasedAutoScalerConfig gains an optional criticalLagThreshold (aggregate/global lag across all partitions, not per-partition). When set, CostMetrics#getAggregateLag() is compared against it on every evaluation:

Tier 1 (≥75% of threshold): the lag-amplification multiplier used in the cost function's lag-recovery-time term maxes out at 6.0 (vs. the default 0.3), and the scale-up candidate search bypasses its usual boundary cap (useTaskCountBoundariesOnScaleUp), letting the argmin search the full candidate range instead of being capped to +2 steps above the current task count.
Tier 2 (≥95% of threshold): cost minimization is skipped entirely — the autoscaler jumps straight to the maximum valid task count, regardless of configured weights or idle ratio.

criticalLagThreshold is null by default, which leaves existing behavior unchanged. The 75%/95% fractions and the 6.0 multiplier are fixed algorithm constants (WeightedCostFunction.CRITICAL_LAG_TIER1_FRACTION, CRITICAL_LAG_TIER2_FRACTION, CRITICAL_LAG_AMPLIFICATION_MULTIPLIER), not separately configurable.
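The tier boundaries described above can be sketched as a small classifier. This is an illustrative sketch only, not the PR's code: the class name `CriticalLagTiers`, the `Tier` enum, and the `classify` method are hypothetical, and treating a non-positive threshold as "disabled" is an assumption. Only the 75%/95% fractions and the null-by-default behavior come from the description.

```java
// Hypothetical helper; names are illustrative and do not appear in the PR.
final class CriticalLagTiers {
    // Fractions as stated in the PR description (fixed, not configurable).
    static final double TIER1_FRACTION = 0.75;
    static final double TIER2_FRACTION = 0.95;

    enum Tier { NONE, TIER1, TIER2 }

    /**
     * Classifies aggregate (global) lag against the optional threshold.
     * A null threshold models the default, which leaves behavior unchanged.
     */
    static Tier classify(long aggregateLag, Long criticalLagThreshold) {
        // Assumption: a non-positive threshold is treated like "not set".
        if (criticalLagThreshold == null || criticalLagThreshold <= 0) {
            return Tier.NONE;
        }
        double ratio = (double) aggregateLag / criticalLagThreshold;
        if (ratio >= TIER2_FRACTION) {
            return Tier.TIER2; // skip cost minimization, jump to the max task count
        }
        if (ratio >= TIER1_FRACTION) {
            return Tier.TIER1; // amplified lag term, uncapped scale-up search
        }
        return Tier.NONE;
    }
}
```

With the figures used in the Details section (aggregateLag=10M, criticalLagThreshold=11.5M), the ratio is roughly 0.87, so this sketch lands in tier 1.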

Details

The plot shows how the tier-1 fast path (amp=6.0) responds across a range of operating conditions, with partitionCount=500, aggregateLag=10M (above the 75% tier-1 mark of 8.625M but below the 95% tier-2 mark of 10.925M, given criticalLagThreshold=11.5M), and currentTaskCount=50 held fixed:

X-axis — lagWeight/idleWeight pairs (the two cost-function weights, always summing to 1.0), swept from lag-averse (0.1/0.9) to lag-aggressive (0.9/0.1).
Y-axis — the resulting optimal task count chosen by the argmin search for each weight pair.
Lines — one per (idle, rate) combination: color encodes current idle ratio (red=0.1, green=0.25), line style encodes processing rate (solid=50k/s, dashed=80k/s, dotted=100k/s).

The takeaway: even with lag held constant, the tier-1 boost isn't a flat override. Task count still climbs smoothly as lagWeight increases, but low-idle/low-rate combinations (red solid) saturate at the task-count ceiling far earlier (~0.6/0.4) than high-idle/high-rate combinations (green dotted, ~0.9/0.1), because the latter still have slack that the idle-cost term can trade off against. This demonstrates that tier 1 amplifies urgency rather than short-circuiting the optimization the way tier 2 does.

This PR has:

been self-reviewed.

FrankChen021

Severity	Findings
P0	0
P1	1
P2	0
P3	0
Total	1

Reviewed 6 of 6 changed files.

Found 1 issue: tier-2 emergency lag can still keep the current task count instead of forcing the configured maximum.

This is an automated review by Codex GPT-5.5

FrankChen021 · 2026-07-03T16:01:13Z


+    // Emergency (tier 2) lag skips the argmin search entirely: evaluate only the maximum valid task count.
+    if (emergencyLag) {
+      startIndex = validTaskCounts.length - 1;


[P1] Make emergency lag choose the max unconditionally

The emergency path narrows the loop to only the maximum valid task count, but optimalTaskCount and optimalCost are still initialized from the current task count before this block. If the configured weights make the current-count cost lower than the max-count cost, for example lagWeight=0 and idleWeight=1, the loop evaluates the max candidate but never updates optimalTaskCount, so tier-2 emergency lag returns the current count instead of jumping to the max. Set the emergency result to the max candidate unconditionally, or initialize the optimal candidate from startIndex after the emergency bounds are applied.
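The failure mode described in this comment can be reproduced in isolation. The following is a minimal sketch under stated assumptions, not the PR's actual method: the class and method names are hypothetical, the cost function is passed in as a plain `IntToDoubleFunction`, and `validTaskCounts` is assumed to be non-empty and sorted ascending. Only the seeding of the optimum from the current task count and the `startIndex = validTaskCounts.length - 1` narrowing are taken from the diff and the review text.

```java
import java.util.function.IntToDoubleFunction;

// Hypothetical reproduction; names are illustrative and do not appear in the PR.
final class EmergencyScaleUpSketch {

    /** Behavior as described in the review: the optimum is seeded from the current count. */
    static int pickSeededFromCurrent(int[] validTaskCounts, int currentTaskCount,
                                     boolean emergencyLag, IntToDoubleFunction cost) {
        int optimalTaskCount = currentTaskCount;
        double optimalCost = cost.applyAsDouble(currentTaskCount);
        // Emergency (tier 2) lag narrows the loop to the maximum valid task count only.
        int startIndex = emergencyLag ? validTaskCounts.length - 1 : 0;
        for (int i = startIndex; i < validTaskCounts.length; i++) {
            double candidateCost = cost.applyAsDouble(validTaskCounts[i]);
            if (candidateCost < optimalCost) {
                optimalCost = candidateCost;
                optimalTaskCount = validTaskCounts[i];
            }
        }
        // If the max candidate costs more than the current count, the current count survives.
        return optimalTaskCount;
    }

    /** One of the two suggested fixes: return the max unconditionally under emergency lag. */
    static int pickWithUnconditionalMax(int[] validTaskCounts, int currentTaskCount,
                                        boolean emergencyLag, IntToDoubleFunction cost) {
        if (emergencyLag) {
            return validTaskCounts[validTaskCounts.length - 1];
        }
        return pickSeededFromCurrent(validTaskCounts, currentTaskCount, false, cost);
    }
}
```

With a stand-in cost that only grows with task count (a rough model of lagWeight=0, idleWeight=1), the seeded version keeps the current count under emergency lag, while the unconditional version returns the maximum.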

Introduce lag emergency

f402975

github-actions Bot added the Area - Ingestion label Jul 3, 2026

Fly-Style changed the title ~~Introduce lag emergency~~ feat: Introduce lag emergency option for cost-based autoscaler Jul 3, 2026

Checkstyle

d555233

FrankChen021 reviewed Jul 3, 2026

Fly-Style wants to merge 2 commits into apache:master from Fly-Style:cba-lag-emergency
