feat: Introduce lag emergency option for cost-based autoscaler #19655

Open
Fly-Style wants to merge 2 commits into apache:master from Fly-Style:cba-lag-emergency

@Fly-Style Fly-Style commented Jul 3, 2026

Contributor

Summary

Adds a two-tier "critical lag" fast path to the cost-based Kafka/Kinesis supervisor autoscaler, so an operator-configured SLA threshold can trigger a faster scale-up reaction than the normal cost-minimization loop provides.

CostBasedAutoScalerConfig gains an optional criticalLagThreshold (aggregate/global lag across all partitions, not per-partition). When set, CostMetrics#getAggregateLag() is compared against it on every evaluation:

  • Tier 1 (≥75% of threshold): the lag-amplification multiplier used in the cost function's lag-recovery-time term maxes out at 6.0 (vs. the default 0.3), and the scale-up candidate search bypasses its usual boundary cap (useTaskCountBoundariesOnScaleUp), letting the argmin search the full candidate range instead of being capped to +2 steps above the current task count.
  • Tier 2 (≥95% of threshold): cost minimization is skipped entirely — the autoscaler jumps straight to the maximum valid task count, regardless of configured weights or idle ratio.

criticalLagThreshold is null by default, which leaves existing behavior unchanged. The 75%/95% fractions and the 6.0 multiplier are fixed algorithm constants (WeightedCostFunction.CRITICAL_LAG_TIER1_FRACTION, CRITICAL_LAG_TIER2_FRACTION, CRITICAL_LAG_AMPLIFICATION_MULTIPLIER), not separately configurable.
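
The tier boundaries described above can be sketched as a small classifier. This is an illustrative sketch only, not the PR's code: the class `CriticalLagTiers`, the `Tier` enum, and the `classify` method are hypothetical names, and treating a non-positive threshold as "disabled" is an assumption. Only the 0.75 and 0.95 fractions and the null default come from the description.

```java
// Hypothetical sketch of the two-tier check; names are illustrative, not from the PR.
class CriticalLagTiers {
    // Fractions quoted in the PR description.
    static final double TIER1_FRACTION = 0.75;
    static final double TIER2_FRACTION = 0.95;

    enum Tier { NONE, TIER1, TIER2 }

    // aggregateLag is the global lag across all partitions.
    // A null threshold models the default (feature off).
    static Tier classify(long aggregateLag, Long criticalLagThreshold) {
        // Assumption: a non-positive threshold is treated the same as "not set".
        if (criticalLagThreshold == null || criticalLagThreshold <= 0) {
            return Tier.NONE;
        }
        double ratio = (double) aggregateLag / criticalLagThreshold;
        if (ratio >= TIER2_FRACTION) {
            return Tier.TIER2; // skip cost minimization, jump to max task count
        }
        if (ratio >= TIER1_FRACTION) {
            return Tier.TIER1; // amplified lag term, unbounded scale-up search
        }
        return Tier.NONE;
    }

    public static void main(String[] args) {
        // 10M lag against an 11.5M threshold is about 87% of the threshold.
        System.out.println(classify(10_000_000L, 11_500_000L));
        // 11M lag against the same threshold is about 96%.
        System.out.println(classify(11_000_000L, 11_500_000L));
        // No threshold configured.
        System.out.println(classify(11_000_000L, null));
    }
}
```

With an 11.5M threshold, the tier-1 mark falls at 8.625M and the tier-2 mark at 10.925M, so an aggregate lag of 10M lands in tier 1.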

Details

The plot shows how the tier-1 fast path (amp=6.0) responds across a range of operating conditions, with partitionCount=500, aggregateLag=10M (above the 75% tier-1 mark but below the 95% tier-2 mark, given criticalLagThreshold=11.5M), and currentTaskCount=50 held fixed:

  • X-axis: lagWeight/idleWeight pairs (the two cost-function weights, always summing to 1.0), swept from lag-averse (0.1/0.9) to lag-aggressive (0.9/0.1).
  • Y-axis: the resulting optimal task count chosen by the argmin search for each weight pair.
  • Lines: one per (idle, rate) combination. Color encodes the current idle ratio (red=0.1, green=0.25); line style encodes the processing rate (solid=50k/s, dashed=80k/s, dotted=100k/s).

The takeaway: even holding lag constant, the tier-1 boost isn't a flat override — task count still climbs smoothly as lagWeight increases, but low-idle/low-rate combinations (red solid) saturate at the task-count ceiling far earlier (~0.6/0.4) than high-idle/high-rate combinations (green dotted, ~0.9/0.1), because those still have slack the idle-cost term can trade off against. This demonstrates tier 1 amplifies urgency rather than short-circuiting the optimization the way tier 2 does.

[image: optimal task count vs. lagWeight/idleWeight pair under the tier-1 fast path]

This PR has:

  • been self-reviewed.

@Fly-Style Fly-Style changed the title from "Introduce lag emergency" to "feat: Introduce lag emergency option for cost-based autoscaler" Jul 3, 2026

@FrankChen021 FrankChen021 left a comment

Member

| Severity | Findings |
| --- | --- |
| P0 | 0 |
| P1 | 1 |
| P2 | 0 |
| P3 | 0 |
| Total | 1 |

Reviewed 6 of 6 changed files.

Found 1 issue: tier-2 emergency lag can still keep the current task count instead of forcing the configured maximum.


This is an automated review by Codex GPT-5.5


// Emergency (tier 2) lag skips the argmin search entirely: evaluate only the maximum valid task count.
if (emergencyLag) {
startIndex = validTaskCounts.length - 1;

Member

[P1] Make emergency lag choose the max unconditionally

The emergency path narrows the loop to only the maximum valid task count, but optimalTaskCount and optimalCost are still initialized from the current task count before this block. If the configured weights make the current-count cost lower than the max-count cost, for example lagWeight=0 and idleWeight=1, the loop evaluates the max candidate but never updates optimalTaskCount, so tier-2 emergency lag returns the current count instead of jumping to the max. Set the emergency result to the max candidate unconditionally, or initialize the optimal candidate from startIndex after the emergency bounds are applied.
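
A minimal sketch of the failure mode and one of the suggested fixes follows. It is a simplified stand-in, not the PR's code: the class `EmergencySelection`, both method names, and the pluggable `cost` function are hypothetical, the non-emergency search is reduced to a full scan from index 0, and `validTaskCounts` is assumed to be sorted ascending so that the last element is the maximum.

```java
import java.util.function.IntToDoubleFunction;

// Hypothetical reproduction of the seeding issue; names are illustrative only.
class EmergencySelection {
    // Pattern described in the review: the optimum is seeded from the current
    // task count, then the emergency path narrows the loop to the last index.
    static int pickSeededFromCurrent(int[] validTaskCounts, int currentTaskCount,
                                     boolean emergencyLag, IntToDoubleFunction cost) {
        int optimalTaskCount = currentTaskCount;
        double optimalCost = cost.applyAsDouble(currentTaskCount);
        int startIndex = emergencyLag ? validTaskCounts.length - 1 : 0;
        for (int i = startIndex; i < validTaskCounts.length; i++) {
            double candidateCost = cost.applyAsDouble(validTaskCounts[i]);
            // The max candidate only wins if it is strictly cheaper than the seed.
            if (candidateCost < optimalCost) {
                optimalCost = candidateCost;
                optimalTaskCount = validTaskCounts[i];
            }
        }
        return optimalTaskCount;
    }

    // One suggested fix: return the max candidate unconditionally in an emergency.
    static int pickWithUnconditionalMax(int[] validTaskCounts, int currentTaskCount,
                                        boolean emergencyLag, IntToDoubleFunction cost) {
        if (emergencyLag) {
            return validTaskCounts[validTaskCounts.length - 1];
        }
        return pickSeededFromCurrent(validTaskCounts, currentTaskCount, false, cost);
    }

    public static void main(String[] args) {
        int[] valid = {10, 25, 50, 100};
        // Stand-in for lagWeight=0, idleWeight=1: cost grows with task count.
        IntToDoubleFunction idleOnlyCost = n -> n;
        System.out.println(pickSeededFromCurrent(valid, 50, true, idleOnlyCost));    // prints 50
        System.out.println(pickWithUnconditionalMax(valid, 50, true, idleOnlyCost)); // prints 100
    }
}
```

Under the idle-only cost, the seeded version stays at the current count of 50 because the max candidate is never strictly cheaper, while the unconditional version returns 100 regardless of the weights.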
