feat: Introduce lag emergency option for cost-based autoscaler #19655
Fly-Style wants to merge 2 commits into
FrankChen021 left a comment
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 1 |
| P2 | 0 |
| P3 | 0 |
| Total | 1 |
Reviewed 6 of 6 changed files.
Found 1 issue: tier-2 emergency lag can still keep the current task count instead of forcing the configured maximum.
This is an automated review by Codex GPT-5.5

    // Emergency (tier 2) lag skips the argmin search entirely: evaluate only the maximum valid task count.
    if (emergencyLag) {
      startIndex = validTaskCounts.length - 1;

[P1] Make emergency lag choose the max unconditionally
The emergency path narrows the loop to only the maximum valid task count, but optimalTaskCount and optimalCost are still initialized from the current task count before this block. If the configured weights make the current-count cost lower than the max-count cost, for example lagWeight=0 and idleWeight=1, the loop evaluates the max candidate but never updates optimalTaskCount, so tier-2 emergency lag returns the current count instead of jumping to the max. Set the emergency result to the max candidate unconditionally, or initialize the optimal candidate from startIndex after the emergency bounds are applied.
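
The failure mode described in this comment can be reproduced in isolation. The following is a minimal sketch, not the code from this PR: the class name, both method names, and the toy cost function are hypothetical, and it assumes `validTaskCounts` is non-empty and sorted in ascending order. The first method seeds the optimum from the current task count, as the comment describes; the second applies the first suggested fix and returns the maximum candidate unconditionally.

```java
import java.util.function.IntToDoubleFunction;

/** Hypothetical sketch of the issue in this review comment; not the PR's actual code. */
final class EmergencyLagSelectionSketch
{
  /** Seeds the optimum from the current task count, then scans candidates from startIndex. */
  static int pickSeededFromCurrent(
      int[] validTaskCounts,
      int currentTaskCount,
      boolean emergencyLag,
      IntToDoubleFunction cost
  )
  {
    int optimalTaskCount = currentTaskCount;
    double optimalCost = cost.applyAsDouble(currentTaskCount);

    // Emergency lag narrows the scan to the last (maximum) candidate only.
    int startIndex = emergencyLag ? validTaskCounts.length - 1 : 0;

    for (int i = startIndex; i < validTaskCounts.length; i++) {
      double candidateCost = cost.applyAsDouble(validTaskCounts[i]);
      // The max candidate wins only if it is strictly cheaper than the current count.
      if (candidateCost < optimalCost) {
        optimalCost = candidateCost;
        optimalTaskCount = validTaskCounts[i];
      }
    }
    return optimalTaskCount;
  }

  /** Suggested fix: in an emergency, return the maximum candidate without comparing costs. */
  static int pickWithEmergencyOverride(
      int[] validTaskCounts,
      int currentTaskCount,
      boolean emergencyLag,
      IntToDoubleFunction cost
  )
  {
    if (emergencyLag) {
      return validTaskCounts[validTaskCounts.length - 1];
    }
    return pickSeededFromCurrent(validTaskCounts, currentTaskCount, false, cost);
  }

  public static void main(String[] args)
  {
    int[] validTaskCounts = {10, 20, 40};
    // Toy stand-in for lagWeight=0, idleWeight=1: more tasks always cost more.
    IntToDoubleFunction idleOnlyCost = taskCount -> (double) taskCount;

    System.out.println(pickSeededFromCurrent(validTaskCounts, 10, true, idleOnlyCost));
    System.out.println(pickWithEmergencyOverride(validTaskCounts, 10, true, idleOnlyCost));
  }
}
```

With the idle-only toy cost, the seeded variant stays at the current count of 10 even under emergency lag, while the override variant returns 40. The non-emergency path is identical in both methods, so the change is confined to tier 2.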
Summary
Adds a two-tier "critical lag" fast path to the cost-based Kafka/Kinesis supervisor autoscaler, so an operator-configured SLA threshold can trigger a faster scale-up reaction than the normal cost-minimization loop provides.
`CostBasedAutoScalerConfig` gains an optional `criticalLagThreshold` (aggregate/global lag across all partitions, not per-partition). When set, `CostMetrics#getAggregateLag()` is compared against it on every evaluation:
- Tier 1 (aggregate lag above 75% of the threshold): the lag amplification multiplier is raised to 6.0 (vs. the default 0.3), and the scale-up candidate search bypasses its usual boundary cap (`useTaskCountBoundariesOnScaleUp`), letting the argmin search the full candidate range instead of being capped to +2 steps above the current task count.
- Tier 2 (aggregate lag above 95% of the threshold): the argmin search is skipped entirely and only the maximum valid task count is evaluated.

`criticalLagThreshold` is `null` by default, which leaves existing behavior unchanged. The 75%/95% fractions and the 6.0 multiplier are fixed algorithm constants (`WeightedCostFunction.CRITICAL_LAG_TIER1_FRACTION`, `CRITICAL_LAG_TIER2_FRACTION`, `CRITICAL_LAG_AMPLIFICATION_MULTIPLIER`), not separately configurable.
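
The tier selection described above can be sketched as a small classifier. This is an illustration under stated assumptions, not the PR's implementation: the class, enum, and method names are hypothetical, the threshold is modeled as a nullable `Long`, and treating the 75% and 95% marks as inclusive is a guess. Only the numeric values (0.75, 0.95, 6.0, 0.3) come from the summary.

```java
/** Hypothetical sketch of the two-tier check; names and boundary handling are assumptions. */
final class CriticalLagTierSketch
{
  // Values quoted in the PR summary.
  static final double TIER1_FRACTION = 0.75;
  static final double TIER2_FRACTION = 0.95;
  static final double DEFAULT_AMPLIFICATION = 0.3;
  static final double CRITICAL_AMPLIFICATION = 6.0;

  enum Tier { NONE, TIER_1, TIER_2 }

  /** A null threshold disables the feature, so the result is always NONE. */
  static Tier classify(long aggregateLag, Long criticalLagThreshold)
  {
    if (criticalLagThreshold == null) {
      return Tier.NONE;
    }
    // Inclusive comparisons are an assumption; the summary does not state the boundary rule.
    if (aggregateLag >= TIER2_FRACTION * criticalLagThreshold) {
      return Tier.TIER_2;
    }
    if (aggregateLag >= TIER1_FRACTION * criticalLagThreshold) {
      return Tier.TIER_1;
    }
    return Tier.NONE;
  }

  /**
   * Tier 1 uses the boosted multiplier. Tier 2 skips the cost search, so this sketch
   * simply returns the default there (an assumption, since no cost is compared).
   */
  static double amplification(Tier tier)
  {
    return tier == Tier.TIER_1 ? CRITICAL_AMPLIFICATION : DEFAULT_AMPLIFICATION;
  }

  public static void main(String[] args)
  {
    Tier tier = classify(10_000_000L, 11_500_000L);
    System.out.println(tier + " amplification=" + amplification(tier));
  }
}
```

For a threshold of 11.5M, the tier-1 mark sits at 8.625M and the tier-2 mark at 10.925M, so an aggregate lag of 10M falls in tier 1 and 11M falls in tier 2.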
The plot shows how the tier-1 fast path (amp=6.0) responds across a range of operating conditions, holding `partitionCount=500`, `aggregateLag=10M` (above the 75% tier-1 mark but below the 95% tier-2 mark, given `criticalLagThreshold=11.5M`), and `currentTaskCount=50` fixed:
- `lagWeight`/`idleWeight` pairs (the two cost-function weights, always summing to 1.0), swept from lag-averse (0.1/0.9) to lag-aggressive (0.9/0.1).
- One line per `(idle, rate)` combination: color encodes current idle ratio (red=0.1, green=0.25), line style encodes processing rate (solid=50k/s, dashed=80k/s, dotted=100k/s).

The takeaway: even holding lag constant, the tier-1 boost isn't a flat override. Task count still climbs smoothly as `lagWeight` increases, but low-idle/low-rate combinations (red solid) saturate at the task-count ceiling far earlier (~0.6/0.4) than high-idle/high-rate combinations (green dotted, ~0.9/0.1), because the latter still have slack the idle-cost term can trade off against. This demonstrates that tier 1 amplifies urgency rather than short-circuiting the optimization the way tier 2 does.

This PR has: