feat: add width_bucket function for PySpark parity#7146
Conversation
Greptile SummaryAdds
Confidence Score: 5/5Safe to merge — the new function is well-isolated, all edge cases are covered by tests, and the saturating arithmetic correctly handles near-i64::MAX bucket counts without overflow. The implementation faithfully mirrors Spark's WidthBucket semantics, correctly handles all NULL-returning guards, and uses saturating_add(1) to prevent any integer overflow on the bucket calculation and the overflow-bucket return path. The test suite explicitly exercises the i64::MAX-1 saturation edge case, boundary values, descending ranges, null propagation, and type errors. No functional or correctness issues were identified. No files require special attention. Important Files Changed
Reviews (2): Last reviewed commit: "fix(width_bucket): saturate +1 near i64:..." | Re-trigger Greptile |
|
|
||
|
|
||
| def test_width_bucket_basic() -> None: | ||
| from daft.functions import width_bucket |
There was a problem hiding this comment.
Inline imports inside test functions
from daft.functions import width_bucket is repeated at the start of each of the nine new test functions. Per the project's style rule, import statements should be at the top of the file rather than inside function bodies. A single top-level import (alongside from daft import col, lit) covers all tests.
Rule Used: Import statements should be placed at the top of t... (source)
Learned From
Eventual-Inc/Daft#5078
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
There was a problem hiding this comment.
this seems like a valid concern
| if nb <= 0 | ||
| || nb == i64::MAX | ||
| || v.is_nan() | ||
| || mn == mx | ||
| || mn.is_nan() | ||
| || mn.is_infinite() | ||
| || mx.is_nan() | ||
| || mx.is_infinite() | ||
| { | ||
| return None; | ||
| } | ||
| let bucket = if mn < mx { | ||
| if v < mn { | ||
| 0 | ||
| } else if v >= mx { | ||
| nb + 1 |
There was a problem hiding this comment.
Integer overflow for
nb values just below i64::MAX
The guard nb == i64::MAX does not cover the full risky range. For nb in approximately [i64::MAX − 1023, i64::MAX − 1], the cast nb as f64 rounds up to 2^63 (the nearest representable f64). When v is close to mx, the formula ((nb as f64) * fraction) as i64 saturates to i64::MAX (Rust's saturating f64-to-i64 cast), and the subsequent + 1 wraps to i64::MIN in release mode (or panics in debug). Using saturating_add(1) instead of + 1 for both branch returns makes this safe without changing observed behaviour for all practical inputs.
Changes Made
Implements
width_bucket(value, min, max, num_bucket)for PySpark parity, returning the 1-indexed equiwidth histogram bucket ofvaluein[min, max].0below the range,num_bucket + 1at or above.min > max); orientation flips per Spark's spec.num_bucket <= 0,num_bucket == i64::MAX,min == max,valueis NaN, ormin/maxis NaN/Infinite — mirrors Spark'sWidthBucket.(f64, f64, f64, i64)internally, with non-integernum_buckettruncating toward zero (cf.pmod/hypot).Related Issues
Part of #3793.