feat: add width_bucket function for PySpark parity #7146

Open
YuangGao wants to merge 4 commits into Eventual-Inc:main from YuangGao:feat/width-bucket

@YuangGao

Contributor

Changes Made

Implements width_bucket(value, min, max, num_bucket) for PySpark parity, returning the 1-indexed equiwidth histogram bucket of value in [min, max].

  • Returns nullable Int64: 0 when value falls below the range, num_bucket + 1 when it falls at or above the range's upper end.
  • Supports descending bounds (min > max); orientation flips per Spark's spec.
  • Returns NULL when num_bucket <= 0, num_bucket == i64::MAX, min == max, value is NaN, or min/max is NaN/Infinite — mirrors Spark's WidthBucket.
  • Accepts any numeric input; casts to (f64, f64, f64, i64) internally, with non-integer num_bucket truncating toward zero (cf. pmod/hypot).
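The rules above can be sketched for a single row as follows. This is a minimal illustration, not the code in `width_bucket.rs`: the name `width_bucket_sketch` is hypothetical, the inputs are assumed to be already cast to `(f64, f64, f64, i64)`, and the interpolation formula is an assumption based on Spark's WidthBucket and the review discussion of this PR.

```rust
/// Hypothetical single-row sketch of the semantics described in this PR.
/// Returns `None` where the description says the result is NULL.
pub fn width_bucket_sketch(v: f64, mn: f64, mx: f64, nb: i64) -> Option<i64> {
    // NULL guards listed in the PR description.
    if nb <= 0
        || nb == i64::MAX
        || v.is_nan()
        || mn == mx
        || mn.is_nan()
        || mn.is_infinite()
        || mx.is_nan()
        || mx.is_infinite()
    {
        return None;
    }
    let (lower, upper) = (mn.min(mx), mn.max(mx));
    let width = upper - lower;
    let bucket = if mn < mx {
        // Ascending bounds: bucket numbers grow with the value.
        if v < lower {
            0
        } else if v >= upper {
            nb.saturating_add(1)
        } else {
            ((nb as f64 * (v - lower) / width) as i64).saturating_add(1)
        }
    } else {
        // Descending bounds: orientation flips, bucket 1 starts at `mn`.
        if v > upper {
            0
        } else if v <= lower {
            nb.saturating_add(1)
        } else {
            ((nb as f64 * (upper - v) / width) as i64).saturating_add(1)
        }
    };
    Some(bucket)
}
```

With four buckets over [0, 10], `width_bucket_sketch(3.0, 0.0, 10.0, 4)` returns `Some(2)`, while the same call with `mn == mx` returns `None`.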

Related Issues

Part of #3793.

@github-actions github-actions Bot added the feat label Jun 22, 2026
@YuangGao YuangGao marked this pull request as ready for review June 22, 2026 00:36
@greptile-apps

greptile-apps Bot commented Jun 22, 2026

Contributor

Greptile Summary

Adds width_bucket(value, min, max, num_bucket) for PySpark parity, returning the 1-indexed equiwidth histogram bucket as a nullable Int64. The implementation mirrors Spark's WidthBucket NULL semantics (invalid num_bucket, min == max, NaN/Infinite bounds, NaN value) and handles descending ranges by flipping the comparison direction.

  • Rust core (width_bucket.rs): casts inputs to (f64, f64, f64, i64), broadcasts scalars via align_lengths, and calls compute_bucket which uses saturating_add(1) throughout to safely handle nb values near i64::MAX whose f64 cast rounds up to 2^63.
  • Python bindings (numeric.py, __init__.py): thin wrapper calling _call_builtin_scalar_fn; width_bucket is appended after fill_nan in the import block and placed alphabetically in __all__.
  • Tests (test_numeric.py): nine test cases covering ascending, descending, integer inputs, boundary values, all NULL guards, null propagation, scalar broadcast, fractional num_bucket truncation, and the i64::MAX - 1 saturation edge case.

Confidence Score: 5/5

Safe to merge — the new function is well-isolated, all edge cases are covered by tests, and the saturating arithmetic correctly handles near-i64::MAX bucket counts without overflow.

The implementation faithfully mirrors Spark's WidthBucket semantics, correctly handles all NULL-returning guards, and uses saturating_add(1) to prevent any integer overflow on the bucket calculation and the overflow-bucket return path. The test suite explicitly exercises the i64::MAX-1 saturation edge case, boundary values, descending ranges, null propagation, and type errors. No functional or correctness issues were identified.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/daft-functions/src/numeric/width_bucket.rs | Core implementation of width_bucket; bucket logic, NULL guards, descending-range support, and saturating arithmetic are all correct and well-tested. |
| daft/functions/numeric.py | Adds Python wrapper width_bucket with clear docstring; parameter names shadow built-ins but are intentional for PySpark parity. |
| daft/functions/__init__.py | Exports width_bucket, appended after fill_nan in the numeric import block and placed alphabetically in __all__; no issues. |
| src/daft-functions/src/numeric/mod.rs | Registers the new WidthBucket UDF with the numeric function module; change is minimal and correct. |
| tests/recordbatch/numeric/test_numeric.py | Comprehensive test suite covering ascending/descending ranges, integer inputs, boundary values, all NULL guards, null propagation, scalar broadcast, fractional num_bucket truncation, type errors, and the i64::MAX-range saturation edge case. |

Reviews (2): Last reviewed commit: "fix(width_bucket): saturate +1 near i64:..."



def test_width_bucket_basic() -> None:
    from daft.functions import width_bucket

Contributor

P2 Inline imports inside test functions

from daft.functions import width_bucket is repeated at the start of each of the nine new test functions. Per the project's style rule, import statements should be at the top of the file rather than inside function bodies. A single top-level import (alongside from daft import col, lit) covers all tests.

Rule Used: Import statements should be placed at the top of t... (source)

Learned From
Eventual-Inc/Daft#5078

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

Contributor

this seems like a valid concern

Contributor Author

yes, updated!

Comment on lines +140 to +155
if nb <= 0
    || nb == i64::MAX
    || v.is_nan()
    || mn == mx
    || mn.is_nan()
    || mn.is_infinite()
    || mx.is_nan()
    || mx.is_infinite()
{
    return None;
}
let bucket = if mn < mx {
    if v < mn {
        0
    } else if v >= mx {
        nb + 1

Contributor

P2 Integer overflow for nb values just below i64::MAX

The guard nb == i64::MAX does not cover the full risky range. For nb in approximately [i64::MAX − 1023, i64::MAX − 1], the cast nb as f64 rounds up to 2^63 (the nearest representable f64). When v is close to mx, the formula ((nb as f64) * fraction) as i64 saturates to i64::MAX (Rust's saturating f64-to-i64 cast), and the subsequent + 1 wraps to i64::MIN in release mode (or panics in debug). Using saturating_add(1) instead of + 1 for both branch returns makes this safe without changing observed behaviour for all practical inputs.
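The arithmetic behind this concern can be checked in isolation. The snippet below is a small standalone sketch, not code from the PR; `near_max_demo` is a hypothetical name, and `i64::MAX - 1` is picked as one representative value from the risky range.

```rust
/// Hypothetical demo of the cast and add behaviour near `i64::MAX`.
/// Returns `(rounds_up, cast_back, checked_sum, saturated_sum)`.
pub fn near_max_demo() -> (bool, i64, Option<i64>, i64) {
    let nb: i64 = i64::MAX - 1;

    // 2^63 - 2 is not representable as an f64; the nearest f64 is 2^63.
    let as_float = nb as f64;
    let rounds_up = as_float == 9_223_372_036_854_775_808.0_f64;

    // Float-to-int `as` casts saturate, so 2^63 becomes i64::MAX.
    let cast_back = as_float as i64;

    // A plain `+ 1` here would overflow; `checked_add` reports that as None,
    // while `saturating_add` clamps the result to i64::MAX.
    (
        rounds_up,
        cast_back,
        cast_back.checked_add(1),
        cast_back.saturating_add(1),
    )
}
```

`wrapping_add(1)` on the same value yields `i64::MIN`, which is the result an unchecked `+ 1` produces in a release build.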

Contributor Author

updated

@YuangGao

Contributor Author

@greptileai
