feat: add width_bucket function for PySpark parity by YuangGao · Pull Request #7146 · Eventual-Inc/Daft

YuangGao · 2026-06-22T00:35:36Z

Changes Made

Implements width_bucket(value, min, max, num_bucket) for PySpark parity. It returns the 1-indexed equi-width histogram bucket that value falls into when the range from min to max is split into num_bucket buckets.

Returns a nullable Int64: 0 when value falls before the start of the range, num_bucket + 1 when it is at or beyond the end.
Supports descending bounds (min > max); the orientation flips per Spark's spec.
Returns NULL when num_bucket <= 0, num_bucket == i64::MAX, min == max, value is NaN, or min/max is NaN or infinite, mirroring Spark's WidthBucket.
Accepts any numeric input and casts to (f64, f64, f64, i64) internally; a non-integer num_bucket truncates toward zero (cf. pmod/hypot).
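
The rules above can be sketched as a small pure-Python reference model. This is only an illustration of the semantics described in this PR, not the Rust implementation: the name width_bucket_ref and the parameters lo and hi (used instead of min and max to avoid shadowing built-ins) are hypothetical, None stands in for NULL, and the order of floating-point operations is an assumption, so results on exact bucket edges may differ from Daft or Spark.

```python
import math

I64_MAX = 2**63 - 1  # largest value of the Int64 result type


def width_bucket_ref(value, lo, hi, num_bucket):
    """Pure-Python model of the rules listed above; returns None for NULL."""
    if value is None or lo is None or hi is None or num_bucket is None:
        return None  # null propagation
    v, mn, mx = float(value), float(lo), float(hi)
    nb = int(num_bucket)  # truncates toward zero, e.g. 3.9 -> 3
    if (
        nb <= 0
        or nb == I64_MAX
        or mn == mx
        or math.isnan(v)
        or not math.isfinite(mn)
        or not math.isfinite(mx)
    ):
        return None
    if mn < mx:  # ascending bounds
        if v < mn:
            return 0
        if v >= mx:
            return nb + 1
        frac = (v - mn) / (mx - mn)
    else:  # descending bounds: the orientation flips
        if v > mn:
            return 0
        if v <= mx:
            return nb + 1
        frac = (mn - v) / (mn - mx)
    # Clamp to the Int64 range to mimic saturating arithmetic.
    return min(int(nb * frac) + 1, I64_MAX)


print(width_bucket_ref(5.35, 0.024, 10.06, 5))  # ascending bounds
print(width_bucket_ref(5.3, 10.2, 0.2, 5))      # descending bounds
print(width_bucket_ref(1.0, 5.0, 5.0, 3))       # min == max gives NULL
```

Both of the first two calls land in bucket 3, and the third returns None because min == max.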

Related Issues

Part of #3793.

greptile-apps · 2026-06-22T00:46:22Z

Greptile Summary

Adds width_bucket(value, min, max, num_bucket) for PySpark parity, returning the 1-indexed equiwidth histogram bucket as a nullable Int64. The implementation mirrors Spark's WidthBucket NULL semantics (invalid num_bucket, min == max, NaN/Infinite bounds, NaN value) and handles descending ranges by flipping the comparison direction.

Rust core (width_bucket.rs): casts inputs to (f64, f64, f64, i64), broadcasts scalars via align_lengths, and calls compute_bucket which uses saturating_add(1) throughout to safely handle nb values near i64::MAX whose f64 cast rounds up to 2^63.
Python bindings (numeric.py, __init__.py): thin wrapper calling _call_builtin_scalar_fn; width_bucket is appended after fill_nan in the import block and placed alphabetically in __all__.
Tests (test_numeric.py): nine test cases covering ascending, descending, integer inputs, boundary values, all NULL guards, null propagation, scalar broadcast, fractional num_bucket truncation, and the i64::MAX - 1 saturation edge case.

Confidence Score: 5/5

Safe to merge — the new function is well-isolated, all edge cases are covered by tests, and the saturating arithmetic correctly handles near-i64::MAX bucket counts without overflow.

The implementation faithfully mirrors Spark's WidthBucket semantics, correctly handles all NULL-returning guards, and uses saturating_add(1) to prevent any integer overflow on the bucket calculation and the overflow-bucket return path. The test suite explicitly exercises the i64::MAX-1 saturation edge case, boundary values, descending ranges, null propagation, and type errors. No functional or correctness issues were identified.

No files require special attention.

Important Files Changed

Filename	Overview
src/daft-functions/src/numeric/width_bucket.rs	Core implementation of width_bucket; bucket logic, NULL guards, descending-range support, and saturating arithmetic are all correct and well-tested.
daft/functions/numeric.py	Adds Python wrapper width_bucket with clear docstring; parameter names shadow built-ins but are intentional for PySpark parity.
daft/functions/__init__.py	Exports width_bucket appended after fill_nan in the numeric import block and alphabetically placed in __all__; no issues.
src/daft-functions/src/numeric/mod.rs	Registers the new WidthBucket UDF with the numeric function module; change is minimal and correct.
tests/recordbatch/numeric/test_numeric.py	Comprehensive test suite covering ascending/descending ranges, integer inputs, boundary values, all NULL guards, null propagation, scalar broadcast, fractional num_bucket truncation, type errors, and the i64::MAX-range saturation edge case.

Reviews (2): Last reviewed commit: "fix(width_bucket): saturate +1 near i64:..."

greptile-apps · 2026-06-22T00:46:27Z

+
+
+def test_width_bucket_basic() -> None:
+    from daft.functions import width_bucket


Inline imports inside test functions

from daft.functions import width_bucket is repeated at the start of each of the nine new test functions. Per the project's style rule, import statements should be at the top of the file rather than inside function bodies. A single top-level import (alongside from daft import col, lit) covers all tests.

Rule Used: Import statements should be placed at the top of t... (source)

Learned From
Eventual-Inc/Daft#5078

srilman: this seems like a valid concern

YuangGao: yes, updated!

greptile-apps · 2026-06-22T00:46:28Z

+    if nb <= 0
+        || nb == i64::MAX
+        || v.is_nan()
+        || mn == mx
+        || mn.is_nan()
+        || mn.is_infinite()
+        || mx.is_nan()
+        || mx.is_infinite()
+    {
+        return None;
+    }
+    let bucket = if mn < mx {
+        if v < mn {
+            0
+        } else if v >= mx {
+            nb + 1


Integer overflow for nb values just below i64::MAX

The guard nb == i64::MAX does not cover the full risky range. For nb in approximately [i64::MAX − 1023, i64::MAX − 1], the cast nb as f64 rounds up to 2^63 (the nearest representable f64). When v is close to mx, the formula ((nb as f64) * fraction) as i64 saturates to i64::MAX (Rust's saturating f64-to-i64 cast), and the subsequent + 1 wraps to i64::MIN in release mode (or panics in debug). Using saturating_add(1) instead of + 1 for both branch returns makes this safe without changing observed behaviour for all practical inputs.
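
The arithmetic behind this comment can be reproduced with a small Python model. This is a sketch only: cast_f64_to_i64, wrapping_add, and saturating_add are hypothetical helpers that imitate Rust's saturating `as i64` cast, release-mode `+`, and `i64::saturating_add`, and the values of v, mn, and mx are made-up inputs chosen so that v < mx while the computed fraction still rounds to 1.0.

```python
I64_MAX = 2**63 - 1
I64_MIN = -(2**63)


def cast_f64_to_i64(x: float) -> int:
    """Model of Rust's saturating `as i64` cast for finite inputs."""
    return max(I64_MIN, min(I64_MAX, int(x)))


def wrapping_add(a: int, b: int) -> int:
    """Two's-complement 64-bit addition, as `+` behaves in release builds."""
    return (a + b - I64_MIN) % 2**64 + I64_MIN


def saturating_add(a: int, b: int) -> int:
    """Model of i64::saturating_add: clamp instead of wrapping."""
    return max(I64_MIN, min(I64_MAX, a + b))


nb = I64_MAX - 1   # passes the `nb == i64::MAX` guard
nb_f = float(nb)   # 2**63 - 2 is not representable, so this rounds up to 2**63

# Hypothetical inputs: v < mx, but both differences round to 1e17.
v, mn, mx = 0.5, -1e17, 1.0
fraction = (v - mn) / (mx - mn)

raw = cast_f64_to_i64(nb_f * fraction)  # saturates at I64_MAX

print(nb_f == 2.0**63)                   # the cast rounded up
print(fraction == 1.0)                   # rounding hides the gap below mx
print(wrapping_add(raw, 1) == I64_MIN)   # plain `+ 1` wraps around
print(saturating_add(raw, 1) == I64_MAX) # saturating_add(1) stays at the cap
```

All four lines print True. Under round-to-nearest-even the upward rounding in this model starts at i64::MAX − 511, which is consistent with the comment's "approximately".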

YuangGao · 2026-06-22T01:06:10Z

@greptileai

feat: add width_bucket function for PySpark parity

48b13ee

github-actions Bot added the feat label Jun 22, 2026

YuangGao marked this pull request as ready for review June 22, 2026 00:36

greptile-apps Bot reviewed Jun 22, 2026

fix(width_bucket): saturate +1 near i64::MAX, tighten tests

e3b0af4

YuangGao and others added 2 commits June 22, 2026 16:45

style(width_bucket): hoist test import to module top

60909b7

Merge branch 'main' into feat/width-bucket

b5b3088

YuangGao wants to merge 4 commits into Eventual-Inc:main from YuangGao:feat/width-bucket

2 participants


