feat(functions): add string distance/similarity functions by nish2292 · Pull Request #7068 · Eventual-Inc/Daft

nish2292 · 2026-06-04T01:19:22Z

Changes Made

Add four pairwise string distance/similarity functions as pure Rust scalar UDFs:

levenshtein_distance - minimum edit distance (Int64)
jaro_similarity - similarity score 0.0-1.0 (Float64)
jaro_winkler_similarity - Jaro with prefix bonus (Float64)
damerau_levenshtein_distance - Levenshtein + transpositions (Int64)

Follows the existing hamming_distance_str pattern. No external dependencies. Exposed via daft.functions API and as Expression methods. Null-safe (returns null when either input is null).
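
For readers unfamiliar with the algorithm behind levenshtein_distance, here is a minimal pure-Python sketch of the two-row edit-distance recurrence with null-in, null-out behaviour. It is an illustration only, not the PR's Rust code: the function name and the use of None for null are assumptions, and it compares Python code points, which may differ from how the Rust implementation iterates over a string.

```python
from typing import Optional


def levenshtein_distance(a: Optional[str], b: Optional[str]) -> Optional[int]:
    """Two-row edit distance; returns None when either input is None."""
    if a is None or b is None:
        return None  # null-safe: propagate the missing value
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            cost = int(ca != cb)  # 0 when the characters match, else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[len(b)]


print(levenshtein_distance("kitten", "sitting"))  # 3
print(levenshtein_distance("kitten", None))       # None
```

Only two rows are kept at a time, so memory grows with the shorter string rather than with the product of both lengths.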

Related Issues

Fixes #6794

Test Plan

24 pytest test cases in tests/functions/test_string_distance.py:

Levenshtein (6 tests): basic edit distance, empty strings, null handling, identical strings, single-char edits (substitution/insertion/deletion), expression method
Jaro (6 tests): identical strings, completely different strings, known reference values (martha/marhta = 0.944444), null handling, empty vs nonempty, expression method
Jaro-Winkler (6 tests): identical strings, prefix bonus >= Jaro invariant, known reference values (martha/marhta = 0.961111), no common prefix (JW == Jaro), null handling, expression method
Damerau-Levenshtein (6 tests): basic transposition, transposition vs standard Levenshtein (ab/ba = 1 vs 2), empty strings, identical strings, null handling, expression method

DAFT_RUNNER=native pytest tests/functions/test_string_distance.py -v
24 passed in 0.06s
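
The martha/marhta reference values quoted in the test plan can be reproduced with a short pure-Python sketch of Jaro and Jaro-Winkler. This is an illustration under stated assumptions, not the PR's Rust implementation: the function names mirror the PR, the prefix scale of 0.1 is the conventional default, and returning 1.0 for two empty strings is a choice made here that the PR may or may not share.

```python
def jaro_similarity(a: str, b: str) -> float:
    """Jaro similarity in [0.0, 1.0]."""
    if a == b:
        return 1.0  # assumption: identical inputs (including "" vs "") score 1.0
    if not a or not b:
        return 0.0
    # Characters only match if they are at most `window` positions apart.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_matched = [False] * len(a)
    b_matched = [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        lo = max(0, i - window)
        hi = min(len(b), i + window + 1)
        for j in range(lo, hi):
            if not b_matched[j] and b[j] == ca:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as half a
    # transposition each.
    a_seq = [c for c, m in zip(a, a_matched) if m]
    b_seq = [c for c, m in zip(b, b_matched) if m]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) // 2
    return (
        matches / len(a) + matches / len(b) + (matches - transpositions) / matches
    ) / 3.0


def jaro_winkler_similarity(a: str, b: str, prefix_scale: float = 0.1) -> float:
    """Jaro plus a bonus for a shared prefix of up to 4 characters."""
    jaro = jaro_similarity(a, b)
    prefix = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        prefix += 1
    return jaro + prefix * prefix_scale * (1.0 - jaro)


print(round(jaro_similarity("martha", "marhta"), 6))          # 0.944444
print(round(jaro_winkler_similarity("martha", "marhta"), 6))  # 0.961111
```

For martha/marhta there are 6 matches and 1 transposition, so Jaro is (1 + 1 + 5/6) / 3; the shared prefix "mar" then adds 3 × 0.1 × (1 − Jaro).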

Rust compilation verified:

cargo check --workspace  # zero errors
cargo clippy -p daft-functions-utf8 --no-deps  # zero warnings on new code

AI Disclosure

AI-assisted implementation (Claude Opus 4.6).

- add levenshtein_distance, jaro_similarity, jaro_winkler_similarity, damerau_levenshtein_distance
- pure Rust implementations with no external dependencies, following hamming_distance_str pattern
- expose as top-level daft.functions API and Expression methods
- handle null inputs (return null) and null-typed columns (DataType::Null)
- include 24 pytest test cases covering correctness, edge cases, and null handling

- apply rustfmt to levenshtein.rs, jaro.rs, damerau_levenshtein.rs
- apply ruff format to test_string_distance.py
- fix jaro_similarity and jaro_winkler_similarity docstring examples to use full-precision Float64 output

- use i64::from(bool) instead of if/else for boolean-to-int conversion
- use iter_mut().enumerate() instead of indexing loop (needless_range_loop)
- use mul_add for jaro-winkler formula (suboptimal_flops)
- replace "abd" with "acb" in docstring example (spellcheck flagged "abd")

nish2292 · 2026-06-04T03:13:22Z

It seems the failures in the PRBs are unrelated to this change. As in #7063, they are caused by a non-deterministic seeded shuffle.

greptile-apps · 2026-06-04T03:31:39Z

Greptile Summary

This PR adds four pairwise string distance/similarity functions — levenshtein_distance, jaro_similarity, jaro_winkler_similarity, and damerau_levenshtein_distance — as Rust scalar UDFs, following the established hamming_distance_str pattern. A shared generic helper binary_str_distance in utils.rs eliminates per-function boilerplate for null tracking, broadcasting, and array construction.

Four new Rust UDFs in daft-functions-utf8, each delegating evaluation and field inference to shared helpers in utils.rs.
Python exposure via daft.functions and Expression methods; damerau_levenshtein_distance is correctly documented as the OSA variant with a disclaimer that results may diverge from true Damerau-Levenshtein.
24 pytest cases covering correctness, null propagation, empty-string edge cases, scalar broadcasting, and expression-method equivalence.

Confidence Score: 5/5

Safe to merge — all four algorithms are correctly implemented, null propagation and scalar broadcasting work as intended, and the OSA distinction for Damerau-Levenshtein is disclosed at both the Rust and Python documentation levels.

The shared binary_str_distance helper in utils.rs eliminates the previously flagged boilerplate duplication. The Jaro algorithm was manually verified against the martha/marhta reference (0.9444). Levenshtein's two-row space optimization is correct and symmetric. Jaro-Winkler's mul_add formula matches the standard. The OSA variant disclaimer is present in the Python docstring. Tests cover all critical edge cases including null propagation and scalar broadcasting in both directions.

No files require special attention.

Important Files Changed

Filename	Overview
src/daft-functions-utf8/src/utils.rs	Adds binary_str_distance and binary_str_distance_to_field shared helpers; correctly handles null tracking via NullBufferBuilder, scalar broadcasting, full-null short-circuit, and empty array cases.
src/daft-functions-utf8/src/levenshtein.rs	Correct Wagner-Fischer implementation using two-row O(min(n,m)) space; safely swaps shorter/longer strings for inner-loop efficiency.
src/daft-functions-utf8/src/jaro.rs	Correct Jaro implementation; match_distance uses saturating_sub to avoid underflow; transposition counting loop verified against martha/marhta reference (0.9444).
src/daft-functions-utf8/src/jaro_winkler.rs	Correct Jaro-Winkler formula via mul_add; prefix-length capped at 4 characters; reuses compute_jaro_similarity to avoid duplication.
src/daft-functions-utf8/src/damerau_levenshtein.rs	Implements OSA variant; correctly documented at both Rust and Python docstring level including the concrete CA→ABC divergence example.
daft/functions/str.py	Four new Python wrappers with complete docstrings, doctests, type annotations, and OSA disclaimer; follows hamming_distance_str pattern exactly.
daft/expressions/expressions.py	Adds four Expression methods using inline imports, consistent with the existing hamming/regexp_count/length_bytes pattern in the same file.
tests/functions/test_string_distance.py	24 tests covering all four functions: correctness, empty strings, null propagation, known reference values, scalar broadcasting both directions, and expression-method equivalence.
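
The OSA (optimal string alignment) distinction noted in the table can be seen with a small pure-Python sketch. This is an illustration of the OSA recurrence only, not the PR's Rust code; the name osa_distance is hypothetical. OSA allows a transposition of adjacent characters but never edits the same substring twice, so it reports 3 for CA→ABC, whereas the unrestricted Damerau-Levenshtein distance for that pair is 2 (CA → AC → ABC).

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every character of a's prefix
    for j in range(cols):
        d[0][j] = j  # insert every character of b's prefix
    for i in range(1, rows):
        for j in range(1, cols):
            cost = int(a[i - 1] != b[j - 1])
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition: the last two characters are swapped.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(osa_distance("ab", "ba"))   # 1 (plain Levenshtein gives 2)
print(osa_distance("ca", "abc"))  # 3 (unrestricted Damerau-Levenshtein gives 2)
```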

Reviews (3): Last reviewed commit: "Merge branch 'main' into feat/string-dis..."

…ions
- document damerau_levenshtein_distance computes OSA variant, noting it differs from true Damerau-Levenshtein for overlapping transpositions
- extract shared binary_str_distance and binary_str_distance_to_field helpers in utils.rs
- collapse identical call/get_return_field boilerplate across the 4 UDFs into the generic helpers (-91 net lines)
- update Rust docstring for damerau_levenshtein to note OSA variant

cckellogg · 2026-06-04T20:27:54Z

@greptileai

cckellogg · 2026-06-04T21:54:45Z

+            let mut values = Vec::with_capacity(len);
+            let mut validity = NullBufferBuilder::new(len);
+
+            for i in 0..len {


Thanks for the contribution. I think binary_str_distance might need to use the existing broadcast-aware UTF-8 input handling here.

Right now it sets len = left.len() and then indexes both arrays with left.get(i) / right.get(i). That works for column-column inputs with equal lengths, but it breaks valid scalar broadcast cases like:

levenshtein_distance(col("a"), "kitten")
levenshtein_distance("kitten", col("a"))

For col("a"), "kitten", right has length 1, so right.get(i) can go out of bounds once i > 0. For "kitten", col("a"), the helper uses left.len() as the output length, so it can produce only one row instead of broadcasting the left scalar across the column.

Could we reuse the same pattern as the nearby UTF-8 helpers: parse_inputs to compute/validate the expected output size, then create_broadcasted_str_iter for both sides? find_impl does something similar. Roughly, it would look like this:

    left.with_utf8_array(|left| {
        right.with_utf8_array(|right| {
            let (is_full_null, expected_size) = parse_inputs(left, &[right])
                .map_err(|e| DaftError::ValueError(format!("Error in {name}: {e}")))?;
            if is_full_null {
                return Ok(
                    DataArray::<T>::full_null(name, &return_dtype, expected_size).into_series(),
                );
            }
            if expected_size == 0 {
                return Ok(DataArray::<T>::empty(name, &return_dtype).into_series());
            }

            let left_iter = create_broadcasted_str_iter(left, expected_size);
            let right_iter = create_broadcasted_str_iter(right, expected_size);

            let mut values = Vec::with_capacity(expected_size);
            let mut validity = NullBufferBuilder::new(expected_size);
            for (l, r) in left_iter.zip(right_iter) {
                match (l, r) {
                    (Some(l), Some(r)) => {
                        values.push(compute(l, r));
                        validity.append_non_null();
                    }
                    _ => {
                        values.push(T::Native::default());
                        validity.append_null();
                    }
                }
            }

            let result = DataArray::<T>::from_field_and_values(field.clone(), values)
                .with_nulls(validity.finish())?;
            Ok(result.into_series())
        })
    })

Good catch, I've addressed this as per your suggestions. I confirmed that it resolved the OOB and length issues. Also added regression tests for scalar broadcasts.

- rewrite binary_str_distance to use parse_inputs + create_broadcasted_str_iter, matching the broadcast-aware pattern of other utf8 helpers
- fixes out-of-bounds panic for col-scalar (e.g. levenshtein_distance(col("a"), "kitten")) and wrong-length output for scalar-col inputs
- handle full-null and empty-input cases explicitly
- add TestScalarBroadcast regression tests covering col-scalar, scalar-col, and null-scalar
- addresses maintainer review feedback (PR Eventual-Inc#7068)
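
The broadcasting fix discussed in this thread can be sketched in pure Python. This is a loose model of the idea only, not Daft's code: expected_size, broadcast, and binary_str_op are hypothetical stand-ins for parse_inputs, create_broadcasted_str_iter, and binary_str_distance, lists stand in for arrays, and the compute function is a deliberately trivial placeholder.

```python
from itertools import repeat
from typing import Callable, Iterator, List, Optional, Sequence


def expected_size(left: Sequence, right: Sequence) -> int:
    """Output length: the common length, treating length-1 inputs as scalars."""
    lengths = {len(left), len(right)} - {1}
    if len(lengths) > 1:
        raise ValueError(f"length mismatch: {len(left)} vs {len(right)}")
    return lengths.pop() if lengths else 1


def broadcast(values: Sequence[Optional[str]], size: int) -> Iterator[Optional[str]]:
    """Yield `size` items, repeating a length-1 input as a scalar."""
    if len(values) == size:
        return iter(values)
    return repeat(values[0], size)  # only reached for length-1 inputs


def binary_str_op(
    left: Sequence[Optional[str]],
    right: Sequence[Optional[str]],
    compute: Callable[[str, str], int],
) -> List[Optional[int]]:
    size = expected_size(left, right)
    out: List[Optional[int]] = []
    # Zipping two broadcast iterators avoids indexing either side by row,
    # which is what went out of bounds when one side had length 1.
    for l, r in zip(broadcast(left, size), broadcast(right, size)):
        out.append(None if l is None or r is None else compute(l, r))
    return out


def length_gap(a: str, b: str) -> int:
    """Placeholder metric; a real distance function would go here."""
    return abs(len(a) - len(b))


print(binary_str_op(["kitten", None, "sitting"], ["kitten"], length_gap))  # [0, None, 1]
print(binary_str_op(["kitten"], ["a", "bb"], length_gap))                  # [5, 4]
```

Both the column-scalar and scalar-column orders produce one output row per column row, which is the behaviour the regression tests in this commit are described as covering.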

codspeed-hq · 2026-06-05T18:11:28Z

Merging this PR will not alter performance

✅ 40 untouched benchmarks
⏩ 10 skipped benchmarks¹

Comparing nish2292:feat/string-distance-functions (1651d2e) with main (afa221d)

¹ 10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, they can be archived to remove them from the performance reports.

cckellogg · 2026-06-08T21:03:23Z

@greptileai

* origin/main: (115 commits)
  feat: add ignore_corrupt_files option to read_parquet, read_csv and read_iceberg (Eventual-Inc#6520)
  fix(deps): gate vllm to Linux so macOS/Windows resolve without CUDA wheels (Eventual-Inc#7095)
  fix: pass options in Gravitino PostgreSQL read method (Eventual-Inc#7047)
  feat(ray): Implement dynamic scale-in for RaySwordfishActor (Eventual-Inc#5903)
  feat(delta-lake): support column mapping for reads (Eventual-Inc#7005)
  feat(functions): add string distance/similarity functions (Eventual-Inc#7068)
  test(parquet): cover read_parquet edge cases (Eventual-Inc#7085)
  refactor(checkpoint): drop "seal" vocabulary from Rust API surface (Eventual-Inc#7078)
  fix(asof-join): use unknown clustering spec instead of hash (Eventual-Inc#7075)
  docs: standardize Slack links to use daft.ai/slack (Eventual-Inc#7066)
  feat: add try_cast function for safe type conversion (Eventual-Inc#6960)
  refactor(file): rename File byte-range fields to position/size (Eventual-Inc#6747)
  fix(ray): configure worker startup timeout on runner (Eventual-Inc#7055)
  feat(shuffle): default flight shuffle compression to lz4 (Eventual-Inc#7071)
  feat(iceberg): support branch and tag reads (Eventual-Inc#7042)
  fix(shuffle): concat recordbatches before repartition (Eventual-Inc#7064)
  perf: update jemalloc 5.3.0 → 5.3.1 to fix muzzy decay performance bug (Eventual-Inc#7059)
  feat: thread assume_sorted_and_aligned_partitions parameter through ASOF join (Eventual-Inc#7067)
  fix(flight-shuffle): reduce coordinator memory to O(map_tasks + partitions) (Eventual-Inc#7056)
  refactor(distributed): rename needs_hash_repartition to can_skip_hash_repartition (Eventual-Inc#7053)
  ...

# Conflicts:
#   daft/checkpoint.py
#   src/daft-distributed/src/pipeline_node/limit.rs
#   src/daft-distributed/src/pipeline_node/stage_checkpoint_keys.rs
#   src/daft-distributed/src/scheduling/task.rs
#   src/daft-local-execution/src/pipeline.rs
#   src/daft-local-execution/src/sinks/blocking_sink.rs
#   src/daft-local-execution/src/sources/scan_task.rs

github-actions Bot added the feat label Jun 4, 2026

nish2292 and others added 3 commits June 3, 2026 18:34

fix: resolve style and doctest CI failures

51add79

Merge branch 'main' into feat/string-distance-functions

abdc4e4

nish2292 marked this pull request as ready for review June 4, 2026 03:27

greptile-apps Bot reviewed Jun 4, 2026

Comment thread src/daft-functions-utf8/src/levenshtein.rs Outdated

cckellogg reviewed Jun 4, 2026

nish2292 and others added 3 commits June 4, 2026 17:03

Merge branch 'main' into feat/string-distance-functions

ae6fcf2

Merge branch 'main' into feat/string-distance-functions

d3ae12c

nish2292 added 3 commits June 5, 2026 14:37

Merge branch 'main' into feat/string-distance-functions

f45f529

Merge branch 'main' into feat/string-distance-functions

d1c7628

Merge branch 'main' into feat/string-distance-functions

1651d2e

cckellogg merged commit 806719a into Eventual-Inc:main Jun 9, 2026
93 of 97 checks passed

nish2292 deleted the feat/string-distance-functions branch June 9, 2026 16:51

cckellogg merged 11 commits into Eventual-Inc:main from nish2292:feat/string-distance-functions

2 participants
