Skip to content

feat(functions): add string distance/similarity functions#7068

Merged
cckellogg merged 11 commits into
Eventual-Inc:mainfrom
nish2292:feat/string-distance-functions
Jun 9, 2026
Merged

feat(functions): add string distance/similarity functions#7068
cckellogg merged 11 commits into
Eventual-Inc:mainfrom
nish2292:feat/string-distance-functions

Conversation

@nish2292

@nish2292 nish2292 commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Changes Made

Add four pairwise string distance/similarity functions as pure Rust scalar UDFs:

  • levenshtein_distance - minimum edit distance (Int64)
  • jaro_similarity - similarity score 0.0-1.0 (Float64)
  • jaro_winkler_similarity - Jaro with prefix bonus (Float64)
  • damerau_levenshtein_distance - Levenshtein + transpositions (Int64)

Follows the existing hamming_distance_str pattern. No external dependencies. Exposed via daft.functions API and as Expression methods. Null-safe (returns null when either input is null).

Related Issues

Fixes #6794

Test Plan

24 pytest test cases in tests/functions/test_string_distance.py:

  • Levenshtein (6 tests): basic edit distance, empty strings, null handling, identical strings, single-char edits (substitution/insertion/deletion), expression method
  • Jaro (6 tests): identical strings, completely different strings, known reference values (martha/marhta = 0.944444), null handling, empty vs nonempty, expression method
  • Jaro-Winkler (6 tests): identical strings, prefix bonus >= Jaro invariant, known reference values (martha/marhta = 0.961111), no common prefix (JW == Jaro), null handling, expression method
  • Damerau-Levenshtein (6 tests): basic transposition, transposition vs standard Levenshtein (ab/ba = 1 vs 2), empty strings, identical strings, null handling, expression method
DAFT_RUNNER=native pytest tests/functions/test_string_distance.py -v
24 passed in 0.06s

Rust compilation verified:

cargo check --workspace  # zero errors
cargo clippy -p daft-functions-utf8 --no-deps  # zero warnings on new code

AI Disclosure

AI-assisted implementation (Claude Opus 4.6).

- add levenshtein_distance, jaro_similarity, jaro_winkler_similarity, damerau_levenshtein_distance
- pure Rust implementations with no external dependencies, following hamming_distance_str pattern
- expose as top-level daft.functions API and Expression methods
- handle null inputs (return null) and null-typed columns (DataType::Null)
- include 24 pytest test cases covering correctness, edge cases, and null handling
@github-actions github-actions Bot added the feat label Jun 4, 2026
nish2292 and others added 3 commits June 3, 2026 18:34
- apply rustfmt to levenshtein.rs, jaro.rs, damerau_levenshtein.rs
- apply ruff format to test_string_distance.py
- fix jaro_similarity and jaro_winkler_similarity docstring examples to use full-precision Float64 output
- use i64::from(bool) instead of if/else for boolean-to-int conversion
- use iter_mut().enumerate() instead of indexing loop (needless_range_loop)
- use mul_add for jaro-winkler formula (suboptimal_flops)
- replace "abd" with "acb" in docstring example (spellcheck flagged "abd")
@nish2292

nish2292 commented Jun 4, 2026

Copy link
Copy Markdown
Contributor Author

Seems like the failure in the PRBs are unrelated like in #7063 caused by a non-deterministic seeded shuffle..

@nish2292 nish2292 marked this pull request as ready for review June 4, 2026 03:27
@greptile-apps

greptile-apps Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds four pairwise string distance/similarity functions — levenshtein_distance, jaro_similarity, jaro_winkler_similarity, and damerau_levenshtein_distance — as Rust scalar UDFs, following the established hamming_distance_str pattern. A shared generic helper binary_str_distance in utils.rs eliminates per-function boilerplate for null tracking, broadcasting, and array construction.

  • Four new Rust UDFs in daft-functions-utf8, each delegating evaluation and field inference to shared helpers in utils.rs.
  • Python exposure via daft.functions and Expression methods; damerau_levenshtein_distance is correctly documented as the OSA variant with a disclaimer that results may diverge from true Damerau-Levenshtein.
  • 24 pytest cases covering correctness, null propagation, empty-string edge cases, scalar broadcasting, and expression-method equivalence.

Confidence Score: 5/5

Safe to merge — all four algorithms are correctly implemented, null propagation and scalar broadcasting work as intended, and the OSA distinction for Damerau-Levenshtein is disclosed at both the Rust and Python documentation levels.

The shared binary_str_distance helper in utils.rs eliminates the previously flagged boilerplate duplication. The Jaro algorithm was manually verified against the martha/marhta reference (0.9444). Levenshtein's two-row space optimization is correct and symmetric. Jaro-Winkler's mul_add formula matches the standard. The OSA variant disclaimer is present in the Python docstring. Tests cover all critical edge cases including null propagation and scalar broadcasting in both directions.

No files require special attention.

Important Files Changed

Filename Overview
src/daft-functions-utf8/src/utils.rs Adds binary_str_distance and binary_str_distance_to_field shared helpers; correctly handles null tracking via NullBufferBuilder, scalar broadcasting, full-null short-circuit, and empty array cases.
src/daft-functions-utf8/src/levenshtein.rs Correct Wagner-Fischer implementation using two-row O(min(n,m)) space; safely swaps shorter/longer strings for inner-loop efficiency.
src/daft-functions-utf8/src/jaro.rs Correct Jaro implementation; match_distance uses saturating_sub to avoid underflow; transposition counting loop verified against martha/marhta reference (0.9444).
src/daft-functions-utf8/src/jaro_winkler.rs Correct Jaro-Winkler formula via mul_add; prefix-length capped at 4 characters; reuses compute_jaro_similarity to avoid duplication.
src/daft-functions-utf8/src/damerau_levenshtein.rs Implements OSA variant; correctly documented at both Rust and Python docstring level including the concrete CA→ABC divergence example.
daft/functions/str.py Four new Python wrappers with complete docstrings, doctests, type annotations, and OSA disclaimer; follows hamming_distance_str pattern exactly.
daft/expressions/expressions.py Adds four Expression methods using inline imports, consistent with the existing hamming/regexp_count/length_bytes pattern in the same file.
tests/functions/test_string_distance.py 24 tests covering all four functions: correctness, empty strings, null propagation, known reference values, scalar broadcasting both directions, and expression-method equivalence.

Reviews (3): Last reviewed commit: "Merge branch 'main' into feat/string-dis..." | Re-trigger Greptile

Comment thread src/daft-functions-utf8/src/levenshtein.rs Outdated
…ions

- document damerau_levenshtein_distance computes OSA variant, noting it differs from true Damerau-Levenshtein for overlapping transpositions
- extract shared binary_str_distance and binary_str_distance_to_field helpers in utils.rs
- collapse identical call/get_return_field boilerplate across the 4 UDFs into the generic helpers (-91 net lines)
- update Rust docstring for damerau_levenshtein to note OSA variant
@cckellogg

Copy link
Copy Markdown
Contributor

@greptileai

Comment thread src/daft-functions-utf8/src/utils.rs Outdated
let mut values = Vec::with_capacity(len);
let mut validity = NullBufferBuilder::new(len);

for i in 0..len {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution. I think binary_str_distance might need to use the existing broadcast-aware UTF-8 input handling here.

Right now it sets len = left.len() and then indexes both arrays with left.get(i) / right.get(i). That works for column-column inputs with equal lengths, but it breaks valid scalar broadcast cases like:

levenshtein_distance(col("a"), "kitten")
levenshtein_distance("kitten", col("a"))

For col("a"), "kitten", right has length 1, so right.get(i) can go out of bounds once i > 0. For "kitten", col("a"), the helper uses left.len() as the output length, so it can produce only one row instead of broadcasting the left scalar across the column.

Could we reuse the same pattern as the nearby UTF-8 helpers, parse_inputs to compute/validate the expected output size, then create_broadcasted_str_iter for both sides? find_impl does something similar. Roughly it would looks like this:

left.with_utf8_array(|left| {
  right.with_utf8_array(|right| {
      let (is_full_null, expected_size) = parse_inputs(left, &[right])
          .map_err(|e| DaftError::ValueError(format!("Error in {name}: {e}")))?;

      if is_full_null {
          return Ok(
              DataArray::<T>::full_null(name, &return_dtype, expected_size).into_series(),
          );
      }
      if expected_size == 0 {
          return Ok(DataArray::<T>::empty(name, &return_dtype).into_series());
      }

      let left_iter = create_broadcasted_str_iter(left, expected_size);
      let right_iter = create_broadcasted_str_iter(right, expected_size);

      let mut values = Vec::with_capacity(expected_size);
      let mut validity = NullBufferBuilder::new(expected_size);

      for (l, r) in left_iter.zip(right_iter) {
          match (l, r) {
              (Some(l), Some(r)) => {
                  values.push(compute(l, r));
                  validity.append_non_null();
              }
              _ => {
                  values.push(T::Native::default());
                  validity.append_null();
              }
          }
      }

      let result = DataArray::<T>::from_field_and_values(field.clone(), values)
          .with_nulls(validity.finish())?;
      Ok(result.into_series())
  })
})

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, I've addressed this as per your suggestions. I confirmed that it resolved the OOB and length issues. Also added regression tests for scalar broadcasts.

nish2292 and others added 3 commits June 4, 2026 17:03
- rewrite binary_str_distance to use parse_inputs + create_broadcasted_str_iter, matching the broadcast-aware pattern of other utf8 helpers
- fixes out-of-bounds panic for col-scalar (e.g. levenshtein_distance(col("a"), "kitten")) and wrong-length output for scalar-col inputs
- handle full-null and empty-input cases explicitly
- add TestScalarBroadcast regression tests covering col-scalar, scalar-col, and null-scalar
- addresses maintainer review feedback (PR Eventual-Inc#7068)
@codspeed-hq

codspeed-hq Bot commented Jun 5, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

✅ 40 untouched benchmarks
⏩ 10 skipped benchmarks1


Comparing nish2292:feat/string-distance-functions (1651d2e) with main (afa221d)

Open in CodSpeed

Footnotes

  1. 10 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@cckellogg

Copy link
Copy Markdown
Contributor

@greptileai

@cckellogg cckellogg merged commit 806719a into Eventual-Inc:main Jun 9, 2026
93 of 97 checks passed
@nish2292 nish2292 deleted the feat/string-distance-functions branch June 9, 2026 16:51
chenghuichen added a commit to chenghuichen/Daft that referenced this pull request Jun 10, 2026
* origin/main: (115 commits)
  feat: add ignore_corrupt_files option to read_parquet, read_csv and read_iceberg (Eventual-Inc#6520)
  fix(deps): gate vllm to Linux so macOS/Windows resolve without CUDA wheels (Eventual-Inc#7095)
  fix: pass options in Gravitino PostgreSQL read method (Eventual-Inc#7047)
  feat(ray): Implement dynamic scale-in for RaySwordfishActor (Eventual-Inc#5903)
  feat(delta-lake): support column mapping for reads (Eventual-Inc#7005)
  feat(functions): add string distance/similarity functions (Eventual-Inc#7068)
  test(parquet): cover read_parquet edge cases (Eventual-Inc#7085)
  refactor(checkpoint): drop "seal" vocabulary from Rust API surface (Eventual-Inc#7078)
  fix(asof-join): use unknown clustering spec instead of hash           (Eventual-Inc#7075)
  docs: standardize Slack links to use daft.ai/slack (Eventual-Inc#7066)
  feat: add try_cast function for safe type conversion (Eventual-Inc#6960)
  refactor(file): rename File byte-range fields to position/size (Eventual-Inc#6747)
  fix(ray): configure worker startup timeout on runner (Eventual-Inc#7055)
  feat(shuffle): default flight shuffle compression to lz4 (Eventual-Inc#7071)
  feat(iceberg): support branch and tag reads (Eventual-Inc#7042)
  fix(shuffle): concat recordbatches before repartition (Eventual-Inc#7064)
  perf: update jemalloc 5.3.0 → 5.3.1 to fix muzzy decay performance bug (Eventual-Inc#7059)
  feat: thread assume_sorted_and_aligned_partitions parameter through ASOF join (Eventual-Inc#7067)
  fix(flight-shuffle): reduce coordinator memory to O(map_tasks + partitions) (Eventual-Inc#7056)
  refactor(distributed): rename needs_hash_repartition to can_skip_hash_repartition      (Eventual-Inc#7053)
  ...

# Conflicts:
#	daft/checkpoint.py
#	src/daft-distributed/src/pipeline_node/limit.rs
#	src/daft-distributed/src/pipeline_node/stage_checkpoint_keys.rs
#	src/daft-distributed/src/scheduling/task.rs
#	src/daft-local-execution/src/pipeline.rs
#	src/daft-local-execution/src/sinks/blocking_sink.rs
#	src/daft-local-execution/src/sources/scan_task.rs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(functions): string distance/similarity functions

2 participants