feat(functions): add string distance/similarity functions#7068
Conversation
- add levenshtein_distance, jaro_similarity, jaro_winkler_similarity, damerau_levenshtein_distance - pure Rust implementations with no external dependencies, following hamming_distance_str pattern - expose as top-level daft.functions API and Expression methods - handle null inputs (return null) and null-typed columns (DataType::Null) - include 24 pytest test cases covering correctness, edge cases, and null handling
- apply rustfmt to levenshtein.rs, jaro.rs, damerau_levenshtein.rs - apply ruff format to test_string_distance.py - fix jaro_similarity and jaro_winkler_similarity docstring examples to use full-precision Float64 output
- use i64::from(bool) instead of if/else for boolean-to-int conversion - use iter_mut().enumerate() instead of indexing loop (needless_range_loop) - use mul_add for jaro-winkler formula (suboptimal_flops) - replace "abd" with "acb" in docstring example (spellcheck flagged "abd")
|
Seems like the failure in the PRBs are unrelated like in #7063 caused by a non-deterministic seeded shuffle.. |
Greptile SummaryThis PR adds four pairwise string distance/similarity functions —
Confidence Score: 5/5Safe to merge — all four algorithms are correctly implemented, null propagation and scalar broadcasting work as intended, and the OSA distinction for Damerau-Levenshtein is disclosed at both the Rust and Python documentation levels. The shared binary_str_distance helper in utils.rs eliminates the previously flagged boilerplate duplication. The Jaro algorithm was manually verified against the martha/marhta reference (0.9444). Levenshtein's two-row space optimization is correct and symmetric. Jaro-Winkler's mul_add formula matches the standard. The OSA variant disclaimer is present in the Python docstring. Tests cover all critical edge cases including null propagation and scalar broadcasting in both directions. No files require special attention. Important Files Changed
Reviews (3): Last reviewed commit: "Merge branch 'main' into feat/string-dis..." | Re-trigger Greptile |
…ions - document damerau_levenshtein_distance computes OSA variant, noting it differs from true Damerau-Levenshtein for overlapping transpositions - extract shared binary_str_distance and binary_str_distance_to_field helpers in utils.rs - collapse identical call/get_return_field boilerplate across the 4 UDFs into the generic helpers (-91 net lines) - update Rust docstring for damerau_levenshtein to note OSA variant
| let mut values = Vec::with_capacity(len); | ||
| let mut validity = NullBufferBuilder::new(len); | ||
|
|
||
| for i in 0..len { |
There was a problem hiding this comment.
Thanks for the contribution. I think binary_str_distance might need to use the existing broadcast-aware UTF-8 input handling here.
Right now it sets len = left.len() and then indexes both arrays with left.get(i) / right.get(i). That works for column-column inputs with equal lengths, but it breaks valid scalar broadcast cases like:
levenshtein_distance(col("a"), "kitten")
levenshtein_distance("kitten", col("a"))
For col("a"), "kitten", right has length 1, so right.get(i) can go out of bounds once i > 0. For "kitten", col("a"), the helper uses left.len() as the output length, so it can produce only one row instead of broadcasting the left scalar across the column.
Could we reuse the same pattern as the nearby UTF-8 helpers, parse_inputs to compute/validate the expected output size, then create_broadcasted_str_iter for both sides? find_impl does something similar. Roughly it would looks like this:
left.with_utf8_array(|left| {
right.with_utf8_array(|right| {
let (is_full_null, expected_size) = parse_inputs(left, &[right])
.map_err(|e| DaftError::ValueError(format!("Error in {name}: {e}")))?;
if is_full_null {
return Ok(
DataArray::<T>::full_null(name, &return_dtype, expected_size).into_series(),
);
}
if expected_size == 0 {
return Ok(DataArray::<T>::empty(name, &return_dtype).into_series());
}
let left_iter = create_broadcasted_str_iter(left, expected_size);
let right_iter = create_broadcasted_str_iter(right, expected_size);
let mut values = Vec::with_capacity(expected_size);
let mut validity = NullBufferBuilder::new(expected_size);
for (l, r) in left_iter.zip(right_iter) {
match (l, r) {
(Some(l), Some(r)) => {
values.push(compute(l, r));
validity.append_non_null();
}
_ => {
values.push(T::Native::default());
validity.append_null();
}
}
}
let result = DataArray::<T>::from_field_and_values(field.clone(), values)
.with_nulls(validity.finish())?;
Ok(result.into_series())
})
})
There was a problem hiding this comment.
Good catch, I've addressed this as per your suggestions. I confirmed that it resolved the OOB and length issues. Also added regression tests for scalar broadcasts.
- rewrite binary_str_distance to use parse_inputs + create_broadcasted_str_iter, matching the broadcast-aware pattern of other utf8 helpers
- fixes out-of-bounds panic for col-scalar (e.g. levenshtein_distance(col("a"), "kitten")) and wrong-length output for scalar-col inputs
- handle full-null and empty-input cases explicitly
- add TestScalarBroadcast regression tests covering col-scalar, scalar-col, and null-scalar
- addresses maintainer review feedback (PR Eventual-Inc#7068)
Merging this PR will not alter performance
Comparing Footnotes
|
* origin/main: (115 commits) feat: add ignore_corrupt_files option to read_parquet, read_csv and read_iceberg (Eventual-Inc#6520) fix(deps): gate vllm to Linux so macOS/Windows resolve without CUDA wheels (Eventual-Inc#7095) fix: pass options in Gravitino PostgreSQL read method (Eventual-Inc#7047) feat(ray): Implement dynamic scale-in for RaySwordfishActor (Eventual-Inc#5903) feat(delta-lake): support column mapping for reads (Eventual-Inc#7005) feat(functions): add string distance/similarity functions (Eventual-Inc#7068) test(parquet): cover read_parquet edge cases (Eventual-Inc#7085) refactor(checkpoint): drop "seal" vocabulary from Rust API surface (Eventual-Inc#7078) fix(asof-join): use unknown clustering spec instead of hash (Eventual-Inc#7075) docs: standardize Slack links to use daft.ai/slack (Eventual-Inc#7066) feat: add try_cast function for safe type conversion (Eventual-Inc#6960) refactor(file): rename File byte-range fields to position/size (Eventual-Inc#6747) fix(ray): configure worker startup timeout on runner (Eventual-Inc#7055) feat(shuffle): default flight shuffle compression to lz4 (Eventual-Inc#7071) feat(iceberg): support branch and tag reads (Eventual-Inc#7042) fix(shuffle): concat recordbatches before repartition (Eventual-Inc#7064) perf: update jemalloc 5.3.0 → 5.3.1 to fix muzzy decay performance bug (Eventual-Inc#7059) feat: thread assume_sorted_and_aligned_partitions parameter through ASOF join (Eventual-Inc#7067) fix(flight-shuffle): reduce coordinator memory to O(map_tasks + partitions) (Eventual-Inc#7056) refactor(distributed): rename needs_hash_repartition to can_skip_hash_repartition (Eventual-Inc#7053) ... # Conflicts: # daft/checkpoint.py # src/daft-distributed/src/pipeline_node/limit.rs # src/daft-distributed/src/pipeline_node/stage_checkpoint_keys.rs # src/daft-distributed/src/scheduling/task.rs # src/daft-local-execution/src/pipeline.rs # src/daft-local-execution/src/sinks/blocking_sink.rs # src/daft-local-execution/src/sources/scan_task.rs
Changes Made
Add four pairwise string distance/similarity functions as pure Rust scalar UDFs:
levenshtein_distance- minimum edit distance (Int64)jaro_similarity- similarity score 0.0-1.0 (Float64)jaro_winkler_similarity- Jaro with prefix bonus (Float64)damerau_levenshtein_distance- Levenshtein + transpositions (Int64)Follows the existing
hamming_distance_strpattern. No external dependencies. Exposed viadaft.functionsAPI and as Expression methods. Null-safe (returns null when either input is null).Related Issues
Fixes #6794
Test Plan
24 pytest test cases in
tests/functions/test_string_distance.py:Rust compilation verified:
AI Disclosure
AI-assisted implementation (Claude Opus 4.6).