feat(string): add Spark-compatible find_in_set, overlay, url_encode, url_decode and levenshtein alias #7138
…url_decode and levenshtein alias
- find_in_set(str, str_array): 1-based comma-separated lookup, 0 when not found or
needle contains comma; null-safe.
- overlay(input, replace, pos[, len]): replace substring starting at 1-based pos
with replace; len < 0 or omitted falls back to char length of replace; pos < 1 is
clamped to 1; pos beyond string appends.
- url_encode / url_decode: application/x-www-form-urlencoded semantics matching
Spark (space <-> +, %XX percent encoding); UTF-8 round-trip safe.
- levenshtein: register Spark-style alias for existing levenshtein_distance.
Coverage:
- Rust core (src/daft-functions-utf8/src/{find_in_set,overlay,url}.rs)
- SQL registration via Utf8Functions module (auto via add_fn + aliases())
- Python API (daft/functions/str.py + daft/functions/__init__.py)
- 49 Rust unit tests + 19 Python integration tests, all passing.
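The find_in_set rules listed in this commit message can be modelled in a few lines of plain Python. This is only a sketch of the described semantics, not the Rust implementation in find_in_set.rs; the function below is a hypothetical stand-in, and `None` plays the role of NULL.

```python
from typing import Optional


def find_in_set(needle: Optional[str], str_array: Optional[str]) -> Optional[int]:
    """Hypothetical pure-Python model of the described find_in_set semantics."""
    # Null-safe: a NULL on either side yields NULL.
    if needle is None or str_array is None:
        return None
    # A needle that contains a comma can never equal a single list element.
    if "," in needle:
        return 0
    # 1-based position of the first exact match, 0 when there is none.
    for position, item in enumerate(str_array.split(","), start=1):
        if item == needle:
            return position
    return 0
```

With this model, `find_in_set("b", "a,b,c")` evaluates to 2 and `find_in_set("a,b", "a,b,c")` to 0, matching the examples given in the PR description.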
Greptile Summary
This PR adds five Spark-compatible string functions (

Confidence Score: 4/5
Safe to merge once the overlay null-propagation issue for the len column is addressed; all other new functions behave correctly.
Three of the four new functions (find_in_set, url_encode/url_decode, levenshtein alias) are solid. overlay has a defect in how null values inside the len column are handled: they silently produce computed output instead of propagating null, diverging from documented Spark semantics. No test covers a len column with individual nulls, so the bug would not be caught by the existing suite.
src/daft-functions-utf8/src/overlay.rs: the null-propagation logic for the optional len argument needs a fix.
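To make the len-column concern concrete, here is a minimal Python sketch of overlay with the null handling the review asks for. It is not the code in overlay.rs: `overlay` and `overlay_column` are hypothetical names, the columns are assumed to have equal length (no broadcasting), and the rule "a NULL in any supplied argument yields NULL" is taken from the review text rather than checked against Spark here.

```python
from typing import List, Optional, Sequence


def overlay(s: str, replace: str, pos: int, length: int = -1) -> str:
    """Hypothetical scalar kernel following the semantics in the PR text."""
    if length < 0:
        length = len(replace)          # negative or omitted: use char length of replace
    start = min(max(pos, 1) - 1, len(s))  # clamp pos < 1 to 1; pos past the end appends
    end = min(start + length, len(s))
    return s[:start] + replace + s[end:]


def overlay_column(
    inputs: Sequence[Optional[str]],
    replaces: Sequence[Optional[str]],
    positions: Sequence[Optional[int]],
    lengths: Optional[Sequence[Optional[int]]] = None,
) -> List[Optional[str]]:
    """Row-wise evaluation; lengths=None means the len argument was omitted."""
    out: List[Optional[str]] = []
    for i in range(len(inputs)):
        row = [inputs[i], replaces[i], positions[i]]
        if lengths is not None:
            row.append(lengths[i])
        # A null inside the len column propagates, unlike an omitted len column.
        out.append(None if any(v is None for v in row) else overlay(*row))
    return out
```

The key distinction is between an omitted len argument (`lengths=None`, which falls back to the length of replace) and a len column that is present but holds a null in some row (which should produce a null for that row).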
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
PY["Python API\n(daft/functions/str.py)"]
SQL["SQL Layer\n(Utf8Functions module)"]
RUST["Rust ScalarUDF\n(daft-functions-utf8)"]
PY -->|"_call_builtin_scalar_fn(name, ...)"| SQL
SQL -->|"add_fn / aliases()"| RUST
subgraph RUST
FIS["FindInSet\nfind_in_set.rs"]
OVL["Overlay\noverlay.rs"]
URL["UrlEncode / UrlDecode\nurl.rs"]
LEV["LevenshteinDistance\n+ levenshtein alias\nlevenshtein.rs"]
end
OVL -->|"compute_overlay()"| CO["char-indexed\nString builder"]
FIS -->|"compute_find_in_set()"| SPLIT["split(',') iterator"]
URL -->|"encode_url_form()"| ENC["percent-encode UTF-8 bytes"]
URL -->|"decode_url_form()"| DEC["percent-decode bytes to UTF-8"]
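The two URL edges in the flowchart (percent-encode UTF-8 bytes, percent-decode bytes to UTF-8) can be sketched in Python as follows. The function names mirror the labels in the diagram, but the bodies are an illustrative model, not the Rust code in url.rs. The unreserved set (letters, digits, `-`, `_`, `.`, `*`) is an assumption based on Java's URLEncoder, which Spark is understood to delegate to.

```python
import string

# Assumption: characters left unescaped, modelled on Java's URLEncoder.
_UNRESERVED = set(string.ascii_letters + string.digits + "-_.*")


def encode_url_form(text: str) -> str:
    """Form-encode text: space becomes '+', other bytes become %XX."""
    out = []
    for byte in text.encode("utf-8"):
        ch = chr(byte)
        if ch in _UNRESERVED:
            out.append(ch)
        elif byte == 0x20:
            out.append("+")
        else:
            out.append(f"%{byte:02X}")
    return "".join(out)


def decode_url_form(text: str) -> str:
    """Reverse of encode_url_form; raises ValueError on malformed input."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "+":
            out.append(0x20)
            i += 1
        elif ch == "%":
            pair = text[i + 1 : i + 3]
            if len(pair) != 2 or any(c not in string.hexdigits for c in pair):
                raise ValueError(f"malformed percent-sequence at index {i}")
            out.append(int(pair, 16))
            i += 3
        else:
            out.extend(ch.encode("utf-8"))
            i += 1
    # Invalid UTF-8 surfaces as UnicodeDecodeError, a ValueError subclass.
    return out.decode("utf-8")
```

Decoding into a byte buffer first and converting to UTF-8 once at the end is what keeps multi-byte characters round-trip safe.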
return Err(format!(
    "broadcast length mismatch: {input_len} vs {others:?}"
));
The error message reports input_len and the whole others slice, but the mismatch is actually between the accumulated result_len and the current l. This makes the diagnostic confusing.
- return Err(format!(
-     "broadcast length mismatch: {input_len} vs {others:?}"
- ));
+ return Err(format!(
+     "broadcast length mismatch: {result_len} vs {l}"
+ ));
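To illustrate why the suggested message is clearer, here is a small Python model of a broadcast-length check. The surrounding Rust function is not shown in this thread, so the loop below is an assumption about its shape: lengths must either match the accumulated `result_len` or be 1, and a length-1 accumulator is widened to the first longer length. `resolve_broadcast_len` is a hypothetical name.

```python
from typing import Sequence


def resolve_broadcast_len(input_len: int, others: Sequence[int]) -> int:
    """Hypothetical model: fold argument lengths into one broadcast length."""
    result_len = input_len
    for l in others:  # `l` mirrors the variable name used in the review comment
        if l == result_len or l == 1:
            continue
        if result_len == 1:
            # A length-1 accumulator broadcasts up to the longer length.
            result_len = l
            continue
        # Report the two values that actually disagree.
        raise ValueError(f"broadcast length mismatch: {result_len} vs {l}")
    return result_len
```

For `resolve_broadcast_len(1, [3, 2])`, the conflict is between the accumulated length 3 and the current length 2. A message built from `input_len` and the whole `others` slice would instead mention 1 and [3, 2], which does not identify the pair that failed.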
Summary
Adds five Spark-compatible string functions to close the API parity gap tracked in #7137 and SPARK_FUNCTION_COMPARISON.md.
- find_in_set(str, str_array): 1-based comma-separated lookup; returns 0 when not found or when str contains a comma; null-safe (NULL in → NULL out).
- overlay(input, replace, pos[, len]): replaces the substring starting at 1-based pos with replace; len < 0 or omitted falls back to char_length(replace); pos < 1 is clamped to 1; pos beyond the string length appends. Works for both Utf8 and Binary.
- url_encode / url_decode: application/x-www-form-urlencoded semantics matching Spark (space ↔ +, %XX percent-encoding); UTF-8 round-trip safe; url_decode errors on malformed %-sequences (Spark-compatible).
- levenshtein: registers a Spark-style alias for the existing levenshtein_distance so Spark / Spark-Connect SQL works out of the box.

Changes Made
Implementation lives in daft-functions-utf8 and is wired through Rust → SQL → Python so it is available from DataFrame, SQL, and Spark Connect.
- src/daft-functions-utf8/src/find_in_set.rs
- src/daft-functions-utf8/src/overlay.rs
- src/daft-functions-utf8/src/url.rs (url_encode, url_decode)
- levenshtein alias added to the existing levenshtein_distance ScalarUDF.
- SQL: registered via the Utf8Functions module (add_fn + aliases()), so SELECT find_in_set(...), SELECT overlay(...), SELECT url_encode(...), SELECT url_decode(...), SELECT levenshtein(a, b) all work.
- Python: daft/functions/str.py, re-exported from daft/functions/__init__.py.
- Expression.str.* accessors where applicable.
- SPARK_FUNCTION_COMPARISON.md updated to mark these five as ✅ implemented.

Test Coverage
- Covered cases include pos / len in overlay, comma-in-needle for find_in_set, and + ↔ space and %-decoding round-trips for url_encode / url_decode.
- cargo fmt / cargo clippy / pre-commit clean.

Spark Compatibility
Behavior verified against Spark 3.5 reference:

| Function | Expression | Result | Notes |
| --- | --- | --- | --- |
| find_in_set | find_in_set('b', 'a,b,c') | 2 | |
| find_in_set | find_in_set('a,b', 'a,b,c') | 0 | |
| overlay | overlay('Spark SQL' PLACING '_' FROM 6) | 'Spark_SQL' | |
| overlay | overlay('Spark SQL', 'CORE', 7, 0) | 'Spark CORESQL' | …style position semantics |
| url_encode | url_encode('a b+c') | 'a+b%2Bc' | |
| url_decode | url_decode('a+b%2Bc') | 'a b+c' | |
| levenshtein | levenshtein('kitten', 'sitting') | 3 | |

Related Issues
Closes #7137
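
The levenshtein example listed under Spark Compatibility can be sanity-checked with a textbook dynamic-programming edit distance. This is a generic sketch, not the existing levenshtein_distance implementation that the alias points to; the function name is reused only for readability.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic two-row edit distance (insert, delete, substitute cost 1)."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(
                min(
                    previous[j] + 1,               # delete ca
                    current[j - 1] + 1,            # insert cb
                    previous[j - 1] + (ca != cb),  # substitute, free on a match
                )
            )
        previous = current
    return previous[-1]
```

`levenshtein("kitten", "sitting")` evaluates to 3 under this definition, in line with the value quoted above.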