feat(string): add Spark-compatible find_in_set, overlay, url_encode, url_decode and levenshtein alias #7138

Open
XuQianJin-Stars wants to merge 2 commits into Eventual-Inc:main from XuQianJin-Stars:feat/spark-string-functions

@XuQianJin-Stars

Contributor

Summary

Adds five Spark-compatible string functions to close the API parity gap tracked in #7137 and SPARK_FUNCTION_COMPARISON.md.

  • find_in_set(str, str_array): 1-based comma-separated lookup; returns 0 when not found or when str contains a comma; null-safe (NULL in → NULL out).
  • overlay(input, replace, pos[, len]): replace substring starting at 1-based pos with replace; len < 0 or omitted falls back to char_length(replace); pos < 1 is clamped to 1; pos beyond the string length appends. Works for both Utf8 and Binary.
  • url_encode / url_decode: application/x-www-form-urlencoded semantics matching Spark (space ↔ +, %XX percent-encoding); UTF-8 round-trip safe; url_decode errors on malformed %-sequences (Spark-compatible).
  • levenshtein: register a Spark-style alias for the existing levenshtein_distance so Spark / Spark-Connect SQL works out of the box.
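
The find_in_set and overlay rules listed above can be modeled in a few lines of Python. This is a reference sketch of the described semantics only, not the Rust code in this PR; the function signatures and the parameter name `length` (standing in for len) are assumptions made for illustration, and Python string indexing stands in for the character-based indexing the PR describes.

```python
from typing import Optional


def find_in_set(needle: Optional[str], str_array: Optional[str]) -> Optional[int]:
    """Return the 1-based index of needle in a comma-separated list, or 0."""
    if needle is None or str_array is None:
        return None  # NULL in -> NULL out
    if "," in needle:
        return 0  # a needle containing a comma can never equal a single item
    for index, item in enumerate(str_array.split(","), start=1):
        if item == needle:
            return index
    return 0


def overlay(
    text: Optional[str],
    replace: Optional[str],
    pos: Optional[int],
    length: int = -1,  # negative means "omitted" in this sketch
) -> Optional[str]:
    """Replace `length` characters of text, starting at 1-based pos, with replace."""
    if text is None or replace is None or pos is None:
        return None
    # Clamp pos to [1, len(text) + 1]; a pos past the end therefore appends.
    start = min(max(pos, 1), len(text) + 1) - 1
    if length < 0:
        length = len(replace)  # fall back to the character length of replace
    end = min(start + length, len(text))
    return text[:start] + replace + text[end:]
```

With `length=0` the replacement is inserted without removing anything, and Python `str` slicing works on code points, so multi-byte characters are never split.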

Changes Made

Implementation lives in daft-functions-utf8 and is wired through Rust → SQL → Python so it is available from DataFrame, SQL, and Spark Connect.

  • Rust core
    • src/daft-functions-utf8/src/find_in_set.rs
    • src/daft-functions-utf8/src/overlay.rs
    • src/daft-functions-utf8/src/url.rs (url_encode, url_decode)
    • levenshtein alias added to the existing levenshtein_distance ScalarUDF.
    • Vectorized arrow2 paths with explicit NULL handling and UTF-8-safe character indexing.
  • SQL
    • Auto-registered via the Utf8Functions module (add_fn + aliases()), so SELECT find_in_set(...), SELECT overlay(...), SELECT url_encode(...), SELECT url_decode(...), and SELECT levenshtein(a, b) all work.
  • Python API
    • New wrappers in daft/functions/str.py, re-exported from daft/functions/__init__.py.
    • Available as both top-level functions and Expression.str.* accessors where applicable.
  • Docs
    • SPARK_FUNCTION_COMPARISON.md updated to mark these five as ✅ implemented.

Test Coverage

  • 49 Rust unit tests covering NULL propagation, empty inputs, ASCII, multi-byte UTF-8, edge cases for pos / len in overlay, comma-in-needle for find_in_set, + ↔ space and %-decoding round-trips for url_encode/url_decode.
  • 19 Python integration tests across DataFrame API and SQL surfaces.
  • All tests pass locally; cargo fmt / cargo clippy / pre-commit clean.

Spark Compatibility

Behavior verified against Spark 3.5 reference:

| Function | Spark example | Result |
| --- | --- | --- |
| find_in_set | find_in_set('b', 'a,b,c') | 2 |
| find_in_set | find_in_set('a,b', 'a,b,c') | 0 |
| overlay | overlay('Spark SQL' PLACING '_' FROM 6) | 'Spark_SQL' |
| overlay | overlay('Spark SQL', 'CORE', 7, 0) | 'Spark CORESQL' (len = 0 inserts at pos without replacing) |
| url_encode | url_encode('a b+c') | 'a+b%2Bc' |
| url_decode | url_decode('a+b%2Bc') | 'a b+c' |
| levenshtein | levenshtein('kitten', 'sitting') | 3 |
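
The url_encode and url_decode rows can be reproduced with Python's standard library, which also implements application/x-www-form-urlencoded. The sketch below is a stand-in for checking the expected values, not the PR's Rust code. The strict check for malformed %-sequences is an assumption about how the described error behavior might look, and Python's set of unescaped characters may differ from Spark's for a few characters such as `*` and `~`.

```python
import re
from typing import Optional
from urllib.parse import quote_plus, unquote_plus

# A '%' that is not followed by exactly two hex digits is malformed.
_MALFORMED_PERCENT = re.compile(r"%(?![0-9A-Fa-f]{2})")


def url_encode(value: Optional[str]) -> Optional[str]:
    """Form-encode a string: space becomes '+', other reserved bytes become %XX."""
    if value is None:
        return None
    return quote_plus(value, encoding="utf-8")


def url_decode(value: Optional[str]) -> Optional[str]:
    """Form-decode a string, raising ValueError on malformed %-sequences."""
    if value is None:
        return None
    if _MALFORMED_PERCENT.search(value):
        raise ValueError(f"malformed percent-encoding in {value!r}")
    # errors="strict" also rejects byte sequences that are not valid UTF-8.
    return unquote_plus(value, encoding="utf-8", errors="strict")
```

Unlike `unquote_plus` on its own, which leaves a stray `%` untouched, this wrapper fails loudly, which is closer to the error behavior described for url_decode.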

Related Issues

Closes #7137

…url_decode and levenshtein alias

- find_in_set(str, str_array): 1-based comma-separated lookup, 0 when not found or
  needle contains comma; null-safe.
- overlay(input, replace, pos[, len]): replace substring starting at 1-based pos
  with replace; len <0 or omitted falls back to char length of replace; pos<1 is
  clamped to 1; pos beyond string appends.
- url_encode / url_decode: application/x-www-form-urlencoded semantics matching
  Spark (space <-> +, %XX percent encoding); UTF-8 round-trip safe.
- levenshtein: register Spark-style alias for existing levenshtein_distance.

Coverage:
- Rust core (src/daft-functions-utf8/src/{find_in_set,overlay,url}.rs)
- SQL registration via Utf8Functions module (auto via add_fn + aliases())
- Python API (daft/functions/str.py + daft/functions/__init__.py)
- 49 Rust unit tests + 19 Python integration tests, all passing.
@github-actions github-actions Bot added the feat label Jun 17, 2026
@greptile-apps

greptile-apps Bot commented Jun 17, 2026

Contributor

Greptile Summary

This PR adds five Spark-compatible string functions (find_in_set, overlay, url_encode, url_decode, and a levenshtein SQL alias) implemented end-to-end in Rust, wired through to SQL and the Python DataFrame API to close the gap tracked in #7137.

  • find_in_set and url_encode/url_decode are cleanly implemented with correct null propagation, broadcasting, and UTF-8 safety.
  • overlay has a null-propagation bug: when len is supplied as a column with null values, those nulls silently fall back to the "use replace length" code path rather than producing NULL output, diverging from Spark semantics.
  • levenshtein alias is a minimal, correct change adding one aliases() override to the existing UDF.

Confidence Score: 4/5

Safe to merge once the overlay null-propagation issue for the len column is addressed; all other new functions behave correctly.

Three of the four changes (find_in_set, url_encode/url_decode, and the levenshtein alias) are solid. overlay has a defect in how null values inside the len column are handled: they silently produce computed output instead of propagating null, diverging from documented Spark semantics. No test covers a len column with individual nulls, so the existing suite would not catch the bug.

src/daft-functions-utf8/src/overlay.rs — the null-propagation logic for the optional len argument needs a fix.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/daft-functions-utf8/src/overlay.rs | New overlay scalar UDF; NULL propagation for the optional len argument is broken: null values in a len column silently fall back to replace_char_len instead of propagating null. |
| src/daft-functions-utf8/src/find_in_set.rs | New find_in_set scalar UDF; the 1-based lookup, comma-in-needle guard, null propagation, and broadcasting all look correct. |
| src/daft-functions-utf8/src/url.rs | New url_encode / url_decode UDFs; application/x-www-form-urlencoded semantics, percent-encoding boundary checks, and round-trip UTF-8 safety all look correct. |
| src/daft-functions-utf8/src/levenshtein.rs | Adds levenshtein SQL alias to the existing LevenshteinDistance UDF; minimal and correct. |
| src/daft-functions-utf8/src/lib.rs | Registers the four new modules and their UDFs with Utf8Functions; mechanical and correct. |
| daft/functions/str.py | Python wrappers for find_in_set, overlay, url_encode, url_decode with good docstrings; overlay correctly lifts int literals to lit(). |
| daft/functions/__init__.py | New functions re-exported and added to __all__ in correct alphabetical order. |
| tests/functions/test_spark_string_functions.py | 19 integration tests covering the happy path, edge cases, null propagation, and SQL alias; no test exercises a len column with individual null values, which would expose the null-propagation bug. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    PY["Python API\n(daft/functions/str.py)"]
    SQL["SQL Layer\n(Utf8Functions module)"]
    RUST["Rust ScalarUDF\n(daft-functions-utf8)"]

    PY -->|"_call_builtin_scalar_fn(name, ...)"| SQL
    SQL -->|"add_fn / aliases()"| RUST

    subgraph RUST
        FIS["FindInSet\nfind_in_set.rs"]
        OVL["Overlay\noverlay.rs"]
        URL["UrlEncode / UrlDecode\nurl.rs"]
        LEV["LevenshteinDistance\n+ levenshtein alias\nlevenshtein.rs"]
    end

    OVL -->|"compute_overlay()"| CO["char-indexed\nString builder"]
    FIS -->|"compute_find_in_set()"| SPLIT["split(',') iterator"]
    URL -->|"encode_url_form()"| ENC["percent-encode UTF-8 bytes"]
    URL -->|"decode_url_form()"| DEC["percent-decode bytes to UTF-8"]

Comments Outside Diff (1)

  1. src/daft-functions-utf8/src/overlay.rs, lines 578-609

    P1 NULL propagation broken for len argument

    When len is explicitly provided as a column that contains null values, those nulls should propagate to the output (Spark semantics: overlay('abc', 'X', 1, NULL) → NULL). Two separate gaps allow nulls in len to silently fall through to the "omitted len" code path:

    1. any_full_null never checks len_series, so an all-null len column skips the full-null short-circuit and reaches the computation loop.
    2. The mapping closure matches only on (i, r, p); l is passed through unconditionally, regardless of nullity. When l = None (a null row in the len column), the call becomes compute_overlay(i, r, p, None), which silently uses replace_char_len instead of returning None.
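
One way to close both gaps is to keep "len was omitted" separate from "len is null on this row". The Python sketch below models that distinction with plain lists standing in for columns. `overlay_column` and `_overlay_scalar` are hypothetical names for illustration and do not correspond to the actual helpers in overlay.rs.

```python
from typing import List, Optional, Sequence


def _overlay_scalar(text: str, replace: str, pos: int, length: Optional[int]) -> str:
    """Scalar overlay; length=None means the len argument was omitted."""
    start = min(max(pos, 1), len(text) + 1) - 1
    if length is None or length < 0:
        length = len(replace)
    end = min(start + length, len(text))
    return text[:start] + replace + text[end:]


def overlay_column(
    inputs: Sequence[Optional[str]],
    replaces: Sequence[Optional[str]],
    positions: Sequence[Optional[int]],
    lengths: Optional[Sequence[Optional[int]]] = None,  # None = argument omitted
) -> List[Optional[str]]:
    n = len(inputs)
    # Gap 1: an all-null len column short-circuits like any other all-null input.
    if lengths is not None and all(value is None for value in lengths):
        return [None] * n

    out: List[Optional[str]] = []
    for row in range(n):
        i, r, p = inputs[row], replaces[row], positions[row]
        if i is None or r is None or p is None:
            out.append(None)
        elif lengths is None:
            # len omitted entirely: use the replace-length fallback.
            out.append(_overlay_scalar(i, r, p, None))
        elif lengths[row] is None:
            # Gap 2: len supplied but null on this row, so the output is null.
            out.append(None)
        else:
            out.append(_overlay_scalar(i, r, p, lengths[row]))
    return out
```

A regression test with a len column such as `[None, 1]` would then yield `[None, 'Xbc']` for input `'abc'` and replace `'X'` at pos 1, covering the case the current suite misses.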

Reviews (1): Last reviewed commit: "feat(string): add Spark-compatible find_..."

Comment thread src/daft-functions-utf8/src/overlay.rs Outdated
Comment on lines +206 to +208
return Err(format!(
"broadcast length mismatch: {input_len} vs {others:?}"
));

P2 The error message reports input_len and the whole others slice, but the mismatch is actually between the accumulated result_len and the current l. This makes the diagnostic confusing.

Suggested change
- return Err(format!(
-     "broadcast length mismatch: {input_len} vs {others:?}"
- ));
+ return Err(format!(
+     "broadcast length mismatch: {result_len} vs {l}"
+ ));
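
To see why the suggested message is more useful, here is a small Python model of a broadcast-length check that folds over the argument lengths. It is a sketch under the assumption that the Rust helper accumulates a running result_len in a similar way; `broadcast_len` is a hypothetical name, and the PR's actual function is not shown in this thread.

```python
from typing import Sequence


def broadcast_len(input_len: int, others: Sequence[int]) -> int:
    """Resolve a common length where length-1 arguments broadcast to any length."""
    result_len = input_len
    for l in others:
        if l == result_len or l == 1:
            continue  # compatible as-is, or l broadcasts up to result_len
        if result_len == 1:
            result_len = l  # the accumulated side broadcasts up to l
            continue
        # Report the two values that actually conflict at this step.
        raise ValueError(f"broadcast length mismatch: {result_len} vs {l}")
    return result_len
```

For `broadcast_len(1, [3, 4])` the conflict is between the accumulated length 3 and the current length 4. A message built from input_len and the whole others slice would instead read "1 vs [3, 4]", which does not identify the conflicting pair.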


Development

Successfully merging this pull request may close these issues.

feat: Add Spark-compatible string functions: find_in_set, levenshtein, overlay, url_decode, url_encode
