feat(string): add Spark-compatible find_in_set, overlay, url_encode, url_decode and levenshtein alias by XuQianJin-Stars · Pull Request #7138 · Eventual-Inc/Daft

XuQianJin-Stars · 2026-06-17T04:13:21Z

Summary

Adds five Spark-compatible string functions to close the API parity gap tracked in #7137 and SPARK_FUNCTION_COMPARISON.md.

- find_in_set(str, str_array): 1-based comma-separated lookup; returns 0 when not found or when str contains a comma; null-safe (NULL in → NULL out).
- overlay(input, replace, pos[, len]): replace substring starting at 1-based pos with replace; len < 0 or omitted falls back to char_length(replace); pos < 1 is clamped to 1; pos beyond the string length appends. Works for both Utf8 and Binary.
- url_encode / url_decode: application/x-www-form-urlencoded semantics matching Spark (space ↔ +, %XX percent-encoding); UTF-8 round-trip safe; url_decode errors on malformed %-sequences (Spark-compatible).
- levenshtein: register a Spark-style alias for the existing levenshtein_distance so Spark / Spark-Connect SQL works out of the box.
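
The find_in_set and overlay rules above can be sketched as a small pure-Python reference model. This is not the Rust implementation in this PR; the function names and signatures simply mirror the description, and only the Utf8 case is modelled (a negative `length` default stands in for "len omitted").

```python
from typing import Optional


def find_in_set(needle: Optional[str], str_array: Optional[str]) -> Optional[int]:
    """1-based index of needle in a comma-separated list, 0 if absent."""
    if needle is None or str_array is None:
        return None  # NULL in -> NULL out
    if "," in needle:
        return 0  # a needle containing a comma can never equal one element
    for index, element in enumerate(str_array.split(","), start=1):
        if element == needle:
            return index
    return 0


def overlay(
    value: Optional[str],
    replace: Optional[str],
    pos: Optional[int],
    length: int = -1,
) -> Optional[str]:
    """Replace `length` characters of `value`, starting at 1-based `pos`."""
    if value is None or replace is None or pos is None:
        return None
    if length < 0:
        length = len(replace)  # omitted or negative: char length of replace
    start = max(pos, 1) - 1  # clamp pos < 1 to 1, then convert to 0-based
    # Python slices index by code point, so multi-byte characters stay intact,
    # and slicing past the end yields "", which gives the "append" behaviour.
    return value[:start] + replace + value[start + length:]


print(find_in_set("b", "a,b,c"))     # 2
print(overlay("Spark SQL", "_", 6))  # Spark_SQL
print(overlay("abc", "X", 10))       # abcX
```

A model like this is mainly useful as an oracle when writing edge-case tests for pos and len.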

Changes Made

Implementation lives in daft-functions-utf8 and is wired through Rust → SQL → Python so it is available from DataFrame, SQL, and Spark Connect.

Rust core
- src/daft-functions-utf8/src/find_in_set.rs
- src/daft-functions-utf8/src/overlay.rs
- src/daft-functions-utf8/src/url.rs (url_encode, url_decode)
- levenshtein alias added to the existing levenshtein_distance ScalarUDF.
- Vectorized arrow2 paths with explicit NULL handling and UTF-8-safe character indexing.
SQL
- Auto-registered via the Utf8Functions module (add_fn + aliases()), so SELECT find_in_set(...), SELECT overlay(...), SELECT url_encode(...), SELECT url_decode(...), SELECT levenshtein(a, b) all work.
Python API
- New wrappers in daft/functions/str.py, re-exported from daft/functions/__init__.py.
- Available as both top-level functions and Expression.str.* accessors where applicable.
Docs
- SPARK_FUNCTION_COMPARISON.md updated to mark these five as ✅ implemented.

Test Coverage

- 49 Rust unit tests covering NULL propagation, empty inputs, ASCII, multi-byte UTF-8, edge cases for pos / len in overlay, comma-in-needle for find_in_set, + ↔ space and %-decoding round-trips for url_encode/url_decode.
- 19 Python integration tests across DataFrame API and SQL surfaces.
- All tests pass locally; cargo fmt / cargo clippy / pre-commit clean.

Spark Compatibility

Behavior verified against Spark 3.5 reference:

| Function | Spark example | Result |
| --- | --- | --- |
| `find_in_set` | `find_in_set('b', 'a,b,c')` | `2` |
| `find_in_set` | `find_in_set('a,b', 'a,b,c')` | `0` |
| `overlay` | `overlay('Spark SQL' PLACING '_' FROM 6)` | `'Spark_SQL'` |
| `overlay` | `overlay('Spark SQL', 'CORE', 7, 0)` | `'Spark CORESQL'` (len = 0 inserts at pos without removing characters) |
| `url_encode` | `url_encode('a b+c')` | `'a+b%2Bc'` |
| `url_decode` | `url_decode('a+b%2Bc')` | `'a b+c'` |
| `levenshtein` | `levenshtein('kitten', 'sitting')` | `3` |
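
The url_encode, url_decode and levenshtein rows can be cross-checked with the Python standard library. This is only a sanity check under assumptions: `quote_plus` / `unquote_plus` implement form-style encoding but are not the Daft or Spark functions, they may treat edge characters such as `*` and `~` differently, and `unquote_plus` does not raise on malformed %-sequences. The `levenshtein` helper is a textbook edit-distance routine written for this check.

```python
from urllib.parse import quote_plus, unquote_plus


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


print(quote_plus("a b+c"))               # a+b%2Bc
print(unquote_plus("a+b%2Bc"))           # a b+c
print(levenshtein("kitten", "sitting"))  # 3
```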

Related Issues

Closes #7137

…url_decode and levenshtein alias

- find_in_set(str, str_array): 1-based comma-separated lookup, 0 when not found or needle contains comma; null-safe.
- overlay(input, replace, pos[, len]): replace substring starting at 1-based pos with replace; len < 0 or omitted falls back to char length of replace; pos < 1 is clamped to 1; pos beyond string appends.
- url_encode / url_decode: application/x-www-form-urlencoded semantics matching Spark (space <-> +, %XX percent encoding); UTF-8 round-trip safe.
- levenshtein: register Spark-style alias for existing levenshtein_distance.

Coverage:
- Rust core (src/daft-functions-utf8/src/{find_in_set,overlay,url}.rs)
- SQL registration via Utf8Functions module (auto via add_fn + aliases())
- Python API (daft/functions/str.py + daft/functions/__init__.py)
- 49 Rust unit tests + 19 Python integration tests, all passing.

greptile-apps · 2026-06-17T04:17:49Z

Greptile Summary

This PR adds five Spark-compatible string functions (find_in_set, overlay, url_encode, url_decode, and a levenshtein SQL alias) implemented end-to-end in Rust, wired through to SQL and the Python DataFrame API to close the gap tracked in #7137.

- find_in_set and url_encode/url_decode are cleanly implemented with correct null propagation, broadcasting, and UTF-8 safety.
- overlay has a null-propagation bug: when len is supplied as a column with null values, those nulls silently fall back to the "use replace length" code path rather than producing NULL output, diverging from Spark semantics.
- levenshtein alias is a minimal, correct change adding one aliases() override to the existing UDF.

Confidence Score: 4/5

Safe to merge once the overlay null-propagation issue for the len column is addressed; all other new functions behave correctly.

Three of the four new functions (find_in_set, url_encode/url_decode, levenshtein alias) are solid. overlay has a defect in how null values inside the len column are handled — they silently produce computed output instead of propagating null, diverging from documented Spark semantics. No test covers a len column with individual nulls, so the bug would not be caught by the existing suite.

src/daft-functions-utf8/src/overlay.rs — the null-propagation logic for the optional len argument needs a fix.

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/daft-functions-utf8/src/overlay.rs | New overlay scalar UDF; NULL propagation for the optional len argument is broken: null values in a len column silently fall back to replace_char_len instead of propagating null. |
| src/daft-functions-utf8/src/find_in_set.rs | New find_in_set scalar UDF; 1-based lookup, comma-in-needle guard, null propagation, and broadcasting all look correct. |
| src/daft-functions-utf8/src/url.rs | New url_encode / url_decode UDFs; application/x-www-form-urlencoded semantics, percent-encoding boundary checks, and round-trip UTF-8 safety all look correct. |
| src/daft-functions-utf8/src/levenshtein.rs | Adds levenshtein SQL alias to the existing LevenshteinDistance UDF; minimal and correct. |
| src/daft-functions-utf8/src/lib.rs | Registers the four new modules and their UDFs with Utf8Functions; mechanical and correct. |
| daft/functions/str.py | Python wrappers for find_in_set, overlay, url_encode, url_decode with good docstrings; overlay correctly lifts int literals to lit(). |
| daft/functions/__init__.py | New functions re-exported and added to __all__ in correct alphabetical order. |
| tests/functions/test_spark_string_functions.py | 19 integration tests covering the happy path, edge cases, null propagation, and SQL alias; no test exercises a len column with individual null values, which would expose the null-propagation bug. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    PY["Python API\n(daft/functions/str.py)"]
    SQL["SQL Layer\n(Utf8Functions module)"]
    RUST["Rust ScalarUDF\n(daft-functions-utf8)"]

    PY -->|"_call_builtin_scalar_fn(name, ...)"| SQL
    SQL -->|"add_fn / aliases()"| RUST

    subgraph RUST
        FIS["FindInSet\nfind_in_set.rs"]
        OVL["Overlay\noverlay.rs"]
        URL["UrlEncode / UrlDecode\nurl.rs"]
        LEV["LevenshteinDistance\n+ levenshtein alias\nlevenshtein.rs"]
    end

    OVL -->|"compute_overlay()"| CO["char-indexed\nString builder"]
    FIS -->|"compute_find_in_set()"| SPLIT["split(',') iterator"]
    URL -->|"encode_url_form()"| ENC["percent-encode UTF-8 bytes"]
    URL -->|"decode_url_form()"| DEC["percent-decode bytes to UTF-8"]

Comments Outside Diff (1)

src/daft-functions-utf8/src/overlay.rs, lines 578-609

NULL propagation broken for len argument

When len is explicitly provided as a column that contains null values, those nulls should propagate to the output (Spark semantics: overlay('abc', 'X', 1, NULL) → NULL). Two separate gaps allow nulls in len to silently fall through to the "omitted len" code path:
1. any_full_null never checks len_series, so an all-null len column skips the full-null short-circuit and reaches the computation loop.
2. The mapping closure matches only on (i, r, p) — l is unconditionally passed through regardless of nullity. When l = None (a null row in the len column) the call becomes compute_overlay(i, r, p, None), which silently uses replace_char_len instead of returning None.
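
The second gap can be illustrated with a small Python model of the per-row mapping. Everything here is a hypothetical stand-in rather than the code in overlay.rs: `compute_overlay` mimics the helper named above, and the `len_supplied` flag is an assumed way of distinguishing "len argument omitted" from "len column present but NULL in this row".

```python
from typing import Optional


def compute_overlay(value: str, replace: str, pos: int, length: Optional[int]) -> str:
    """Hypothetical stand-in for the Rust helper; None means "len was omitted"."""
    if length is None or length < 0:
        length = len(replace)
    start = max(pos, 1) - 1
    return value[:start] + replace + value[start + length:]


def overlay_row_buggy(i, r, p, l):
    # Only (i, r, p) are checked; a NULL l is forwarded and read as "omitted".
    if i is None or r is None or p is None:
        return None
    return compute_overlay(i, r, p, l)


def overlay_row_fixed(i, r, p, l, len_supplied: bool):
    # When a len column was supplied, a NULL in that row must yield NULL.
    if i is None or r is None or p is None:
        return None
    if len_supplied and l is None:
        return None
    return compute_overlay(i, r, p, l)


print(overlay_row_buggy("abc", "X", 1, None))                     # Xbc
print(overlay_row_fixed("abc", "X", 1, None, len_supplied=True))  # None
```

A regression test along these lines, with a len column containing individual nulls, would cover the case the existing suite misses.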

greptile-apps · 2026-06-17T04:17:54Z

+            return Err(format!(
+                "broadcast length mismatch: {input_len} vs {others:?}"
+            ));


The error message reports input_len and the whole others slice, but the mismatch is actually between the accumulated result_len and the current l. This makes the diagnostic confusing.

Suggested change

-            return Err(format!(
-                "broadcast length mismatch: {input_len} vs {others:?}"
-            ));
+            return Err(format!(
+                "broadcast length mismatch: {result_len} vs {l}"
+            ));
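
To see why reporting result_len and l is clearer, here is a rough Python model of a broadcast-length check. It assumes the usual rule that length-1 inputs broadcast to any length; the function name and loop shape are illustrative and not taken from the Rust source.

```python
from typing import Iterable


def broadcast_len(input_len: int, others: Iterable[int]) -> int:
    """Resolve a common row count; length-1 inputs broadcast to any length."""
    result_len = input_len
    for l in others:
        if l == result_len or l == 1:
            continue
        if result_len == 1:
            result_len = l  # adopt the first non-unit length seen
            continue
        # Report the two lengths that actually disagree.
        raise ValueError(f"broadcast length mismatch: {result_len} vs {l}")
    return result_len


print(broadcast_len(1, [3, 1, 3]))  # 3
try:
    broadcast_len(1, [3, 2])
except ValueError as exc:
    print(exc)  # broadcast length mismatch: 3 vs 2
```

With inputs of length 1, 3 and 2, the conflict is between 3 and 2; a message built from input_len and the whole others slice would not point at the conflicting pair.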

github-actions Bot added the feat label Jun 17, 2026

greptile-apps Bot reviewed Jun 17, 2026

fix: address review comments for url_encode doctest and overlay error message

83f6325

XuQianJin-Stars wants to merge 2 commits into Eventual-Inc:main from XuQianJin-Stars:feat/spark-string-functions
