PERF: vectorize is_range_indexer and is_sequence_range scans #65922

Open
jbrockmendel wants to merge 2 commits into pandas-dev:main from jbrockmendel:perf-unroll-2

Conversation

@jbrockmendel

Member

Split out from GH-65298 (the has_sentinel half stays there).

4x-unrolls the inner scan loops of is_range_indexer and is_sequence_range in lib.pyx, combining lanes with bitwise | (not or) so the compiler can emit vectorized comparisons, with a scalar tail loop for the remainder. This mirrors the existing has_nans/all_nans unrolling (GH-65192).
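
To make the transformation concrete, here is a minimal C sketch of the scalar and 4x-unrolled forms of an `is_range_indexer`-style scan. It is not the code in lib.pyx: the function names and signatures are hypothetical, only int64 is covered, and the length-mismatch check that precedes the scan is omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: true iff arr[i] == i for every i in [0, n).
   Hypothetical name; stands in for the pre-unroll loop. */
bool is_range_indexer_scalar(const int64_t *arr, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (arr[i] != (int64_t)i) {
            return false;
        }
    }
    return true;
}

/* 4x-unrolled variant (hypothetical name). The four comparisons in each
   block are combined with bitwise |, so all of them are evaluated
   unconditionally and there is a single branch per block. */
bool is_range_indexer_unrolled(const int64_t *arr, size_t n) {
    size_t i = 0;
    const size_t n4 = n - (n % 4); /* largest multiple of 4 that is <= n */

    for (; i < n4; i += 4) {
        int mismatch = (arr[i] != (int64_t)i)
                     | (arr[i + 1] != (int64_t)(i + 1))
                     | (arr[i + 2] != (int64_t)(i + 2))
                     | (arr[i + 3] != (int64_t)(i + 3));
        if (mismatch) {
            return false;
        }
    }

    /* Scalar tail loop for the remaining n % 4 elements. */
    for (; i < n; i++) {
        if (arr[i] != (int64_t)i) {
            return false;
        }
    }
    return true;
}
```

With logical `||` instead of `|`, each comparison would be allowed to short-circuit, which introduces a branch per lane and makes it harder for the compiler to emit a single vector compare for the block.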

has_infs was also unrolled in GH-65298 but is dropped here — benchmarking showed it flat (clang already vectorizes the float-compare loop, and it's memory-bound at scale), so the extra code wasn't earning its keep.

Benchmarks

Both the old (scalar) and new (unrolled) variants were compiled into one extension with pandas' build flags (-O3 -std=c17 -DNDEBUG, baseline NEON on Apple Silicon) and timed in-process, best-of-9. Results for the common full-scan path (array is a range / sequence):

is_range_indexer (int64 — the no-op-indexer fast path in take/reset_index/merge/reindex):

| n | old | new | speedup |
|---:|---:|---:|---:|
| 1,000 | 642 ns | 503 ns | 1.28x |
| 10,000 | 2.96 µs | 1.75 µs | 1.69x |
| 100,000 | 26.3 µs | 14.0 µs | 1.87x |
| 1,000,000 | 260 µs | 139 µs | 1.87x |
| 10,000,000 | 2.63 ms | 1.40 ms | 1.87x |

is_sequence_range (int64, step=3 — Index construction):

| n | old | new | speedup |
|---:|---:|---:|---:|
| 1,000 | 505 ns | 379 ns | 1.33x |
| 10,000 | 2.94 µs | 1.78 µs | 1.65x |
| 100,000 | 27.2 µs | 15.7 µs | 1.74x |
| 1,000,000 | 272 µs | 157 µs | 1.74x |
| 10,000,000 | 2.71 ms | 1.56 ms | 1.74x |

Early-exit (short-circuit) path: 0.99–1.02x across all sizes for both — the unroll doesn't cost anything when the scan breaks early.
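
One plausible reading of the flat early-exit numbers is that the unrolled loop still tests for a mismatch once per block of four, so a break near the start returns after the first block. The C sketch below illustrates this for an `is_sequence_range`-style scan. The function name and signature are hypothetical, the `step == 0` and empty-input handling are assumptions rather than a copy of lib.pyx, and the arithmetic is assumed not to overflow int64.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True iff seq[i] == seq[0] + i * step for every i in [1, n).
   Hypothetical sketch; assumes first + i * step never overflows int64. */
bool is_sequence_range_unrolled(const int64_t *seq, size_t n, int64_t step) {
    if (step == 0) {
        return false; /* assumption: a zero step is never treated as a range */
    }
    if (n == 0) {
        return true;  /* assumption: an empty sequence is trivially a range */
    }

    const int64_t first = seq[0];
    size_t i = 1;

    /* Blocks of four; the mismatch test runs once per block, so a break at
       position p is detected by the end of the block that contains p. */
    for (; i + 4 <= n; i += 4) {
        const int64_t expected = first + (int64_t)i * step;
        int mismatch = (seq[i] != expected)
                     | (seq[i + 1] != expected + step)
                     | (seq[i + 2] != expected + 2 * step)
                     | (seq[i + 3] != expected + 3 * step);
        if (mismatch) {
            return false;
        }
    }

    /* Scalar tail loop for the remaining elements. */
    for (; i < n; i++) {
        if (seq[i] != first + (int64_t)i * step) {
            return false;
        }
    }
    return true;
}
```

In this sketch a mismatch costs at most three extra comparisons beyond the scalar loop before the function returns, which is consistent with the early-exit path showing no measurable change.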

Test plan

  • Exhaustive correctness across int8/16/32/64, every break position, steps {1, 2, -1, 3}, plus length-mismatch / empty edge cases
  • pandas/tests/libs/test_lib.py + downstream suites (ranges, sort_values, reset_index, merge, strings)
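
The "every break position" check can be sketched in C as follows. This is an illustration of the idea rather than the PR's actual test code: it covers only int64 and the `is_range_indexer`-style predicate, and `scalar_scan`, `unrolled_scan`, `check_all_break_positions`, and `MAX_N` are hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_N 32 /* hypothetical upper bound on the lengths exercised */

/* Scalar reference scan: true iff arr[i] == i for all i in [0, n). */
bool scalar_scan(const int64_t *arr, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (arr[i] != (int64_t)i) {
            return false;
        }
    }
    return true;
}

/* 4x-unrolled scan with a scalar tail, lanes combined with bitwise |. */
bool unrolled_scan(const int64_t *arr, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        int mismatch = (arr[i] != (int64_t)i)
                     | (arr[i + 1] != (int64_t)(i + 1))
                     | (arr[i + 2] != (int64_t)(i + 2))
                     | (arr[i + 3] != (int64_t)(i + 3));
        if (mismatch) {
            return false;
        }
    }
    for (; i < n; i++) {
        if (arr[i] != (int64_t)i) {
            return false;
        }
    }
    return true;
}

/* For every length 0..MAX_N, check the intact range, then corrupt each
   position in turn and check that both scans reject it.
   Returns the number of failing cases; 0 means everything passed. */
int check_all_break_positions(void) {
    int64_t buf[MAX_N];
    int failures = 0;

    for (size_t n = 0; n <= MAX_N; n++) {
        for (size_t k = 0; k < n; k++) {
            buf[k] = (int64_t)k;
        }
        if (!scalar_scan(buf, n) || !unrolled_scan(buf, n)) {
            failures++; /* an intact range must be accepted */
        }
        for (size_t pos = 0; pos < n; pos++) {
            buf[pos] = -1; /* -1 never equals a valid index */
            if (scalar_scan(buf, n) || unrolled_scan(buf, n)) {
                failures++; /* a broken range must be rejected */
            }
            buf[pos] = (int64_t)pos; /* restore before the next position */
        }
    }
    return failures;
}
```

Sweeping every length up to a small bound exercises each possible tail size (0 to 3 leftover elements), and sweeping every break position covers mismatches in each lane of a block as well as in the tail.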

No whatsnew entry, following the has_nans/all_nans precedent (GH-65192) — these are internal scan helpers and the per-op effect on end-user timings is below the noise floor.

🤖 Generated with Claude Code

4x-unroll the inner scan loops of is_range_indexer and is_sequence_range
in lib.pyx, combining lanes with bitwise | (not `or`) so the compiler can
emit vectorized comparisons, with a scalar tail for the remainder. Mirrors
the existing has_nans/all_nans unrolling (GH-65192).

On the common full-scan path (array is a range / sequence) this is up to
~1.9x faster for is_range_indexer and ~1.7x for is_sequence_range at 100k+
elements, with no regression on the early-exit short-circuit path.

Split out from GH-65298.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
jbrockmendel added the Performance (Memory or execution speed performance) label on Jun 21, 2026