perf: hash join probe loop has iterator overhead and per-row Growable extends by QLiangong · Pull Request #7098 · Eventual-Inc/Daft

QLiangong · 2026-06-10T02:45:53Z

Changes Made

This PR introduces a fan-out-aware vectorized RecordBatch::take path
in probe_inner (src/daft-local-execution/src/join/inner_join.rs)
to replace the per-row GrowableRecordBatch::extend virtual call when
fan-out is high enough to make it pay off. Only inner_join.rs is
modified; no public API changes.

Implementation

Materialize matches up front. probe_inner now walks
probe_indices() once and collects all matches into a local
Vec<(probe_idx, build_rb_idx, build_row_idx)>, preallocated with
Vec::with_capacity(input_table.len()). This gives us the total
match count for capacity hints and lets the take path bucket the
matches by build table.
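
A minimal sketch of the materialization step, using plain `std` types in place of Daft's probe iterator. The `Match` alias, the `materialize_matches` name, and the slice-of-`Option` input shape are assumptions made for illustration, not the PR's actual signatures.

```rust
/// One match: (probe row index, build RecordBatch index, build row index).
type Match = (u64, u64, u64);

/// Flattens per-probe-row hits into one vector of matches.
/// `hits_per_probe_row[i]` is `None` when probe row `i` found no build rows.
fn materialize_matches(hits_per_probe_row: &[Option<Vec<(u64, u64)>>]) -> Vec<Match> {
    // Starting capacity hint: one slot per probe row, as described in the PR.
    // The vector still grows if the fan-out is greater than one.
    let mut matches = Vec::with_capacity(hits_per_probe_row.len());
    for (probe_idx, hits) in hits_per_probe_row.iter().enumerate() {
        if let Some(hits) = hits {
            for &(build_rb_idx, build_row_idx) in hits {
                matches.push((probe_idx as u64, build_rb_idx, build_row_idx));
            }
        }
    }
    matches
}
```

After this single pass, `matches.len()` is the total match count that the later capacity hints and the fan-out check rely on.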
Fan-out heuristic. Two module-level constants gate the new path:
- TAKE_BATCH_MIN_MATCHES = 1024 — minimum total matches before
  the take path is considered; below this, setup cost dominates.
- TAKE_BATCH_MIN_FANOUT = 3.0 — minimum
  matches.len() / probe_rows ratio before the take path is
  preferred; below this, take + concat overhead exceeds the
  per-row extend cost.
Both are placeholder values pending a Rust-level micro-benchmark
over (probe_rows, fan_out, num_build_partitions), modeled on
src/daft-recordbatch/src/ops/bench_agg.rs.
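
The gate described above could look roughly like the following. The constant names and values come from the PR description; the `use_take_path` function name and the explicit guard for an empty probe table are assumptions.

```rust
/// Minimum total matches before the take path is considered (placeholder value).
const TAKE_BATCH_MIN_MATCHES: usize = 1024;
/// Minimum matches-per-probe-row ratio before the take path is preferred (placeholder value).
const TAKE_BATCH_MIN_FANOUT: f64 = 3.0;

/// Returns true only when both thresholds are met.
fn use_take_path(num_matches: usize, probe_rows: usize) -> bool {
    // Guard against dividing by zero on an empty probe table (assumed behavior).
    if probe_rows == 0 || num_matches < TAKE_BATCH_MIN_MATCHES {
        return false;
    }
    (num_matches as f64) / (probe_rows as f64) >= TAKE_BATCH_MIN_FANOUT
}
```

With these values, 4096 matches over 1000 probe rows selects the take path, while 2048 matches over 1024 probe rows (fan-out 2.0) stays on the growable path.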
Take-based build path (new, used when both thresholds met):
1. Bucket matches by build_rb_idx into one Vec<u64> per build
  RecordBatch.
2. For each non-empty bucket, build a UInt64Array and call
  RecordBatch::take.
3. RecordBatch::concat the per-table results into the final
  build side.
4. The probe side is built with a single RecordBatch::take over
  the collected probe_idx column.
This replaces N virtual GrowableRecordBatch::extend(_, _, 1)
calls (one per match) with one vectorized Arrow kernel call per build table.
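
As a rough illustration of steps 1 to 3, the sketch below models each build RecordBatch as a plain `Vec<T>` of rows. The helper names (`bucket_by_build_table`, `take`, `take_and_concat`) are hypothetical stand-ins; the real code calls `RecordBatch::take` and `RecordBatch::concat`.

```rust
/// Step 1: group build row indices by the build table they belong to.
fn bucket_by_build_table(matches: &[(u64, u64, u64)], num_build_tables: usize) -> Vec<Vec<u64>> {
    let mut buckets = vec![Vec::new(); num_build_tables];
    for &(_probe_idx, build_rb_idx, build_row_idx) in matches {
        buckets[build_rb_idx as usize].push(build_row_idx);
    }
    buckets
}

/// Stand-in for a vectorized take: gather rows at the given indices.
fn take<T: Clone>(table: &[T], indices: &[u64]) -> Vec<T> {
    indices.iter().map(|&i| table[i as usize].clone()).collect()
}

/// Steps 2 and 3: one take per non-empty bucket, concatenated in table order.
fn take_and_concat<T: Clone>(tables: &[Vec<T>], buckets: &[Vec<u64>]) -> Vec<T> {
    let mut out = Vec::with_capacity(buckets.iter().map(Vec::len).sum::<usize>());
    for (table, indices) in tables.iter().zip(buckets) {
        if !indices.is_empty() {
            out.extend(take(table, indices));
        }
    }
    out
}
```

Note that the concatenated output is grouped by build table, not by original match order, so any probe-side index list has to follow the same grouping for the two sides to line up row for row.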
Growable build path (preserved, used when below thresholds):
same algorithm as the original probe_inner, but now consumes the
materialized match vector and preallocates the GrowableRecordBatch
and probe_side_idxs with matches_len.max(DEFAULT_GROWABLE_SIZE)
capacity instead of the previous fixed DEFAULT_GROWABLE_SIZE = 20.
This removes a few reallocations on join-heavy queries even when
the take path is not selected.
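
The capacity rule is small enough to show directly. This is a sketch on a plain `Vec<u64>`; the two function names are hypothetical, and only `DEFAULT_GROWABLE_SIZE = 20` and the `max` expression come from the description.

```rust
/// Previous fixed capacity, per the PR description.
const DEFAULT_GROWABLE_SIZE: usize = 20;

/// Capacity hint: the known match count, but never below the old default.
fn growable_capacity(matches_len: usize) -> usize {
    matches_len.max(DEFAULT_GROWABLE_SIZE)
}

/// Preallocates the probe-side index buffer so that pushing `matches_len`
/// entries does not trigger a reallocation.
fn preallocated_probe_idxs(matches_len: usize) -> Vec<u64> {
    Vec::with_capacity(growable_capacity(matches_len))
}
```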
build_final_table helper. Both paths produce
(left_table, right_table) and share the same column-rearrangement
step (compute common_join_keys / left_non_join_columns /
right_non_join_columns, run get_columns_by_name, then union).
Extracting this into a helper removes ~25 lines of duplication
between the two paths.
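
To make the column-rearrangement step concrete, here is a sketch that works on column names only. It assumes join keys share the same name on both sides and are emitted once, ahead of the left and right non-join columns; the function name and that ordering are assumptions, not confirmed details of `build_final_table`.

```rust
/// Computes the output column order for an inner join on `join_keys`.
fn final_column_order(left: &[&str], right: &[&str], join_keys: &[&str]) -> Vec<String> {
    // Join keys are emitted once (assumed to be named identically on both sides).
    let common_join_keys = join_keys.iter().copied();
    // Everything else keeps its original relative order within each side.
    let left_non_join = left.iter().copied().filter(|c| !join_keys.contains(c));
    let right_non_join = right.iter().copied().filter(|c| !join_keys.contains(c));
    common_join_keys
        .chain(left_non_join)
        .chain(right_non_join)
        .map(String::from)
        .collect()
}
```

Because both build paths end with the same `(left_table, right_table)` pair, a single helper of this shape can serve either one.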

Related Issues

Closes #7076

greptile-apps · 2026-06-10T02:49:05Z

Greptile Summary

This PR optimizes probe_inner by adding a fan-out–aware dispatch: for large match sets (≥ 1024 matches, ≥ 3.0 matches/probe-row), it buckets build-side indices by RecordBatch index and calls vectorized take + concat instead of iterating through GrowableRecordBatch::extend one row at a time. It also extracts the final column-assembly logic into a shared build_final_table helper.

- Correctness bug in the batch path: build-side rows are concatenated in rb_idx order, but probe_indices is kept in the original interleaved match order; when matches come from more than one build RecordBatch, the two sides are misaligned and the join outputs wrong row pairings.
- Wrong schema for the empty build-side fallback: the build_side_tables.is_empty() branch creates an empty table with output_schema instead of the build-side schema, causing a schema-mismatch error in build_final_table.
- probe_idx is narrowed from usize to u32, risking truncation for input tables larger than ~4 billion rows.

Confidence Score: 1/5

Not safe to merge: the batch code path produces silently incorrect join output whenever matches span more than one build-side RecordBatch.

The new batch path decouples the build-side row order (grouped by RecordBatch index) from the probe-side row order (original match order), causing row misalignment in the most common multi-batch build scenarios. The join would return wrong data without any error.

src/daft-local-execution/src/join/inner_join.rs — specifically the batch path from line 88 onward.

Important Files Changed

| Filename | Overview |
|---|---|
| src/daft-local-execution/src/join/inner_join.rs | Adds a vectorized batch take code path for high fan-out inner joins, but the batch path misaligns probe and build rows when matches span multiple build RecordBatches, producing silently wrong join results. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[probe_inner called] --> B[Collect all matches into flat Vec]
    B --> C{matches_len >= 1024 AND fanout >= 3.0?}
    C -- YES: batch path --> D[Split into probe_indices and per_table_indices by rb_idx]
    D --> E[take each build table by its bucket indices]
    E --> F[concat build tables in rb_idx order]
    F --> G[take probe table by probe_indices in match order]
    G --> H[MISALIGNED rows]
    H --> I[build_final_table]
    C -- NO: growable path --> J[GrowableRecordBatch.extend per match]
    J --> K[build_side_growable.build]
    K --> L[take probe table by probe_side_idxs]
    L --> M[Rows correctly aligned]
    M --> I
    I --> N[MicroPartition output]

Reviews (1): Last reviewed commit: "probe_inner"

greptile-apps · 2026-06-10T02:49:09Z

+            let (left_table, right_table) = if use_batch {
+                let build_tables = probe_state.get_record_batches();
+                let mut probe_indices = Vec::with_capacity(matches_len);
+                let mut per_table_indices = vec![Vec::new(); build_tables.len()];
+
+                for (probe_idx, rb_idx, row_idx) in matches {
+                    probe_indices.push(probe_idx as u64);
+                    per_table_indices[rb_idx as usize].push(row_idx);
+                }

-            let (left_table, right_table) = if params.build_on_left {
-                (build_side_table, probe_side_table)
+                let mut build_side_tables = Vec::new();
+                for (rb_idx, indices) in per_table_indices.into_iter().enumerate() {
+                    if !indices.is_empty() {
+                        let idx_arr = UInt64Array::from_vec("", indices);
+                        let taken = build_tables[rb_idx].take(&idx_arr)?;
+                        build_side_tables.push(taken);
+                    }
+                }
+
+                let build_side_table = if build_side_tables.is_empty() {
+                    daft_recordbatch::RecordBatch::empty(Some(params.output_schema.clone()))
+                } else if build_side_tables.len() == 1 {
+                    build_side_tables.pop().unwrap()
+                } else {
+                    daft_recordbatch::RecordBatch::concat(&build_side_tables)?
+                };


Build/probe row alignment broken for multi-table build sides

In the batch path, probe_indices is filled in the original match order (interleaved across rb_idx values), but the build-side rows are concatenated in table-index order (rb_idx=0 first, then rb_idx=1, etc.). When the build side spans more than one RecordBatch, the two sides no longer correspond row-for-row and the join emits silently wrong results.

Concrete failure: suppose matches = [(probe=5, rb=1, row=10), (probe=3, rb=0, row=7), (probe=8, rb=1, row=2)]. probe_indices = [5, 3, 8]; the concatenated build side is [table[0][7], table[1][10], table[1][2]]. The final join pairs probe[5] with table[0][7] instead of table[1][10], and probe[3] with table[1][10] instead of table[0][7].

The fix is to sort matches by rb_idx before splitting into probe_indices and per_table_indices, so that probe_indices is accumulated in the same grouped order that the build-side concat produces.
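
A sketch of the suggested fix on plain vectors, assuming matches are `(probe_idx, rb_idx, row_idx)` tuples of `u64`. The `split_aligned` name is hypothetical. A stable sort is used so that matches within the same build table keep their original relative order.

```rust
/// Sorts matches by build table, then splits them so that `probe_indices`
/// follows the same grouped order that a per-table take + concat produces.
fn split_aligned(
    mut matches: Vec<(u64, u64, u64)>,
    num_build_tables: usize,
) -> (Vec<u64>, Vec<Vec<u64>>) {
    // `sort_by_key` is stable, so ties on rb_idx keep their match order.
    matches.sort_by_key(|&(_, rb_idx, _)| rb_idx);

    let mut probe_indices = Vec::with_capacity(matches.len());
    let mut per_table_indices = vec![Vec::new(); num_build_tables];
    for (probe_idx, rb_idx, row_idx) in matches {
        probe_indices.push(probe_idx);
        per_table_indices[rb_idx as usize].push(row_idx);
    }
    (probe_indices, per_table_indices)
}
```

On the failing example above, this yields `probe_indices = [3, 5, 8]` and buckets `[[7], [10, 2]]`, so probe[3] pairs with table[0][7], probe[5] with table[1][10], and probe[8] with table[1][2]. One consequence worth keeping in mind is that the output rows are then grouped by build table rather than emitted in probe order.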

greptile-apps · 2026-06-10T02:49:10Z

+                let build_side_table = if build_side_tables.is_empty() {
+                    daft_recordbatch::RecordBatch::empty(Some(params.output_schema.clone()))


Wrong schema used for the empty build-side fallback

RecordBatch::empty(Some(params.output_schema.clone())) creates a table whose schema is the full output schema, but this table is used as the build-side fragment before build_final_table tries to extract common_join_keys and left_non_join_columns from it. The empty table should be created with the build-side schema: left_schema when params.build_on_left is true, right_schema otherwise.
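
The selection rule can be sketched as follows. The `JoinParams` struct here is a hypothetical stand-in with schemas modeled as lists of column names; where `left_schema` and `right_schema` actually live in the Daft code is not shown in this diff.

```rust
/// Hypothetical stand-in for the join parameters.
struct JoinParams {
    build_on_left: bool,
    left_schema: Vec<String>,
    right_schema: Vec<String>,
}

/// Picks the schema the empty build-side table should carry:
/// the left schema when building on the left, otherwise the right schema.
fn build_side_schema(params: &JoinParams) -> &[String] {
    if params.build_on_left {
        &params.left_schema
    } else {
        &params.right_schema
    }
}
```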

greptile-apps · 2026-06-10T02:49:11Z

+            let mut matches: Vec<(u32, u64, u64)> = Vec::with_capacity(input_table.len());
+            for (probe_idx, inner_iter) in idx_iter.enumerate() {
                if let Some(inner_iter) = inner_iter {
-                    for (build_rb_idx, build_row_idx) in inner_iter {
-                        build_side_growable.extend(
-                            build_rb_idx as usize,
-                            build_row_idx as usize,
-                            1,
-                        );
-                        probe_side_idxs.push(probe_row_idx as u64);
+                    for (rb_idx, row_idx) in inner_iter {
+                        matches.push((probe_idx as u32, rb_idx as u64, row_idx));


probe_idx silently truncates on large input tables

idx_iter.enumerate() yields a usize counter cast to u32 before being stored in matches. On a 64-bit platform, any input table with more than u32::MAX (~4 billion) rows will silently produce wrong probe-side indices. Storing probe_idx as u64 would be consistent with row_idx and avoids the overflow.
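
The truncation is easy to reproduce in isolation. The sketch below uses `u64` inputs so it behaves the same on any platform; the two function names are hypothetical. An `as` cast silently keeps only the low 32 bits, while `u32::try_from` reports the overflow.

```rust
use std::convert::TryFrom;

/// Narrowing with `as` wraps silently: only the low 32 bits survive.
fn narrow_lossy(idx: u64) -> u32 {
    idx as u32
}

/// Checked narrowing: returns `None` when the index does not fit in a u32.
fn narrow_checked(idx: u64) -> Option<u32> {
    u32::try_from(idx).ok()
}
```

For example, index `u32::MAX as u64 + 5` becomes `4` under `narrow_lossy`, whereas `narrow_checked` returns `None`. Storing the index as `u64`, as the comment suggests, avoids the narrowing altogether.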

srilman

Overall, this does make sense to me, but I have a couple of thoughts:

- Do you happen to have any performance numbers? It would be helpful to see how much of a benefit this gives.
- In addition, do you happen to have any memory profiles, even a rough measurement of RSS usage over the timeline of a query? I believe we went with the growable approach because it is less likely to make large memory allocations, which could cause OOMs.
- Is there much benefit to maintaining the growable path for cases where there are only 1024 matches? That seems so small that the performance difference would be negligible.

cc @colin-ho on this. I believe we tried something similar before and noticed that there were a lot of OOMs. I think this approach is better, but do you see anything potentially problematic?

srilman · 2026-06-11T17:09:46Z

+            // build_rb_idx for the take-based path. The probe-side hot loop
+            // is a tight `Vec::push`, which the compiler lowers to the same
+            // code as a `flat_map(...).collect()` form.
+            let mut matches: Vec<(u32, u64, u64)> = Vec::with_capacity(input_table.len());


Rather than putting everything into matches and then splitting by build_rb_idx, why not just do the fanout here? That way we can cut the memory usage by half.
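
One possible reading of this suggestion, sketched on plain `std` types: bucket probe and build indices per build table during the probe walk itself, so no intermediate `matches` vector is needed. The `fan_out_direct` name and the input shape are assumptions, and this is an interpretation of the comment rather than code from the PR.

```rust
/// Walks the probe hits once and fans out directly into per-build-table buckets.
/// Returns (probe index buckets, build row index buckets), both indexed by rb_idx.
fn fan_out_direct(
    hits_per_probe_row: &[Option<Vec<(u64, u64)>>],
    num_build_tables: usize,
) -> (Vec<Vec<u64>>, Vec<Vec<u64>>) {
    let mut probe_buckets: Vec<Vec<u64>> = vec![Vec::new(); num_build_tables];
    let mut build_buckets: Vec<Vec<u64>> = vec![Vec::new(); num_build_tables];
    for (probe_idx, hits) in hits_per_probe_row.iter().enumerate() {
        if let Some(hits) = hits {
            for &(rb_idx, row_idx) in hits {
                // Pushing both sides into the same bucket keeps them aligned.
                probe_buckets[rb_idx as usize].push(probe_idx as u64);
                build_buckets[rb_idx as usize].push(row_idx);
            }
        }
    }
    (probe_buckets, build_buckets)
}
```

Flattening `probe_buckets` in table order gives a probe index list that lines up with a build side concatenated in the same order, which also sidesteps the alignment problem without a separate sort.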

probe_inner

72bf235

QLiangong requested a review from a team as a code owner June 10, 2026 02:45

QLiangong changed the title ~~probe_inner~~ perf: hash join probe loop has iterator overhead and per-row Growable extends Jun 10, 2026

github-actions Bot added the perf label Jun 10, 2026

greptile-apps Bot reviewed Jun 10, 2026

rohitkulshreshtha requested a review from srilman June 11, 2026 16:26

srilman reviewed Jun 11, 2026

probe_inner_fix

69ae766

srilman mentioned this pull request Jun 22, 2026

perf(join): vectorize high-fanout inner join build takes #7143

Open

QLiangong wants to merge 2 commits into Eventual-Inc:main from QLiangong:new
