Skip to content

perf: hash join probe loop has iterator overhead and per-row Growable extends#7098

Open
QLiangong wants to merge 2 commits into
Eventual-Inc:mainfrom
QLiangong:new
Open

perf: hash join probe loop has iterator overhead and per-row Growable extends#7098
QLiangong wants to merge 2 commits into
Eventual-Inc:mainfrom
QLiangong:new

Conversation

@QLiangong

@QLiangong QLiangong commented Jun 10, 2026

Copy link
Copy Markdown

Changes Made

This PR introduces a fan-out-aware vectorized RecordBatch::take path
in probe_inner (src/daft-local-execution/src/join/inner_join.rs)
to replace the per-row GrowableRecordBatch::extend virtual call when
fan-out is high enough to make it pay off. Only inner_join.rs is
modified; no public API changes.

Implementation

  • Materialize matches up front. probe_inner now walks
    probe_indices() once and collects all matches into a local
    Vec<(probe_idx, build_rb_idx, build_row_idx)>, preallocated with
    Vec::with_capacity(input_table.len()). This gives us the total
    match count for capacity hints and lets the take path bucket the
    matches by build table.

  • Fan-out heuristic. Two module-level constants gate the new path:

    • TAKE_BATCH_MIN_MATCHES = 1024 — minimum total matches before
      the take path is considered; below this, setup cost dominates.
    • TAKE_BATCH_MIN_FANOUT = 3.0 — minimum
      matches.len() / probe_rows ratio before the take path is
      preferred; below this, take + concat overhead exceeds the
      per-row extend cost.

    Both are placeholder values pending a Rust-level micro-benchmark
    over (probe_rows, fan_out, num_build_partitions), modeled on
    src/daft-recordbatch/src/ops/bench_agg.rs.

  • Take-based build path (new, used when both thresholds met):

    1. Bucket matches by build_rb_idx into one Vec<u64> per build
      RecordBatch.
    2. For each non-empty bucket, build a UInt64Array and call
      RecordBatch::take.
    3. RecordBatch::concat the per-table results into the final
      build side.
    4. The probe side is built with a single RecordBatch::take over
      the collected probe_idx column.

    This replaces N virtual GrowableRecordBatch::extend(_, _, 1)
    calls with one vectorized arrow kernel call per build table.

  • Growable build path (preserved, used when below thresholds):
    same algorithm as the original probe_inner, but now consumes the
    materialized match vector and preallocates the GrowableRecordBatch
    and probe_side_idxs with matches_len.max(DEFAULT_GROWABLE_SIZE)
    capacity instead of the previous fixed DEFAULT_GROWABLE_SIZE = 20.
    This removes a few reallocations on join-heavy queries even when
    the take path is not selected.

  • build_final_table helper. Both paths produce
    (left_table, right_table) and share the same column-rearrangement
    step (compute common_join_keys / left_non_join_columns /
    right_non_join_columns, run get_columns_by_name, then union).
    Extracting this into a helper removes ~25 lines of duplication
    between the two paths.

Related Issues

Closes #7076

@QLiangong QLiangong requested a review from a team as a code owner June 10, 2026 02:45
@QLiangong QLiangong changed the title probe_inner perf: hash join probe loop has iterator overhead and per-row Growable extends Jun 10, 2026
@github-actions github-actions Bot added the perf label Jun 10, 2026
@greptile-apps

greptile-apps Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR optimizes probe_inner by adding a fan-out–aware dispatch: for large match sets (≥ 1024 matches, ≥ 3.0 matches/probe-row), it buckets build-side indices by RecordBatch index and calls vectorized take + concat instead of iterating through GrowableRecordBatch::extend one row at a time. It also extracts the final column-assembly logic into a shared build_final_table helper.

  • Correctness bug in the batch path: build-side rows are concatenated in rb_idx order, but probe_indices is kept in the original interleaved match order; when matches come from more than one build RecordBatch, the two sides are misaligned and the join outputs wrong row pairings.
  • Wrong schema for the empty build-side fallback: the build_side_tables.is_empty() branch creates an empty table with output_schema instead of the build-side schema, causing a schema-mismatch error in build_final_table.
  • probe_idx is narrowed from usize to u32, risking truncation for input tables larger than ~4 billion rows.

Confidence Score: 1/5

Not safe to merge: the batch code path produces silently incorrect join output whenever matches span more than one build-side RecordBatch.

The new batch path decouples the build-side row order (grouped by RecordBatch index) from the probe-side row order (original match order), causing row misalignment in the most common multi-batch build scenarios. The join would return wrong data without any error.

src/daft-local-execution/src/join/inner_join.rs — specifically the batch path from line 88 onward.

Important Files Changed

Filename Overview
src/daft-local-execution/src/join/inner_join.rs Adds a vectorized batch take code path for high fan-out inner joins, but the batch path misaligns probe and build rows when matches span multiple build RecordBatches, producing silently wrong join results.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[probe_inner called] --> B[Collect all matches into flat Vec]
    B --> C{matches_len >= 1024 AND fanout >= 3.0?}
    C -- YES: batch path --> D[Split into probe_indices and per_table_indices by rb_idx]
    D --> E[take each build table by its bucket indices]
    E --> F[concat build tables in rb_idx order]
    F --> G[take probe table by probe_indices in match order]
    G --> H[MISALIGNED rows]
    H --> I[build_final_table]
    C -- NO: growable path --> J[GrowableRecordBatch.extend per match]
    J --> K[build_side_growable.build]
    K --> L[take probe table by probe_side_idxs]
    L --> M[Rows correctly aligned]
    M --> I
    I --> N[MicroPartition output]
Loading

Reviews (1): Last reviewed commit: "probe_inner" | Re-trigger Greptile

Comment on lines +88 to +113
let (left_table, right_table) = if use_batch {
let build_tables = probe_state.get_record_batches();
let mut probe_indices = Vec::with_capacity(matches_len);
let mut per_table_indices = vec![Vec::new(); build_tables.len()];

for (probe_idx, rb_idx, row_idx) in matches {
probe_indices.push(probe_idx as u64);
per_table_indices[rb_idx as usize].push(row_idx);
}

let (left_table, right_table) = if params.build_on_left {
(build_side_table, probe_side_table)
let mut build_side_tables = Vec::new();
for (rb_idx, indices) in per_table_indices.into_iter().enumerate() {
if !indices.is_empty() {
let idx_arr = UInt64Array::from_vec("", indices);
let taken = build_tables[rb_idx].take(&idx_arr)?;
build_side_tables.push(taken);
}
}

let build_side_table = if build_side_tables.is_empty() {
daft_recordbatch::RecordBatch::empty(Some(params.output_schema.clone()))
} else if build_side_tables.len() == 1 {
build_side_tables.pop().unwrap()
} else {
daft_recordbatch::RecordBatch::concat(&build_side_tables)?
};

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Build/probe row alignment broken for multi-table build sides

In the batch path, probe_indices is filled in the original match order (interleaved across rb_idx values), but the build-side rows are concatenated in table-index order (rb_idx=0 first, then rb_idx=1, etc.). When the build side spans more than one RecordBatch, the two sides no longer correspond row-for-row and the join emits silently wrong results.

Concrete failure: suppose matches = [(probe=5, rb=1, row=10), (probe=3, rb=0, row=7), (probe=8, rb=1, row=2)]. probe_indices = [5, 3, 8]; the concatenated build side is [table[0][7], table[1][10], table[1][2]]. The final join pairs probe[5] with table[0][7] instead of table[1][10], and probe[3] with table[1][10] instead of table[0][7].

The fix is to sort matches by rb_idx before splitting into probe_indices and per_table_indices, so that probe_indices is accumulated in the same grouped order that the build-side concat produces.

Comment on lines +107 to +108
let build_side_table = if build_side_tables.is_empty() {
daft_recordbatch::RecordBatch::empty(Some(params.output_schema.clone()))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Wrong schema used for the empty build-side fallback

RecordBatch::empty(Some(params.output_schema.clone())) creates a table whose schema is the full output schema, but this table is used as the build-side fragment before build_final_table tries to extract common_join_keys and left_non_join_columns from it. The empty table should be created with the build-side schema: left_schema when params.build_on_left is true, right_schema otherwise.

Comment on lines +67 to +71
let mut matches: Vec<(u32, u64, u64)> = Vec::with_capacity(input_table.len());
for (probe_idx, inner_iter) in idx_iter.enumerate() {
if let Some(inner_iter) = inner_iter {
for (build_rb_idx, build_row_idx) in inner_iter {
build_side_growable.extend(
build_rb_idx as usize,
build_row_idx as usize,
1,
);
probe_side_idxs.push(probe_row_idx as u64);
for (rb_idx, row_idx) in inner_iter {
matches.push((probe_idx as u32, rb_idx as u64, row_idx));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 probe_idx silently truncates on large input tables

idx_iter.enumerate() yields a usize counter cast to u32 before being stored in matches. On a 64-bit platform, any input table with more than u32::MAX (~4 billion) rows will silently produce wrong probe-side indices. Storing probe_idx as u64 would be consistent with row_idx and avoids the overflow.

@srilman srilman left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, this does make sense to me but I have a couple of thoughts:

  • Do you happen to have any performance numbers? Would be helpful to see how much of a benefit
  • In addition, do you happen to have any memory profiles, even a rough measurement of RSS usage over the timeline of a query? I believe we went with the growable approach because it is less likely to make large memory allocations which could cause OOMs
  • Is there much benefit to maintaining the growable path for cases where there are only 1024 matches? that seems so small that the performance difference would be negligible

cc @colin-ho on this, I believe we tried something similar before and noticed that there were a lot of OOMs. i think this approach is better but do you see anything potentially problematic

// build_rb_idx for the take-based path. The probe-side hot loop
// is a tight `Vec::push`, which the compiler lowers to the same
// code as a `flat_map(...).collect()` form.
let mut matches: Vec<(u32, u64, u64)> = Vec::with_capacity(input_table.len());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than putting everything into matches and then splitting by build_rb_idx, why not just do the fanout here? That way we can cut the memory usage by half

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf: hash join probe loop has iterator overhead and per-row Growable extends

2 participants