
feat(grouped-agg): shard PartitionThenAgg execution per morsel#7079

Open
BABTUNA wants to merge 7 commits into Eventual-Inc:main from BABTUNA:perf/sharded-partition-then-agg

Conversation

@BABTUNA BABTUNA commented Jun 6, 2026

Contributor

Summary

Shard the PartitionThenAgg execution path inside GroupedAggregateSink::sink so that a single large input MicroPartition is sliced by row range and fanned out across 4 concurrent partition_by_hash tasks before the existing flush-on-threshold logic runs. Small inputs continue to run the existing path. The diff touches a single file, src/daft-local-execution/src/sinks/grouped_aggregate.rs. Stacked on #7060.

Why

Continues #6585 item 5. PR #7060 sharded the AggThenPartition strategy (selected for low-cardinality groupbys). This PR extends the same row-range sharding to PartitionThenAgg (selected for high-cardinality groupbys). Both share the same single-threaded partition_by_hash work that benefits from K-way parallelism on large morsels.

Changes Made

  • execute_partition_then_agg becomes async; dispatch in execute_strategy awaits it
  • Branches on input size, reusing the existing SHARD_THRESHOLD = 32_768 and NUM_SHARDS_PER_MORSEL = 4 constants from #7060 (feat(grouped-agg): shard AggThenPartition execution per morsel):
    • < SHARD_THRESHOLD: existing sync logic (unchanged behavior)
    • >= SHARD_THRESHOLD: row-range slice into 4 shards, spawn 4 tokio tasks each running shard.partition_by_hash(group_by, num_slots), instrumented with the parent span. Then merge each shard's slot-i output into slot i of inner_states using the same flush-on-threshold logic as the sync path
  • PartitionOnly left unchanged for a follow-up PR
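
The branching described above can be sketched roughly as follows. This is a dependency-free illustration, not the code in this PR: it swaps tokio tasks for `std::thread::scope`, uses a plain `&[u64]` of group keys in place of a MicroPartition, and the helpers `shard_ranges`, `partition_by_hash` and `partition_morsel` are hypothetical stand-ins. Only the two constant values come from the description.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::ops::Range;
use std::thread;

// Values quoted in the PR description.
const SHARD_THRESHOLD: usize = 32_768;
const NUM_SHARDS_PER_MORSEL: usize = 4;

/// Split `0..len` into at most `k` contiguous, near-equal row ranges (`k` must be > 0).
fn shard_ranges(len: usize, k: usize) -> Vec<Range<usize>> {
    let chunk = ((len + k - 1) / k).max(1);
    (0..len)
        .step_by(chunk)
        .map(|start| start..(start + chunk).min(len))
        .collect()
}

/// Hypothetical stand-in for `partition_by_hash`: bucket keys into `num_slots` by hash.
fn partition_by_hash(keys: &[u64], num_slots: usize) -> Vec<Vec<u64>> {
    let mut slots = vec![Vec::new(); num_slots];
    for &key in keys {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let slot = (hasher.finish() % num_slots as u64) as usize;
        slots[slot].push(key);
    }
    slots
}

/// Small inputs take the sequential path; large inputs are sliced by row
/// range, partitioned concurrently, and merged slot by slot.
fn partition_morsel(keys: &[u64], num_slots: usize) -> Vec<Vec<u64>> {
    if keys.len() < SHARD_THRESHOLD {
        return partition_by_hash(keys, num_slots);
    }
    // Assumption: scoped OS threads stand in for the tokio tasks of the PR.
    let shard_outputs: Vec<Vec<Vec<u64>>> = thread::scope(|scope| {
        let handles: Vec<_> = shard_ranges(keys.len(), NUM_SHARDS_PER_MORSEL)
            .into_iter()
            .map(|range| {
                let shard = &keys[range];
                scope.spawn(move || partition_by_hash(shard, num_slots))
            })
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("shard thread panicked"))
            .collect()
    });
    // Merge each shard's slot-i output into slot i of the combined state.
    let mut merged = vec![Vec::new(); num_slots];
    for shard in shard_outputs {
        for (slot, rows) in merged.iter_mut().zip(shard) {
            slot.extend(rows);
        }
    }
    merged
}
```

Because the sketch joins the threads in spawn order and the row ranges are contiguous, `partition_morsel` returns exactly what the sequential `partition_by_hash` would for the same input. The flush-on-threshold step is omitted here.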

Behavior

Functionally equivalent for all input sizes:

  • Small inputs (< 32k rows): exact same code path as before
  • Large inputs (>= 32k rows): the K partial partition_by_hash results are merged in shard order into the existing per-slot state. The flush-on-threshold check runs sequentially in the merge loop because state.unaggregated_size is per-slot shared state across shards
  • Partial agg combinators (Sum, Min, Max, Product, BoolAnd/Or, AnyValue) are commutative and associative, so it doesn't matter which shard's chunks end up triggering a flush vs being deferred to the next morsel
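
The order-independence argument can be illustrated with a small sketch. It assumes integer values so that equality is exact, and the `Partial` type and the helpers `partial_agg` and `merge_in_order` are hypothetical names for illustration, not Daft APIs.

```rust
use std::collections::HashMap;

/// Hypothetical partial aggregate state for one group.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Partial {
    sum: i64,
    min: i64,
    max: i64,
}

impl Partial {
    fn of(value: i64) -> Self {
        Partial { sum: value, min: value, max: value }
    }

    /// Commutative and associative: combine(a, b) == combine(b, a).
    fn combine(self, other: Self) -> Self {
        Partial {
            sum: self.sum + other.sum,
            min: self.min.min(other.min),
            max: self.max.max(other.max),
        }
    }
}

/// Partially aggregate one shard of (group key, value) rows.
fn partial_agg(rows: &[(u32, i64)]) -> HashMap<u32, Partial> {
    let mut groups: HashMap<u32, Partial> = HashMap::new();
    for &(key, value) in rows {
        let p = Partial::of(value);
        groups.entry(key).and_modify(|g| *g = g.combine(p)).or_insert(p);
    }
    groups
}

/// Merge per-shard partials into one state, visiting shards in `order`.
fn merge_in_order(shards: &[HashMap<u32, Partial>], order: &[usize]) -> HashMap<u32, Partial> {
    let mut state: HashMap<u32, Partial> = HashMap::new();
    for &idx in order {
        for (&key, &partial) in &shards[idx] {
            state
                .entry(key)
                .and_modify(|p| *p = p.combine(partial))
                .or_insert(partial);
        }
    }
    state
}
```

Merging four shards in the order `[0, 1, 2, 3]` or `[3, 1, 0, 2]` yields the same per-group state, which also equals aggregating all rows in one pass.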

Test Plan

  • cargo build -p daft-local-execution --lib clean
  • cargo fmt -p daft-local-execution clean
  • Sinks have no Rust unit tests; correctness covered by the existing Python integration suite under tests/dataframe/test_groupby*.py and broader query tests

Related Issues

Part of #6585 item 5. Builds on #7060.

@BABTUNA BABTUNA requested a review from a team as a code owner June 6, 2026 20:00
@github-actions github-actions Bot added the feat label Jun 6, 2026
greptile-apps Bot commented Jun 6, 2026

Contributor

Greptile Summary

This PR fans large morsels (≥ 32 768 rows) across 4 concurrent tokio::task workers inside GroupedAggregateSink::sink, applying the same row-range sharding already present in AggThenPartition to the PartitionThenAgg strategy. Both execute_agg_then_partition and execute_partition_then_agg become async; small inputs continue on the original single-threaded path.

  • Both strategies slice the input into up to 4 equal row ranges, spawn independent partition_by_hash (and optionally agg) tasks via JoinSet, then merge shard results back into inner_states with the existing flush-on-threshold logic.
  • CPU-bound partition_by_hash / agg closures are dispatched with JoinSet::spawn rather than spawn_blocking, so they can block tokio's async worker threads. In addition, join_all returns results in completion order, which makes shard-merge ordering non-deterministic across runs.
  • SHARD_THRESHOLD's doc comment still names only AggThenPartition after the constant was promoted to gating both strategies.

Confidence Score: 4/5

The change is functionally correct — partial-agg combinators are commutative so merge order doesn't affect results — but CPU-bound work on tokio's async pool and non-deterministic shard ordering from join_all leave some operational risk around thread-pool starvation and unpredictable memory-flush behaviour under load.

Both strategies produce correct aggregation results for all input sizes. The main concerns are that CPU-bound partition_by_hash/agg closures block tokio's cooperative async workers (should use spawn_blocking), and that JoinSet::join_all delivers shard results in completion rather than spawn order, making the flush-on-threshold trigger non-deterministic run to run. Neither issue causes wrong output, but both could cause latency or memory-usage surprises in production.

src/daft-local-execution/src/sinks/grouped_aggregate.rs — specifically the spawn call sites and the join_all merge loop in both sharding paths.

Important Files Changed

Filename: src/daft-local-execution/src/sinks/grouped_aggregate.rs
Overview: Extends row-range sharding to PartitionThenAgg and AggThenPartition strategies; spawns up to 4 tokio tasks per large morsel using CPU-bound closures on the async thread pool, with non-deterministic join_all completion order in the merge loop.

Reviews (1): Last reviewed commit: "feat(grouped-agg): shard PartitionThenAg..."

Comment on lines +26 to +35
/// Minimum input rows before the `AggThenPartition` strategy fans a single morsel
/// out across multiple shard tasks. Smaller inputs run the existing single-threaded
/// path so the per-task overhead doesn't dominate.
const SHARD_THRESHOLD: usize = 32_768;

/// Number of shard tasks spawned per morsel when the input crosses
/// `SHARD_THRESHOLD`. Fixed rather than tied to `max_concurrency` because the
/// framework already runs `max_concurrency` morsels concurrently; fanning out
/// further per morsel would oversubscribe.
const NUM_SHARDS_PER_MORSEL: usize = 4;

Contributor

P2 Stale doc comment on SHARD_THRESHOLD

The constant's doc comment says "before the AggThenPartition strategy fans…" but after this PR the same constant gates sharding in PartitionThenAgg as well. A reader scanning just the constant will think only AggThenPartition is affected.
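
One possible rewording, offered only as a sketch: the constant's name and value are those quoted above, and the wording of the comment is an assumption about what the maintainers would want.

```rust
/// Minimum input rows before the `AggThenPartition` and `PartitionThenAgg`
/// strategies fan a single morsel out across multiple shard tasks. Smaller
/// inputs run the existing single-threaded path so the per-task overhead
/// doesn't dominate.
const SHARD_THRESHOLD: usize = 32_768;
```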


Comment on lines +173 to +176
tasks.spawn(
    async move { shard.partition_by_hash(&group_by, num_slots) }
        .instrument(Span::current()),
);

Contributor

P2 CPU-bound work (partition_by_hash, agg) is spawned with tokio::task::JoinSet::spawn, which places tasks on tokio's cooperative async thread pool. Because these closures contain no .await points they run to completion without yielding, potentially blocking tokio workers and starving concurrent I/O tasks. For genuinely CPU-bound work, tokio::task::spawn_blocking (which dispatches to a dedicated blocking pool) is the idiomatic choice. The same pattern exists in the AggThenPartition sharding above.

Suggested change
- tasks.spawn(
-     async move { shard.partition_by_hash(&group_by, num_slots) }
-         .instrument(Span::current()),
- );
+ let span = Span::current();
+ tasks.spawn(tokio::task::spawn_blocking(move || {
+     let _enter = span.enter();
+     shard.partition_by_hash(&group_by, num_slots)
+ }).map(|r| r.expect("shard task panicked")));

Contributor Author

Matches the existing pattern in finalize() in the same file, which also dispatches CPU-bound concat+agg work via tokio::task::JoinSet::spawn rather than spawn_blocking. If this is worth changing it would be a sink-wide refactor across all spawn sites, not just this PR.

Contributor

You're right. The finalize() method already uses tokio::task::JoinSet::spawn for CPU-bound concat + agg + eval_expression_list work with no .await points — the same pattern this PR follows. Changing to spawn_blocking would be a sink-wide refactor touching finalize() and all the sharding sites, not something scoped to this PR. Fair point, acknowledged.

Comment on lines +179 to +183
let shard_results: Vec<Vec<MicroPartition>> = tasks
    .join_all()
    .await
    .into_iter()
    .collect::<DaftResult<Vec<_>>>()?;

Contributor

P2 JoinSet::join_all returns results in completion order, not spawn order

JoinSet::join_all() is implemented via repeated join_next(), which yields tasks in the order they finish, not the order they were spawned. For PartitionThenAgg this means the flush-on-threshold logic in the merge loop runs over shards in a non-deterministic order. The PR author correctly argues correctness is preserved because partial-agg combinators are commutative, but the non-deterministic flush ordering makes memory-usage patterns harder to reason about and can make test assertions around intermediate state order-sensitive. Consider using an ordered collection (e.g. collecting into a Vec indexed by shard_idx) so processing order is deterministic.
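
The indexed-collection idea can be sketched without tokio as follows. This is an illustration under stated assumptions, not a drop-in patch: scoped threads and an `mpsc` channel stand in for the JoinSet (the channel, like join_all, delivers results in completion order), and `run_shards_in_spawn_order` is a hypothetical helper name.

```rust
use std::sync::mpsc;
use std::thread;

/// Run `work` on every shard concurrently and return the outputs in shard
/// (spawn) order, regardless of which shard finishes first.
fn run_shards_in_spawn_order<T, R, F>(shards: Vec<T>, work: F) -> Vec<R>
where
    T: Send,
    R: Send,
    F: Fn(T) -> R + Sync,
{
    let num_shards = shards.len();
    let (tx, rx) = mpsc::channel::<(usize, R)>();
    let work = &work;
    thread::scope(|scope| {
        for (shard_idx, shard) in shards.into_iter().enumerate() {
            let tx = tx.clone();
            scope.spawn(move || {
                // Tag the result with its shard index before sending it back.
                let out = work(shard);
                tx.send((shard_idx, out)).expect("receiver dropped");
            });
        }
        // Drop the original sender so the receive loop ends once all shards finish.
        drop(tx);

        // Results arrive in completion order; the index restores spawn order.
        let mut ordered: Vec<Option<R>> = (0..num_shards).map(|_| None).collect();
        for (shard_idx, out) in rx {
            ordered[shard_idx] = Some(out);
        }
        ordered
            .into_iter()
            .map(|slot| slot.expect("shard produced no result"))
            .collect()
    })
}
```

In the tokio setting, the analogous change would be for each spawned task to return its shard_idx alongside its result, so the merge loop can place outputs by index before iterating.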


Contributor Author

Keeping join_all. Correctness is preserved (commutative combinators as noted) and forcing spawn-order via indexed collection adds code without changing observable output. finalize() in the same file uses the same join_all pattern.

@colin-ho colin-ho left a comment

Collaborator

Same comments as #7060: this needs concrete justification that it is better.


2 participants