feat(grouped-agg): shard PartitionThenAgg execution per morsel by BABTUNA · Pull Request #7079 · Eventual-Inc/Daft

BABTUNA · 2026-06-06T20:00:14Z

Summary

Shard the PartitionThenAgg execution path inside GroupedAggregateSink::sink so that a single large input MicroPartition is split by row range and fanned out across 4 concurrent partition_by_hash tasks before the existing flush-on-threshold logic runs. Small inputs continue to run the existing path. The diff touches a single file, src/daft-local-execution/src/sinks/grouped_aggregate.rs. Stacked on #7060.

Why

Continues #6585 item 5. PR #7060 sharded the AggThenPartition strategy (selected for low-cardinality groupbys). This PR extends the same row-range sharding to PartitionThenAgg (selected for high-cardinality groupbys). Both share the same single-threaded partition_by_hash work that benefits from K-way parallelism on large morsels.

Changes Made

- execute_partition_then_agg becomes async; the dispatch in execute_strategy awaits it.
- It branches on input size, reusing the existing SHARD_THRESHOLD = 32_768 and NUM_SHARDS_PER_MORSEL = 4 constants from #7060 (feat(grouped-agg): shard AggThenPartition execution per morsel):
  - < SHARD_THRESHOLD: existing sync logic (unchanged behavior).
  - >= SHARD_THRESHOLD: slice the input by row range into 4 shards and spawn 4 tokio tasks, each running shard.partition_by_hash(group_by, num_slots) and instrumented with the parent span. Then merge each shard's slot-i output into slot i of inner_states using the same flush-on-threshold logic as the sync path.
- PartitionOnly is left unchanged for a follow-up PR.
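The size branch and row-range slicing described above can be sketched roughly as follows. This is an illustration only, not the code in the PR: shard_ranges is a hypothetical helper, and it computes row ranges over a plain row count rather than slicing a real MicroPartition. The two constants carry the values quoted in this description.

```rust
use std::ops::Range;

/// Values quoted in the PR description.
const SHARD_THRESHOLD: usize = 32_768;
const NUM_SHARDS_PER_MORSEL: usize = 4;

/// Hypothetical helper (not from the PR): split `0..num_rows` into at most
/// `num_shards` contiguous, near-equal row ranges. Inputs below the threshold
/// stay in a single range, which models the unchanged single-threaded path.
fn shard_ranges(num_rows: usize, num_shards: usize) -> Vec<Range<usize>> {
    if num_rows < SHARD_THRESHOLD || num_shards <= 1 {
        return vec![0..num_rows];
    }
    // Ceiling division so that the ranges together cover every row.
    let chunk = (num_rows + num_shards - 1) / num_shards;
    (0..num_rows)
        .step_by(chunk)
        .map(|start| start..(start + chunk).min(num_rows))
        .collect()
}
```

Under these assumptions, shard_ranges(32_768, NUM_SHARDS_PER_MORSEL) yields four ranges of 8,192 rows each, while shard_ranges(1_000, NUM_SHARDS_PER_MORSEL) yields the single range 0..1_000.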

Behavior

Functionally equivalent for all input sizes:

Small inputs (< 32k rows): exact same code path as before
Large inputs (>= 32k rows): the K partial partition_by_hash results are merged in shard order into the existing per-slot state. The flush-on-threshold check runs sequentially in the merge loop because state.unaggregated_size is per-slot shared state across shards
Partial agg combinators (Sum, Min, Max, Product, BoolAnd/Or, AnyValue) are commutative and associative, so it doesn't matter which shard's chunks end up triggering a flush vs being deferred to the next morsel
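A toy model of the slot-wise merge may help show why shard order does not change the result. Everything here is a stand-in: partition_rows uses key modulo num_slots in place of a real hash, rows are plain (key, value) pairs, and the only combinator is an integer Sum (integers are used so that floating-point rounding does not enter the picture). Flush-on-threshold is deliberately left out.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for one shard's `partition_by_hash` output:
/// `slots[i]` holds the (key, value) rows that landed in slot i.
type ShardOutput = Vec<Vec<(u64, i64)>>;

/// Toy partitioner: key modulo `num_slots` stands in for a real hash.
fn partition_rows(rows: &[(u64, i64)], num_slots: usize) -> ShardOutput {
    let mut slots = vec![Vec::new(); num_slots];
    for &(key, value) in rows {
        slots[(key % num_slots as u64) as usize].push((key, value));
    }
    slots
}

/// Merge slot i of every shard into slot i of the running state,
/// keeping one partial sum per group key.
fn merge_shards(shards: &[ShardOutput], num_slots: usize) -> Vec<HashMap<u64, i64>> {
    let mut state = vec![HashMap::new(); num_slots];
    for shard in shards {
        for (slot_idx, rows) in shard.iter().enumerate() {
            for &(key, value) in rows {
                *state[slot_idx].entry(key).or_insert(0) += value;
            }
        }
    }
    state
}
```

Because a given key always lands in the same slot and addition is commutative and associative, merging the shards as [a, b] or as [b, a] produces the same per-slot sums in this model.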

Test Plan

- cargo build -p daft-local-execution --lib: clean
- cargo fmt -p daft-local-execution: clean
- Sinks have no Rust unit tests; correctness is covered by the existing Python integration suite under tests/dataframe/test_groupby*.py and broader query tests

Related Issues

Part of #6585 item 5. Builds on #7060.

greptile-apps · 2026-06-06T20:04:28Z

Greptile Summary

This PR fans large morsels (≥ 32 768 rows) across 4 concurrent tokio::task workers inside GroupedAggregateSink::sink, applying the same row-range sharding already present in AggThenPartition to the PartitionThenAgg strategy. Both execute_agg_then_partition and execute_partition_then_agg become async; small inputs continue on the original single-threaded path.

- Both strategies slice the input into up to 4 equal row ranges, spawn independent partition_by_hash (and optionally agg) tasks via JoinSet, then merge shard results back into inner_states with the existing flush-on-threshold logic.
- CPU-bound partition_by_hash / agg closures are dispatched with JoinSet::spawn rather than spawn_blocking, so they can block tokio's async worker threads. In addition, join_all returns results in completion order, making shard-merge ordering non-deterministic across runs.
- SHARD_THRESHOLD's doc comment still names only AggThenPartition, even though the constant now gates both strategies.

Confidence Score: 4/5

The change is functionally correct — partial-agg combinators are commutative so merge order doesn't affect results — but CPU-bound work on tokio's async pool and non-deterministic shard ordering from join_all leave some operational risk around thread-pool starvation and unpredictable memory-flush behaviour under load.

Both strategies produce correct aggregation results for all input sizes. The main concerns are that CPU-bound partition_by_hash/agg closures block tokio's cooperative async workers (should use spawn_blocking), and that JoinSet::join_all delivers shard results in completion rather than spawn order, making the flush-on-threshold trigger non-deterministic run to run. Neither issue causes wrong output, but both could cause latency or memory-usage surprises in production.

src/daft-local-execution/src/sinks/grouped_aggregate.rs — specifically the spawn call sites and the join_all merge loop in both sharding paths.

Important Files Changed

Filename	Overview
src/daft-local-execution/src/sinks/grouped_aggregate.rs	Extends row-range sharding to `PartitionThenAgg` and `AggThenPartition` strategies; spawns up to 4 tokio tasks per large morsel using CPU-bound closures on the async thread pool, with non-deterministic `join_all` completion order in the merge loop.

Reviews (1): Last reviewed commit: "feat(grouped-agg): shard PartitionThenAg..."

greptile-apps · 2026-06-06T20:04:32Z

+/// Minimum input rows before the `AggThenPartition` strategy fans a single morsel
+/// out across multiple shard tasks. Smaller inputs run the existing single-threaded
+/// path so the per-task overhead doesn't dominate.
+const SHARD_THRESHOLD: usize = 32_768;
+
+/// Number of shard tasks spawned per morsel when the input crosses
+/// `SHARD_THRESHOLD`. Fixed rather than tied to `max_concurrency` because the
+/// framework already runs `max_concurrency` morsels concurrently; fanning out
+/// further per morsel would oversubscribe.
+const NUM_SHARDS_PER_MORSEL: usize = 4;


Stale doc comment on SHARD_THRESHOLD

The constant's doc comment says "before the AggThenPartition strategy fans…" but after this PR the same constant gates sharding in PartitionThenAgg as well. A reader scanning just the constant will think only AggThenPartition is affected.

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

greptile-apps · 2026-06-06T20:04:33Z

+            tasks.spawn(
+                async move { shard.partition_by_hash(&group_by, num_slots) }
+                    .instrument(Span::current()),
+            );


CPU-bound work (partition_by_hash, agg) is spawned with tokio::task::JoinSet::spawn, which places tasks on tokio's cooperative async thread pool. Because these closures contain no .await points they run to completion without yielding, potentially blocking tokio workers and starving concurrent I/O tasks. For genuinely CPU-bound work, tokio::task::spawn_blocking (which dispatches to a dedicated blocking pool) is the idiomatic choice. The same pattern exists in the AggThenPartition sharding above.

Suggested change

-            tasks.spawn(
-                async move { shard.partition_by_hash(&group_by, num_slots) }
-                    .instrument(Span::current()),
-            );
+            let span = Span::current();
+            tasks.spawn(tokio::task::spawn_blocking(move || {
+                let _enter = span.enter();
+                shard.partition_by_hash(&group_by, num_slots)
+            }).map(|r| r.expect("shard task panicked")));

BABTUNA · 2026-06-06

This matches the existing pattern in finalize() in the same file, which also dispatches CPU-bound concat+agg work via tokio::task::JoinSet::spawn rather than spawn_blocking. If this is worth changing, it would be a sink-wide refactor across all spawn sites, not just this PR.

greptile-apps · 2026-06-06

You're right. The finalize() method already uses tokio::task::JoinSet::spawn for CPU-bound concat + agg + eval_expression_list work with no .await points, which is the same pattern this PR follows. Changing to spawn_blocking would be a sink-wide refactor touching finalize() and all the sharding sites, not something scoped to this PR. Fair point, acknowledged.

greptile-apps · 2026-06-06T20:04:34Z

+        let shard_results: Vec<Vec<MicroPartition>> = tasks
+            .join_all()
+            .await
+            .into_iter()
+            .collect::<DaftResult<Vec<_>>>()?;


JoinSet::join_all returns results in completion order, not spawn order

JoinSet::join_all() is implemented via repeated join_next(), which yields tasks in the order they finish, not the order they were spawned. For PartitionThenAgg this means the flush-on-threshold logic in the merge loop runs over shards in a non-deterministic order. The PR author correctly argues correctness is preserved because partial-agg combinators are commutative, but the non-deterministic flush ordering makes memory-usage patterns harder to reason about and can make test assertions around intermediate state order-sensitive. Consider using an ordered collection (e.g. collecting into a Vec indexed by shard_idx) so processing order is deterministic.

BABTUNA · 2026-06-06

Keeping join_all. Correctness is preserved (the combinators are commutative, as noted), and forcing spawn order via an indexed collection adds code without changing observable output. finalize() in the same file uses the same join_all pattern.
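For reference, the "indexed by shard_idx" alternative raised in the review can be sketched as below. This is not the PR's code and does not use tokio: std threads and an mpsc channel stand in for JoinSet, with the channel delivering results in completion order much as join_all does. The function name and the sleep-based skew are made up for the illustration.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run one worker per shard and return the partial results in spawn order,
/// even though the channel delivers them in completion order.
fn run_shards_in_spawn_order(shards: Vec<Vec<i64>>) -> Vec<i64> {
    let num_shards = shards.len();
    let (tx, rx) = mpsc::channel();
    for (shard_idx, shard) in shards.into_iter().enumerate() {
        let tx = tx.clone();
        thread::spawn(move || {
            // Earlier shards sleep longer, so they tend to finish last.
            thread::sleep(Duration::from_millis(5 * (num_shards - shard_idx) as u64));
            let partial: i64 = shard.iter().sum();
            // Tag the result with its shard index before sending it back.
            tx.send((shard_idx, partial)).expect("receiver dropped");
        });
    }
    // Drop the original sender so the receive loop ends once every worker is done.
    drop(tx);

    // Place each result at its shard index instead of appending on arrival.
    let mut ordered = vec![0; num_shards];
    for (shard_idx, partial) in rx {
        ordered[shard_idx] = partial;
    }
    ordered
}
```

The cost is one index per result and a pre-sized output vector. Whether that is worth it is the trade-off debated in this thread, since the aggregated output is the same either way.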

colin-ho · 2026-06-09

Same comments as on #7060: this needs concrete justification that it is better.

BABTUNA added 5 commits June 3, 2026 07:29

feat(grouped-agg): shard AggThenPartition execution per morsel

b926b9c

chore(grouped-agg): instrument sharded tasks with parent span

ade6c28

ci: retrigger to clear flaky HuggingFace 429 and Ray actor timeout

59a03b2

ci: retrigger to clear flaky HF 429 and Ray actor timeout

f8a9185

feat(grouped-agg): shard PartitionThenAgg execution per morsel

cfbc890

BABTUNA requested a review from a team as a code owner June 6, 2026 20:00

github-actions Bot added the feat label Jun 6, 2026

greptile-apps Bot reviewed Jun 6, 2026

BABTUNA added 2 commits June 6, 2026 14:45

Merge remote-tracking branch 'origin/main' into perf/sharded-partition-then-agg

77e9854

chore(grouped-agg): clarify SHARD_THRESHOLD doc covers both strategies

3ad1f42

colin-ho reviewed Jun 9, 2026

BABTUNA wants to merge 7 commits into Eventual-Inc:main from BABTUNA:perf/sharded-partition-then-agg
