Σ.G — Generic fused aggregate¶
Goal: retire the 5+ InjectFusedQN per-query rules and replace
them with a single shape-parameterised FusedAggregateExec + a
cost-aware planner rule that emits it whenever the shape qualifies.
Why now: the SF=1 TPC-H triangulation (2026-05-17 AWS campaign) confirms ematix-flow's fused path beats DuckDB on 11+ queries and Polars on 12+. The direction works. Per-query rules were the right first move; generalising them is the next.
Non-goal: changing what the fused operators do. Same kernels, same Arrow code paths — the win came from the operators themselves, not the routing.
Inventory: what exists today¶
| Rule | Matches | Operator |
|---|---|---|
InjectFusedQ1 |
lineitem GROUP BY (l_returnflag, l_linestatus) with 8 aggregates |
FusedQ1Exec |
InjectFusedQ3 |
lineitem ⋈ orders ⋈ customer filter+SUM+SUM |
FusedQ3Exec |
InjectFusedQ5 |
6-way join with region filter | FusedQ5Exec |
InjectFusedQ6 |
lineitem filter + SUM(l_extendedprice * l_discount) |
FusedQ6Exec |
InjectFusedQ12 |
lineitem filter + 2× conditional COUNT |
FusedQ12Exec |
EnableDictGroupCountRule |
any aggregate with Dictionary<UInt32, Utf8> group key |
DictGroupCountExec |
Shared: AggregateShapeConfig walker recognises group keys,
aggregates, and filters declaratively. Adding rule #7 today is ~30
lines instead of the original ~150. But each rule still produces a
different physical operator.
What "generic" actually means¶
A single FusedAggregateExec parameterised by:
pub struct FusedAggregateExec {
input: Arc<dyn ExecutionPlan>,
shape: AggregateShapeConfig, // group keys, agg defs, filter
decode_strategy: DecodeStrategy, // late-mat | dict-preserving | plain
join_strategy: Option<FusedJoinSpec>, // for Q3/Q5/Q9 multi-way
}
The planner emits FusedAggregateExec { shape: SHAPE_Q1, ... } for
any plan whose subtree matches Q1's pattern — regardless of source
table. The shape configs themselves stay declarative; the operator
implementation handles all of them.
Phases¶
Σ.G.1 — Operator audit (~2 days, no behaviour change)¶
Read all 5 FusedQN operators side-by-side. Document the shared
subset:
- Common: decode → predicate → group hash → accumulate
- Differs: which decode kernel, which predicate shape, how many
aggregates, whether there's a join
Output: docs/FUSED_AGGREGATE_SHAPES.md — a table mapping each
operator's hot loop to the abstractions a unified executor would
need. Identifies whether one operator is a strict superset (probably
Q1 — most general aggregate set) or whether they're orthogonal.
Σ.G.2 — Unify single-table fused aggregates (~4 days)¶
Q1, Q6, Q12 are all single-table → filter → group → agg. Collapse
to one operator first.
- Add
FusedAggregateExecwith a single hot loop that runs the filter + group + accumulate, parameterised byAggregateShapeConfig - Update
InjectFusedQ1,InjectFusedQ6,InjectFusedQ12to all emit the unified operator with their respective shape configs - Delete
FusedQ1Exec,FusedQ6Exec,FusedQ12Exec - Gate: TPC-H SF=1 triangulation, no regression > 5% on Q1, Q6, Q12 (use the SF=1 BENCHMARKS.md from the 2026-05-17 campaign as oracle: Q1=78ms, Q6=12ms, Q12=23ms)
Σ.G.3 — Add join shape (~3 days)¶
Q3 and Q5 add joins. Extend FusedAggregateExec with an optional
FusedJoinSpec describing the join graph + filter pushdown.
- Q3: 3-way
lineitem ⋈ orders ⋈ customer - Q5: 6-way with region filter
- Single join-aware fused operator handles both shapes by config
Gate: Q3 (20ms SF=1), Q5 (34ms SF=1) — no regression > 5%.
Σ.G.4 — Cost-driven rule (~5 days, the speculative bet)¶
Replace InjectFusedQ1 / Q3 / Q5 / Q6 / Q12 with a single
InjectFusedAggregate rule that:
1. Walks any AggregateExec(Partial) subtree
2. Calls AggregateShapeConfig::detect(node) — returns
Some(shape) if the subtree matches a known pattern
3. If matched + cardinality estimate clears threshold, emit
FusedAggregateExec { shape, ... }
The detect function is the hard part — it has to handle aggregate
+ filter + join patterns without hard-coding TPC-H query shapes.
Most likely outcome: it recognises families of shapes (single-table
aggregate, star-join aggregate, conditional-count aggregate) — not
every conceivable shape, but a meaningful coverage.
Gate: TPC-H 22 SF=1 BENCHMARKS.md — total median wall time
within 10% of the 2026-05-17 baseline. Per-query regressions > 20%
on any of Q1/Q3/Q5/Q6/Q12 are blocking; non-TPC-H benches (e.g.
flow_run_bench synthetic shapes) should gain coverage since the
shape-family detector finds more matches than the hand-written rules.
Σ.G.5 — Q14 + Q9 follow-up (~3 days, opportunistic)¶
Q14 already has a fused operator (FusedQ14FullExec — currently
retired per task #455). Re-add as a FusedAggregateExec shape config
matching lineitem ⋈ part filter + conditional SUM. Same for Q9
(PartSupp join with year extract + SUM).
Gate: Both queries already win at SF=1 (Q14: 19ms beat DuckDB 31ms; Q9: 50ms beat DuckDB 71ms) — keep those margins.
Risks¶
- Hot-loop regression — bundling 5 specialised operators into 1
risks the compiler not specialising as aggressively. Mitigation:
per-shape codegen via
#[inline(always)]+ monomorphised dispatch keyed on aconst SHAPE: u8discriminant. If LLVM doesn't specialise, fall back to per-shapematcharms inside the hot loop (the approach polars uses). - Shape-detection false positives —
AggregateShapeConfig::detectfiring on a non-TPC-H plan that doesn't actually benefit. Mitigation: feature-gate cost-driven dispatch behind--features adaptive_fusedfor the first 2-3 weeks; default off, opt-in via session config. - TPC-H regression — by definition the hand-tuned
FusedQ1Execis as fast as the generic version can be at the same shape. If we lose 5-10% by going generic, that's "we made the engine more honest at a small cost." Acceptable. Losing 20%+ is not. - Σ.G.4 doesn't generalise — the planner rule turns out to need
a per-shape lookup table anyway. Mitigation: even if cost-driven
dispatch falls short, the unified
FusedAggregateExecfrom Σ.G.2-3 is still a clean code-size + maintainability win. Σ.G.4 is the stretch phase; G.2-3 stand on their own.
What we publish¶
docs/FUSED_AGGREGATE_SHAPES.md— the shape catalog (a reference doc that grows over time, like a query optimiser's pattern library)- A README section: "Fused execution: from query-specific to shape-driven" — addresses the "is this still bandaid" critique honestly
- Updated TPC-H BENCHMARKS.md if numbers shift
What we DON'T do in this phase¶
- Generic vectorised aggregate at the kernel level — that's Photon's 5-year project. Σ.G stops at "shape-parameterised operator + cost-driven dispatch". The kernels stay shape-specific.
- Cardinality estimation infrastructure — we use simple selectivity heuristics for Σ.G.4 (e.g. "filter selectivity > 0.1 enables fused path"). Real CBO is its own initiative.
- Aggregate spilling, AQE-style adaptive replans — separate follow-ups; the fused path runs in-memory only for now.
Decisions to lock before starting¶
| # | Question | Default |
|---|---|---|
| 1 | When does Σ.G.2 fold in EnableDictGroupCountRule? |
Σ.G.3 or later — dict-preservation is orthogonal, can stay separate until G.4. |
| 2 | Do we keep per-rule names as deprecation aliases? | One release cycle — InjectFusedQ1 becomes an alias for InjectFusedAggregate { shape: SHAPE_Q1 }. Delete after Σ.G.5. |
| 3 | Where does AggregateShapeConfig::detect live? |
New module crates/ematix-flow-core/src/planner/fused_shape.rs. Pure function — no global state, easy to test. |
| 4 | Test strategy | Same fixture-based oracle tests we use for the rules today; add a "detect" unit test per shape that exercises the matcher in isolation. |
Engineering total¶
Σ.G.1 + G.2 + G.3 = ~9 working days for the deterministic part. Σ.G.4 + G.5 = ~8 working days, with G.4 carrying real risk of extending another week if shape detection turns out to be hard.
Worth doing once the current AWS campaign closes out (i.e. the SF=1 + SF=10 BENCHMARKS.md are in main) so we have a clean oracle.
Locking sequence¶
- Land current AWS campaign cleanup — SF=10 re-run (next campaign), bench.sh fixes ([[project_aws_prebuilt_target_follow_up]])
- Σ.G.1 audit — read-only, no code change. Output a shape catalog.
- Decide whether to commit to Σ.G.2-5 based on what the audit reveals. If 4 of 5 operators are basically identical, G.2 is a single-day unify. If they're heterogeneous, G.1's audit might recommend keeping them separate.