Skip to content

Π — Aggregate (and join) spilling

Goal: ematix-flow doesn't OOM when the hash table doesn't fit in memory. Spill to local SSD by default; S3 as a backstop via the existing LambdaExecutor / S3RunLog plumbing.

Status: stub. Will be promoted to a full plan after Σ.G + Φ.1 have landed and we have a clean baseline.

Why it's its own phase

  • Production correctness: any user with a SF=100+ workload hits OOM on aggregate hash tables today.
  • Independent of Σ.G + Φ: spilling is a runtime infrastructure concern, not a kernel or planner concern. Different engineer can work on it in parallel.
  • Different test surface: needs fault-injection harness (kill the process mid-bucket-flush, recover, verify result), not microbenches.

Scope (rough)

Sub-phase What Effort
Π.1 Spill-aware AggregateExec: external sort + run merging to local SSD when hash size > threshold ~2 wk
Π.2 Same for hash join ~2 wk
Π.3 S3 backstop for the "local SSD ran out too" case — chunk the spill to S3 with multipart upload, reuse the [[S3RunLog]] pattern ~2 wk
Π.4 AQE-style retry: if a hash exec spills, restart with higher partition count and skip spilling ~3 wk

Total: ~9 wk for the full surface.

Non-goals

  • Distributed shuffle spilling — overlaps with the Phase Z distributed campaign. That work owns network-side spilling; this phase owns local SSD + S3 spilling for single-host workers.
  • DataFusion's MemoryReservation quota system — DF already has one. We integrate, not replace.

Gate

  • TPC-H SF=100 (~75 GB) runs to completion on a 16 GB t3.xlarge without OOM
  • Comparison: how does our spilling overhead compare to DuckDB and Spark at SF=100? (Industry knows Spark's spilling is heavyweight, DuckDB's is competitive)

What we don't know yet

  • The right threshold for spill decision (memory pressure %? hash cardinality estimate?)
  • Whether we need our own spill format or can reuse Arrow IPC
  • How aggressive the AQE retry should be

These get scoped in the planning doc proper.

Sequencing

  • Prerequisites: none — independent of Σ.G + Φ. Could start any time after the current AWS campaign closes out.
  • Why we defer: Σ.G + Φ deliver visible wins on existing TPC-H numbers, which the team needs to validate direction. Π pays off on workloads we don't currently benchmark (SF=100+). Lower immediate signal-to-effort.