AWS campaign harness — fixes to land before the next run
Issues caught during the 2026-05-17 SF=1 + SF=10 validation campaign (instance i-0cc2f02c07d56a083). Each item below either silently lost data, wasted compute, or both. Fix them in a single infra PR before re-running the campaign.
Build-side wastes
F.1 — Pre-build target.tar.gz via GitHub Actions, ship via S3
Today: every fresh EC2 boot runs cargo build --release --workspace
from scratch — ~15 min, identical work each time.
Fix: GHA workflow on push-to-main builds the workspace on native
x86 Linux runners, tars target/release/, uploads to a dedicated
S3 bucket keyed by commit SHA. Userdata downloads + extracts before
running bench.sh.
Savings: ~10 min per campaign. The full ~15 min is not recovered
because the cargo test run in stage 01 still has to compile test
binaries from #[cfg(test)] code.
Setup cost: ~2 hr one-time GHA workflow + IAM grant.
Alternatives rejected:
- Docker buildx on Mac: Apple Silicon → qemu emulation, ~3-5× slower than native EC2.
- Cross-compile with zig: works for the Lambda wheel (single binary), but the workspace's C deps (deltalake, datafusion, ematix-parquet, libpq, sasl2, libsqlite3-sys, rdkafka-sys) are too fragile.
- "Bake docker image at end of first successful boot + reuse via ECR": clean, but the image is 3-5 GB vs ~500 MB for the GHA tarball.
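A minimal sketch of the userdata side of this fix, assuming a hypothetical bucket name (`ARTIFACT_BUCKET`) and a hypothetical key layout of `target/<sha>.tar.gz`. The function names are illustrative; only `aws s3 cp`, `mktemp`, and `tar` are real commands here.

```shell
# Sketch only: ARTIFACT_BUCKET and the key layout are assumptions, not the
# real bucket or naming scheme.
ARTIFACT_BUCKET="${ARTIFACT_BUCKET:-example-ematix-build-artifacts}"
AWS_BIN="${AWS_BIN:-aws}"

# Build the S3 URL for the prebuilt tarball of one commit.
target_tarball_url() {
  local sha="$1"
  echo "s3://${ARTIFACT_BUCKET}/target/${sha}.tar.gz"
}

# Download the tarball and extract it into the repo root. Returns non-zero
# if the download or the extraction fails, so the caller can fall back to a
# normal cargo build.
fetch_prebuilt_target() {
  local repo_dir="$1" sha="$2"
  local tmp rc=0
  tmp="$(mktemp)" || return 1
  if ! "$AWS_BIN" s3 cp "$(target_tarball_url "$sha")" "$tmp"; then
    rm -f "$tmp"
    return 1
  fi
  tar -xzf "$tmp" -C "$repo_dir" || rc=$?
  rm -f "$tmp"
  return "$rc"
}

# Possible use in userdata, before running bench.sh:
#   sha="$(git -C "$repo" rev-parse HEAD)"
#   fetch_prebuilt_target "$repo" "$sha" || cargo build --release --workspace
```

Keeping the cargo build as a fallback means a missing or stale tarball costs time rather than the whole campaign.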
F.2 — Drop stage 01 cargo test --release --workspace OR use a faster profile
Today: stage 01 runs the full workspace test suite with the LTO release profile. ~30 min wall-clock (most of it LTO-link of test binaries). CI already covers this on every PR.
Fix: either drop the stage (CI is authoritative for unit-test
correctness on x86) or use --profile=release-no-lto purely for
testing.
Savings: ~20 min per campaign.
F.2a — Principle: nothing the campaign runs should overlap with CI
Today: stages 01 (cargo test --release --workspace) and 08
(cargo test ... dict_preservation_end_to_end) both ran tests that
CI already exercises on every PR. Stage 01 ate ~30 min; stage 08
ate another ~25 min before being SIGTERM'd. ~1 hour of paid EC2
time on work CI already does for free.
Principle for future campaign design: the campaign budget is for things CI can't do — x86 AVX-512 hardware, real-S3 / real-Lambda / real-EKS, SF=10+ data scales. Unit-test correctness on x86 is CI's job. If a stage's exit criterion is "tests pass," it doesn't belong in the campaign.
Concrete fix for the next campaign:
- Drop stage 01 (cargo test --release --workspace) entirely.
- Drop stage 08 (cargo test ... dict_preservation_end_to_end)
entirely.
- For the dict-preservation validation we do care about, run the
TPC-H triangulation with with_dict_preservation(true) and assert
EnableDictGroupCountRule fires — that's a behavioural check,
not a unit-test check, and CI can't run it (CI doesn't have
SF=10 lineitem on disk).
- Audit each remaining stage: does its failure mode reproduce in CI?
If yes, drop it. If no, keep it.
Savings: ~50 min per campaign, plus removes ambiguity about
"what does this stage even prove." Compounds with F.1 (prebuilt
target) — if we don't run cargo test, we don't need the test
binaries pre-built either.
F.3 — Consolidate feature sets to avoid 3 separate compiles
Today:
- userdata builds with default features
- stages 03/03a/03b invoke cargo run against ematix-parquet
examples with various feature combos
- stages 06/07 invoke cargo run --features triangulation on
ematix-flow-core
- Each new combination triggers a full re-codegen of dependent
crates.
Fix: at userdata level, do one cargo build --release --workspace
--all-features (or the union of needed features). Then each bench
stage invokes cargo run --no-default-features --features <stage>
against the pre-built target — should be a near-instant link.
Savings: ~10 min per campaign (compounds with F.1).
F.4 — Stages share compile artefacts via cargo test --no-run
Today: stage 08 (dict-preservation test) recompiles
ematix-flow-core's test binary from scratch, even though stage 01
already had it in flight. Cargo's build cache reuses, but the
cargo test driver re-evaluates everything.
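Fix: add a "warmup" step at the top of bench.sh that runs
cargo test --no-run across the workspace, with the same --release
profile the later stages use. This compiles every test binary once,
so downstream cargo test --release -p X my_test runs are near-instant.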
Savings: ~15 min per campaign (compounds with F.1, F.2).
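The warmup step described above might look like the following sketch. The function name and the `CARGO_BIN` override are illustrative, and the exact flag set is an assumption chosen to match the later `cargo test --release` stages; `--no-run` is Cargo's flag for compiling tests without executing them.

```shell
# Sketch only: warm_test_binaries and CARGO_BIN are illustrative names.
CARGO_BIN="${CARGO_BIN:-cargo}"

# Compile (but do not run) every test binary in the workspace once, using
# the same profile as the later cargo test stages so artefacts are shared.
warm_test_binaries() {
  "$CARGO_BIN" test --release --workspace --no-run
}

# At the top of bench.sh, before any stage runs:
#   warm_test_binaries
```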
Stage-ordering + data-flow bugs
F.5 — Move stages 03 + 03a after stage 05 (TPC-H data must exist first)
Today: stages 03-emat-parquet-avx-benches and
03a-emat-adaptive-dispatch-bench require TPCH_DATA_DIR to exist,
but stages 04-tpch-generate-sf1 and 05-tpch-generate-sf10 come
AFTER them. Both 03 and 03a ran and exited rc=0 with "TPC-H data not
found", silently losing the AVX bench data for the 2026-05-17 campaign.
Fix: either renumber 03/03a to land after 05, or add a guard at the top of each that checks for TPCH_DATA_DIR and exits 0 when the data is missing.
The exit-0 path makes the dependency declarative; renumbering makes it implicit. Renumbering preferred for clarity.
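If the guard route is taken, one possible shape is sketched below. The function name and the marker text are assumptions, not the guard from the patched scripts. The point of the greppable marker is that an exit-0 skip stays visible in the log instead of reading like a successful run.

```shell
# Sketch only: require_tpch_data and the marker text are assumptions.
# Returns 0 when TPCH_DATA_DIR points at an existing, non-empty directory;
# otherwise prints a greppable marker to stderr and returns 1.
require_tpch_data() {
  local stage="$1"
  if [ -n "${TPCH_DATA_DIR:-}" ] && [ -d "$TPCH_DATA_DIR" ] \
     && [ -n "$(ls -A "$TPCH_DATA_DIR" 2>/dev/null)" ]; then
    return 0
  fi
  echo "STAGE SKIPPED stage=$stage reason=missing-tpch-data dir=${TPCH_DATA_DIR:-unset}" >&2
  return 1
}

# At the top of a data-dependent stage script:
#   require_tpch_data 03-emat-parquet-avx-benches || exit 0
```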
F.6 — Per-scale-factor BENCHMARKS-SF{N}.md + explicit S3 upload
Today: stage 06 (SF=1 triangulation) writes
/opt/ematix/ematix-flow/BENCHMARKS.md. Stage 07 (SF=10) writes to
the same path — 07 overwrites 06. Worse, bench.sh only uploads
stdout (which is tail -400-truncated, losing per-query results
beyond Q05).
The 2026-05-17 campaign saved SF=1 via emergency SSM snapshot but lost the SF=10 BENCHMARKS.md because stage 07 apparently didn't reach end-of-main.
Fix:
1. Add an --output <path> flag to the tpch_triangulation_bench
example so the stage harness can pass in the path.
2. bench.sh stage 06: --output BENCHMARKS-SF1.md; stage 07:
--output BENCHMARKS-SF10.md.
3. After each stage, explicitly upload that stage's file
(BENCHMARKS-SF1.md or BENCHMARKS-SF10.md) to S3.
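Steps 2 and 3 can be sketched as below. The helper names are illustrative, `upload` is assumed to be the existing bench.sh helper taking a source path and a destination key, and the `--output` flag is the one proposed in step 1, so it does not exist yet.

```shell
# Sketch only: benchmarks_file and publish_benchmarks are illustrative names.
# upload() is assumed to be the existing bench.sh helper (src, dst).

# Map a scale factor to its own results file so SF=10 cannot overwrite SF=1.
benchmarks_file() {
  local sf="$1"
  echo "BENCHMARKS-SF${sf}.md"
}

# Upload one scale factor's results file right after its stage finishes.
# Returns 1 (and says why) if the stage never produced a non-empty file.
publish_benchmarks() {
  local out_dir="$1" sf="$2"
  local file
  file="$(benchmarks_file "$sf")"
  if [ ! -s "$out_dir/$file" ]; then
    echo "MISSING RESULTS sf=$sf path=$out_dir/$file" >&2
    return 1
  fi
  upload "$out_dir/$file" "$file"
}

# After stage 07, for example:
#   publish_benchmarks /opt/ematix/ematix-flow 10
```

Checking for a non-empty file before uploading turns "stage 07 never wrote its results" into an explicit log line rather than a quiet gap in S3.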
F.7 — Investigate why stage 07 stopped at Q05
Today: stage 07 SF=10 triangulation ran ~90 sec, exited rc=0, but stdout shows only Q01-Q05 of the 22 queries. Two possibilities: (a) the bench actually completed 22 queries and stdout was tail-truncated (but then BENCHMARKS.md should be SF=10 content, which it wasn't); (b) something crashed mid-bench at Q06+ and the BENCHMARKS.md write was never reached.
Fix path: before re-running the campaign, instrument
tpch_triangulation_bench to write per-query incremental progress
to a file (not just at end), so a crash at Q12 still leaves Q01-Q11
on disk.
Upload-path bugs
F.8 — Don't swallow upload errors (already in place locally, not on main)
Already fixed in the patched bench.sh I uploaded to S3 as
fixes/bench.sh during the campaign, but the fix lives on the
local Mac, not on main. The fix:
    upload() {
      local src="$1" dst="$2" rc=0
      # Capture the exit code directly: inside `if ! cmd; then`, $? is always 0.
      "$AWS_BIN" s3 cp "$src" "s3://$BUCKET/$PREFIX/$dst" --region "$REGION" \
        >>"/tmp/upload-errors.log" 2>&1 || rc=$?
      if [ "$rc" -ne 0 ]; then
        echo "UPLOAD FAIL rc=$rc src=$src dst=$dst" >> "/tmp/upload-errors.log"
      fi
    }
Plus an aws --version sanity check at the top of bench.sh that
exits with code 3 if the CLI isn't on PATH. Bring this onto main as
part of the infra PR.
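A sketch of what that sanity check could look like, assuming the same `AWS_BIN` variable the upload helper uses. The function name is illustrative; the exit code 3 follows the description above.

```shell
# Sketch only: require_aws_cli is an illustrative name.
AWS_BIN="${AWS_BIN:-aws}"

# Returns 3 when the AWS CLI cannot be run, 0 otherwise.
require_aws_cli() {
  if ! "$AWS_BIN" --version >/dev/null 2>&1; then
    echo "FATAL: AWS CLI not runnable as '$AWS_BIN'" >&2
    return 3
  fi
}

# At the top of bench.sh:
#   require_aws_cli || exit 3
```

Failing before the first stage means a broken CLI costs seconds, not a full campaign whose results never reach S3.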
F.9 — stage() should not toggle set -e
Already fixed locally. Original stage() did set +e; bash -c …;
set -e — but the outer script never had -e enabled, so the trailing
set -e armed a foot-gun that would fire on the next non-zero
return from any command. Cause of mysterious "campaign ran for a
while then died" failures on prior boots. Fix is to remove the
set +e/-e toggle entirely.
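A minimal sketch of a stage() that records the exit code without touching errexit. The signature (stage name, then a command string) is an assumption about bench.sh, not a copy of it.

```shell
# Sketch only: this stage() signature is an assumption about bench.sh.

# Run one stage in a child bash, record its exit code, and leave the
# caller's shell options (including errexit) exactly as they were.
stage() {
  local name="$1" cmd="$2"
  local rc=0
  bash -c "$cmd" || rc=$?
  echo "stage=$name rc=$rc"
  return 0
}

# Example:
#   stage 06-tpch-sf1 "cargo run --release --features triangulation ..."
```

The `|| rc=$?` form captures the failure whether or not the caller has -e enabled, so the function behaves the same in both cases.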
F.10 — Glob-resilient chmod in userdata heredoc
Already fixed locally. chmod +x .../lambda/*.sh under set -e
aborted the whole bench if the directory was empty: the unmatched
glob stays as the literal string *.sh, and chmod fails on that
nonexistent file.
Fix: 2>/dev/null || true on each chmod.
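The same fix wrapped in a small helper, as a sketch; the function name is illustrative.

```shell
# Sketch only: make_scripts_executable is an illustrative name.

# chmod +x every *.sh in a directory. An empty or missing directory is not
# an error, so this is safe to call under `set -e`.
make_scripts_executable() {
  local dir="$1"
  chmod +x "$dir"/*.sh 2>/dev/null || true
}
```

The trade-off is that a genuinely failing chmod is also silenced, which seems acceptable here because a missing execute bit surfaces immediately when the stage tries to run the script.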
F.11 — Chown $LOG to ec2-user before the sudo-to-ec2-user heredoc
Already fixed locally. tee -a "$LOG" inside the sudo-to-ec2-user
RUNBENCH heredoc failed with "Permission denied" against the
root-owned /var/log/ematix-bench.log. pipefail propagated the
failure, aborting the bench before any S3 upload happened. Fix:
chown $LOG to ec2-user before entering the heredoc.
Toolchain / dependency traps
F.12 — perl-FindBin + OPENSSL_NO_VENDOR=1 for AL2023 openssl-sys
Already in patched userdata.sh. AL2023's perl ships without the
FindBin core module, which openssl-sys's vendored build needs. Four
packages (perl-FindBin, perl-IPC-Cmd, perl-File-Compare,
perl-File-Copy) were added to the dnf install list, and
OPENSSL_NO_VENDOR=1 is exported before cargo build so openssl-sys
uses the system OpenSSL instead of building the vendored copy.
F.13 — cmake for rdkafka-sys
Already in patched userdata.sh. rdkafka-sys uses CMake to
build its bundled librdkafka. AL2023 doesn't ship cmake by default.
Added to the dnf install list.
F.14 — ${KUBECTL_VERSION} shell-var escape in terraform-templated userdata
Already fixed. Terraform's templatefile() walks the entire
template, comments included, and interprets every ${X} as a variable
substitution. For shell variables that need to survive, escape ${
as $${ (for example, $${KUBECTL_VERSION}). This was the cause of the
first boot's kubectl install failure.
F.15 — Docker socket perms for sudo'd ec2-user mid-userdata
Already in patched userdata.sh. usermod -aG docker ec2-user
only takes effect at next login. Every sudo -u ec2-user bash later
in userdata inherits the supplementary-group set from the existing
session — no docker group, so docker buildx build fails. Fix:
wait for dockerd ready then chmod 666 /var/run/docker.sock. Risk
acceptable on single-tenant short-lived EC2.
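A sketch of the wait-then-chmod step. The function name, the `DOCKER_BIN` and `DOCKER_SOCK` overrides, and the 60-attempt default are illustrative; `docker info` is used as the readiness probe because it only succeeds once the daemon answers.

```shell
# Sketch only: wait_for_docker, DOCKER_BIN, DOCKER_SOCK and the 60-attempt
# default are illustrative; the chmod 666 mirrors the fix described above.
DOCKER_BIN="${DOCKER_BIN:-docker}"
DOCKER_SOCK="${DOCKER_SOCK:-/var/run/docker.sock}"

# Poll `docker info` until the daemon answers, then open up the socket.
# Returns 1 if the daemon is still not ready after max_attempts tries.
wait_for_docker() {
  local max_attempts="${1:-60}" delay="${2:-1}"
  local attempt=1
  until "$DOCKER_BIN" info >/dev/null 2>&1; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "dockerd not ready after $attempt attempts" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  chmod 666 "$DOCKER_SOCK"
}

# In userdata, as root, before the first sudo -u ec2-user docker call:
#   wait_for_docker || exit 1
```

Bounding the wait keeps a dockerd that never starts from hanging the boot indefinitely on a paid instance.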
F.16 — Commit infra/test-validation/ to main BEFORE running
Today (the worst finding from 2026-05-17): all Phase B/C/D
stages (10 s3-runlog, 20-22 Lambda, 30-31 K8s, 40 distributed-e2e)
failed with No such file or directory because the entire
infra/test-validation/ directory is untracked in git. The EC2's
git clone origin/main returned a tree without the test scripts, the
Lambda packager, the K8s submit script, or the Python integration
tests. Only bench.sh worked because I uploaded it to S3 and
SSM-copied it during the in-place fix.
Effective campaign value lost: 5 of 7 phases (Phase B, C, D x2, distributed-e2e) — the entire "real AWS service" half of the campaign. Only Phase A actually exercised real infrastructure.
Fix: PR the infra/test-validation/ directory to main as part
of the campaign-cleanup PR. Each script lives where bench.sh expects
it:
- infra/test-validation/scripts/bench.sh
- infra/test-validation/lambda/{build-package.sh,deploy.sh,handler.py}
- infra/test-validation/k8s/build-and-push.sh
- infra/test-validation/tests/{test_s3_runlog_real,test_lambda_real,test_k8s_real,test_distributed_e2e}.py
- infra/test-validation/Dockerfile.flow-worker
- infra/test-validation/main.tf + variables + outputs + IAM policies
Smoke gate: before the next paid EC2 run, git ls-files infra/test-validation/
should return all the script paths listed above.
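The smoke gate can be made mechanical with a small check like the sketch below. The function name is illustrative; `git ls-files --error-unmatch` exits non-zero when a path is not in the index.

```shell
# Sketch only: assert_tracked is an illustrative name; the path list is
# whatever bench.sh expects to find after `git clone`.

# Returns 0 only if every given path is tracked in the repo at repo_dir.
assert_tracked() {
  local repo_dir="$1"; shift
  local missing=0 path
  for path in "$@"; do
    if ! git -C "$repo_dir" ls-files --error-unmatch -- "$path" >/dev/null 2>&1; then
      echo "UNTRACKED: $path" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example call from the repo root before launching the instance:
#   assert_tracked . \
#     infra/test-validation/scripts/bench.sh \
#     infra/test-validation/lambda/build-package.sh \
#     infra/test-validation/k8s/build-and-push.sh \
#     infra/test-validation/Dockerfile.flow-worker
```

This only proves the files are tracked in the local checkout. Since the EC2 instance clones origin/main, the check is most meaningful when run on a clean checkout of main after the PR has merged.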
Priority ranking
Land all of these in a single infra PR before the next campaign:
- F.16 (commit the whole infra/test-validation/ directory): without this, 5 of 7 phases produce no real data
- F.5 (stage ordering): silently lost the AVX bench data
- F.6 (per-SF BENCHMARKS): lost the SF=10 results
- F.2a (drop CI-overlapping stages 01 + 08): saves ~50 min, removes ambiguity
- F.1 (prebuilt target via GHA): biggest time/money saver
- F.4 (shared test-binary compile): only matters if we keep cargo test stages
- F.2 (drop or de-LTO stage 01): subsumed by F.2a if we just drop it
- F.3 (feature-set consolidation): saves 10 min, harder to verify
- F.7 (incremental Q write): defensive; safety net for crashes
- F.8-F.15 (the already-applied-locally fixes): bring to main
PR title: infra(aws-campaign): bench.sh + userdata.sh fixes from 2026-05-17 run