Skip to content

ematix-flow

Move data between databases, files, and streams from Python. 5.87× faster than PySpark. No JVM needed.

ematix-flow is a Python library for moving data between databases (Postgres, MySQL, SQLite, DuckDB), files (Parquet, CSV, JSON, ORC, Delta Lake — local disk or S3), and streaming sources (Kafka, Pub/Sub, Kinesis, RabbitMQ). Declare a target table and a load strategy — append, merge, or slowly-changing dimension — and the Rust + Apache Arrow engine handles correctness: schema evolution, watermarks, at-least-once delivery, change-data-capture.

The 5.87× headline is geomean across all 22 TPC-H queries at SF=1 — no cluster, no scheduler, one pip install.

Where to start

  • User guide — step-by-step for every surface.
  • Roadmap — what's still on the todo list, by priority.
  • GitHub repo — source, issue tracker, discussions.

What it is

Three complementary surfaces in one repo:

  1. Declarative table management for Postgres — the original v0.1 scope. Decorator-driven schemas, normalization markers, SCD2 with event-time, run history, watermarks, post-load transforms, polars/pandas/pyspark interop, ML feature store. Stable.

  2. Multi-backend streaming pipelines — Phases 30–38. Source from any of 4 streaming backends, write to any of 6 storage backends, manual offset commits, dead-letter patterns, Confluent Schema Registry support, long-running flow consume daemon binary. Stable.

  3. Stream processing — Phase 39. DataFusion-backed mid-stream SQL transforms, tumbling / hopping / session windows with watermark-driven emit + late-data handling, keyed time-windowed stream-stream inner joins backed by a durable Postgres StateStore for crash-recoverable state + atomic offset commits. Recently shipped.

All three share a common Rust core that does Arrow record-batch IO under the hood.

Install

pip install ematix-flow
# Or: pip install "ematix-flow[df]"     for polars/pandas helpers
# Or: pip install "ematix-flow[spark]"  for pyspark helpers
pip install pyarrow                    # required for streaming pyclasses

Tests

  • 428 Rust core lib unit tests
  • 81 Rust CLI lib unit tests
  • ~80 Rust testcontainers integration tests (--ignored; opt-in Docker)
  • ~313 default Python tests + ~196 testcontainers-gated Python tests

clippy + fmt clean on stable Rust.

Phase status

Phase What Status
0–14 v0.1 declarative Postgres
15–20 ML feature store
21–25 Decorator API
26–28 Normalization + transforms
30–33 Multi-backend DBs
34 Object stores
35 Delta Lake
36–37 Streaming sources/targets
38 flow consume CLI
Py.1–Py.6 Python streaming bindings
39.1–39.3 SQL transforms + lookups
39.4 Tumbling + hopping windows
39.5a Session windows + StateStore
39.5b Stream-stream join
Iceberg Iceberg target backend ⏸ deferred (arrow ABI)
Unified API Single declaration surface ⏸ design only

See Roadmap for remaining work.

License

Apache-2.0