GizmoEdge Takes on the 1 Trillion Row Challenge

What happens when you give a distributed SQL engine a trillion-row dataset? You find out what it's really made of.

Last week, we put GizmoEdge—our distributed, IoT-ready data engine—to the test by running the Coiled 1 Trillion Row Challenge on Azure. The goal: process and summarize one trillion records from the measurements dataset as fast as possible.

Infrastructure Setup

We deployed a 1,000-worker GizmoEdge cluster, each worker powered by DuckDB and orchestrated through Kubernetes. Our cluster ran on Azure Standard E64pds v6 nodes, each providing 64 vCPUs and 504 GiB of RAM.

Each GizmoEdge worker pod was provisioned with 3.8 vCPUs (3800m) and 30 GiB of RAM, allowing roughly 16 workers per node, so the test required about 63 nodes in total.
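The node count follows from simple packing arithmetic. The sketch below reproduces it from the figures quoted above; it assumes each node's full vCPU and RAM capacity is available to worker pods, which ignores whatever Kubernetes reserves for system components.

```python
import math

# Figures quoted in the post.
NODE_VCPUS = 64
NODE_RAM_GIB = 504
WORKER_VCPUS = 3.8      # 3800m in Kubernetes CPU units
WORKER_RAM_GIB = 30
TOTAL_WORKERS = 1000

# A node hosts as many whole workers as its scarcer resource allows.
workers_by_cpu = math.floor(NODE_VCPUS / WORKER_VCPUS)
workers_by_ram = math.floor(NODE_RAM_GIB / WORKER_RAM_GIB)
workers_per_node = min(workers_by_cpu, workers_by_ram)

# Round up, because a partially filled node is still a whole node.
nodes_needed = math.ceil(TOTAL_WORKERS / workers_per_node)

print(workers_per_node, nodes_needed)  # 16 63
```

Both CPU and memory allow 16 workers per node here, so neither resource is left heavily stranded.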

Performance Results

Baseline Query

SELECT COUNT(*) FROM measurements;

Execution time: < 0.5 seconds
Rows counted: 1,000,000,000,000

Aggregation Challenge Query

SELECT station, MIN(measure), MAX(measure), AVG(measure)
FROM measurements
GROUP BY station
ORDER BY station;

Execution time: < 5 seconds
Result set: 412 rows

Each of the 412 result rows aggregates roughly 2.4 billion input rows (1 trillion / 412), and GizmoEdge completed the query across all 1,000 workers in under 5 seconds.


How GizmoEdge Works

GizmoEdge's architecture is designed for massive scale, high performance, and secure execution.

SQL Parsing & Planning

The GizmoEdge Server receives a SQL query from the client, parses it, and generates two statements:

A worker SQL to execute on each distributed node
A combinatorial SQL to run server-side for final aggregation
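To make the two-statement split concrete, here is a minimal sketch of how the challenge query could be decomposed. The two SQL strings, the `partials` table, and the sample rows are illustrative assumptions, not GizmoEdge's actual generated SQL, and the sketch uses Python's built-in sqlite3 in place of DuckDB so that it runs without extra dependencies. The key point is that AVG cannot be combined directly: each worker returns SUM and COUNT, and the server divides the totals.

```python
import sqlite3

# Hypothetical worker SQL: each worker returns partial aggregates per station.
# AVG is decomposed into SUM and COUNT because an average of per-shard
# averages is wrong when shards hold different row counts.
WORKER_SQL = """
SELECT station,
       MIN(measure) AS min_m,
       MAX(measure) AS max_m,
       SUM(measure) AS sum_m,
       COUNT(*)     AS cnt
FROM measurements
GROUP BY station
"""

# Hypothetical server-side SQL: combine the partial rows from all workers.
COMBINE_SQL = """
SELECT station,
       MIN(min_m),
       MAX(max_m),
       SUM(sum_m) / SUM(cnt) AS avg_m
FROM partials
GROUP BY station
ORDER BY station
"""

def run_worker(rows):
    """Run the worker SQL over one shard held in an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE measurements (station TEXT, measure REAL)")
    db.executemany("INSERT INTO measurements VALUES (?, ?)", rows)
    return db.execute(WORKER_SQL).fetchall()

def combine(partials):
    """Run the combining SQL over the partial rows from every worker."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE partials "
        "(station TEXT, min_m REAL, max_m REAL, sum_m REAL, cnt INTEGER)"
    )
    db.executemany("INSERT INTO partials VALUES (?, ?, ?, ?, ?)", partials)
    return db.execute(COMBINE_SQL).fetchall()

# Made-up sample shards of unequal size.
shard_a = [("Oslo", 1.0), ("Oslo", 3.0), ("Oslo", 5.0), ("Lima", 20.0)]
shard_b = [("Oslo", 11.0), ("Lima", 22.0)]

result = combine(run_worker(shard_a) + run_worker(shard_b))
print(result)  # [('Lima', 20.0, 22.0, 21.0), ('Oslo', 1.0, 11.0, 5.0)]
```

Averaging the two per-shard Oslo averages (3.0 and 11.0) would give 7.0, while the true mean over all four Oslo rows is 5.0, which is why the worker statement carries SUM and COUNT instead of AVG.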

Shard Distribution

Each worker requests a data shard from the server. The server responds with:

A SHA-256 hash of the shard file (to verify download integrity)
A token-based authentication handshake that ensures only authorized workers can participate

Workers then download and decompress their shards and materialize them as local DuckDB databases built from the Parquet files.
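The integrity check on the worker side can be sketched as follows. This is an illustration only: the function names and the chunk size are assumptions, and the demo hashes a small temporary file in place of a real shard.

```python
import hashlib
import os
import tempfile

def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so a large shard never sits fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_shard(path, expected_hex):
    """Return True only if the downloaded file matches the server-issued hash."""
    return sha256_of_file(path) == expected_hex.lower()

# Demo with a stand-in "shard" written to a temporary file.
payload = b"pretend shard bytes"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    shard_path = tmp.name

issued_hash = hashlib.sha256(payload).hexdigest()   # what the server would send
ok = verify_shard(shard_path, issued_hash)
tampered = verify_shard(shard_path, "0" * 64)       # a hash that cannot match
os.remove(shard_path)

print(ok, tampered)  # True False
```

A worker would run this check before decompressing the shard, so a corrupted or substituted download is rejected early.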

Secure "Trust But Verify" Model

All worker-server communication runs over TLS-encrypted WebSockets, ensuring confidentiality and authenticity. Each worker:

Authenticates with a signed token validated by the server
Verifies the shard's SHA-256 hash upon download to ensure it matches what the server issued
Computes its own SHA-256 hash of the shard and returns it to the server
The server then compares the two hashes and trusts the worker for subsequent query processing only if they match
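The server's side of this exchange might look like the sketch below. Everything here is hypothetical: the HMAC-based token scheme, the `TrustRegistry` class, and the hard-coded secret are stand-ins chosen to keep the example self-contained, and the post does not say how GizmoEdge signs tokens. The sketch also assumes both sides hash the shard with the same algorithm (SHA-256 here), since otherwise the comparison could never succeed.

```python
import hashlib
import hmac

SERVER_SECRET = b"demo-secret"  # hypothetical; real keys would not live in source code

def sign_token(worker_id):
    """Issue a token by signing the worker ID with the server's secret."""
    return hmac.new(SERVER_SECRET, worker_id.encode(), hashlib.sha256).hexdigest()

def token_is_valid(worker_id, token):
    """Recompute the signature and compare in constant time."""
    return hmac.compare_digest(sign_token(worker_id), token)

class TrustRegistry:
    """Tracks which hash was issued to each worker and which workers are trusted."""

    def __init__(self):
        self.issued = {}      # worker_id -> hex digest the server issued
        self.trusted = set()  # worker_ids that echoed back a matching digest

    def issue_shard(self, worker_id, token, shard_bytes):
        if not token_is_valid(worker_id, token):
            raise PermissionError("invalid token")
        digest = hashlib.sha256(shard_bytes).hexdigest()
        self.issued[worker_id] = digest
        return digest

    def report_hash(self, worker_id, reported_hex):
        """Trust the worker only if its reported digest matches the issued one."""
        expected = self.issued.get(worker_id)
        if expected is not None and hmac.compare_digest(expected, reported_hex):
            self.trusted.add(worker_id)
            return True
        return False

registry = TrustRegistry()
shard = b"pretend shard bytes"
registry.issue_shard("worker-1", sign_token("worker-1"), shard)

good = registry.report_hash("worker-1", hashlib.sha256(shard).hexdigest())
bad = registry.report_hash("worker-2", "0" * 64)  # never issued a shard
print(good, bad, sorted(registry.trusted))  # True False ['worker-1']
```

Using `hmac.compare_digest` rather than `==` avoids leaking how many leading characters matched through timing differences.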

Parallel Execution & Aggregation

Once trusted, each worker executes its worker SQL locally through DuckDB and streams the intermediate results back to the server as Arrow IPC datasets over secure WebSockets. The server merges the incoming results in parallel and runs the combinatorial SQL over them to produce the final result, often in seconds.

Heterogeneous Compute: From Cloud to Edge

GizmoEdge isn't limited to Azure VMs. It's designed for heterogeneous computing—running workers across IoT devices, laptops, mobile phones, or cloud clusters simultaneously.

See GizmoEdge distributing queries across AWS, Azure, GCP, and edge devices like iPhones and Kubernetes pods: https://www.youtube.com/watch?v=gIgFKniKAdk

Challenge Details

Want to learn more about the 1 Trillion Row Challenge? You can find full details, including how to access the publicly available dataset, at the official challenge repository: https://github.com/coiled/1trc

The challenge provides a comprehensive benchmark for testing distributed data processing systems at scale, making it an excellent way to evaluate real-world performance capabilities.

GizmoSQL Also Took the Challenge

GizmoEdge isn't the only GizmoData product that tackled the 1 Trillion Row Challenge. GizmoSQL, our single-node DuckDB Arrow Flight SQL server, also completed the challenge with impressive results.

Using a single AWS Graviton 4 instance, GizmoSQL processed the trillion-row dataset in just over 2 minutes. Read about GizmoSQL's approach and results to see how single-node performance compares to distributed execution.

What's Next

GizmoEdge is still pre-production, and we're inviting design partners who want to push the boundaries of distributed analytics.

If your organization works with multi-terabyte or even petabyte-scale data—and wants to see how GizmoEdge can execute your queries in seconds—reach out.

GizmoEdge — The distributed SQL engine for the modern data frontier.