We're actively looking into the results of the benchmark and hope to share some of our findings soon. From the initial results, we found a lot of potential optimizations we could make to our deltalake reader to improve parallelism, and to our groupby operator to improve pipelining for count aggregations. We're hoping to roll out these improvements over the next couple of releases.
If you're interested in learning more about our findings, check out our GitHub (https://github.com/Eventual-Inc/Daft) or follow us on Twitter (https://x.com/daftengine) and LinkedIn (https://www.linkedin.com/showcase/daftengine) for updates. And if Daft sounds interesting to you, give us a try via pip install daft!
I found that ClickHouse was the fastest, but DuckDB was the simplest to work with; it usually just works. DuckDB was close enough to ClickHouse's peak performance.
I tried Flink & PySpark but they were way slower (like 3-5x) than ClickHouse and the code was kind of annoying. Dask and Ray were also way too slow; Dask's parallelism was easy to code, but the performance just wasn't there. I also tried DataFusion and Polars, but ClickHouse ended up being faster.
These days I would recommend starting with DuckDB or ClickHouse for most workloads, just because they're the easiest to work with AND have good performance. Personally I switched to using DuckDB instead of Polars for most things where pandas is too slow.
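To make that last point concrete, here's a minimal sketch of DuckDB querying a pandas DataFrame in place; the DataFrame contents and column names are invented for illustration:

    # Minimal sketch: DuckDB querying a local pandas DataFrame directly.
    # The DataFrame contents and column names are made up for illustration.
    import duckdb
    import pandas as pd

    df = pd.DataFrame({"name": ["a", "b", "a"], "score": [1, 2, 3]})

    # DuckDB resolves "df" by name from the calling scope (replacement scan),
    # so no explicit registration step is needed.
    out = duckdb.sql("SELECT name, SUM(score) AS total FROM df GROUP BY name").df()
    print(out)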
Here's an oldie on the topic: https://adamdrake.com/command-line-tools-can-be-235x-faster-...
Go try doing an aggregation over 650GB of JSON data using normal CLI tools vs DuckDB or ClickHouse. These tools are pipelining and parallelizing in a way that isn't easy to replicate with just GNU Parallel (trust me, I've tried).
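For comparison, the DuckDB side of that is roughly a one-liner; a sketch, where the file glob and the "ts" timestamp field are hypothetical:

    # Sketch: daily counts over a pile of newline-delimited JSON files in DuckDB.
    # The file glob and the "ts" timestamp field are hypothetical.
    import duckdb

    daily = duckdb.sql("""
        SELECT CAST(ts AS DATE) AS day, COUNT(*) AS n
        FROM read_json_auto('data/*.json')
        GROUP BY day
        ORDER BY day
    """).df()
    print(daily.head())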
working memory requirements
1. Assume date is 8 bytes
2. Assume 64bit counters
So for each date in the dataset we need 16 bytes to accumulate the result. That's roughly 180 years' worth of daily post counts per MB of RAM, and the dataset in the post covered just one year.
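Back-of-the-envelope check of that arithmetic, using the assumptions listed above:

    # 8-byte date + 8-byte (64-bit) counter per distinct date.
    bytes_per_entry = 8 + 8                          # 16 bytes
    entries_per_mb = (1 << 20) // bytes_per_entry    # 65,536 dates per MB
    years_per_mb = entries_per_mb / 365              # ~180 years of daily counts
    print(entries_per_mb, round(years_per_mb))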
This problem should be mostly network limited in the OP's context; decompressing snappy-compressed parquet should run at circa 1 GB/sec. The "work" of parsing a string to a date and accumulating isn't expensive compared to the snappy decompression.
I don't have a handle on the 33% longer runtime difference between duckdb and polars here.
Does this article demonstrate that, though? I get, and agree, that a lot of people are using "big data" tools for datasets that are way too small to require them. But this article consists of exactly one very simple aggregation query. And even then it takes 16 minutes to run (in the best case). As others have mentioned, the long execution time is almost certainly dominated by IO because of limited network bandwidth, but network bandwidth is one of the resources you get more of in a distributed computing environment.
But my bigger issue is just that real analytical queries are often quite a bit more complicated than a simple count by timestamp. As soon as you start adding non-trivial compute to the query, or multiple joins (and g*d forbid you have a nested-loop join in there somewhere), or sorting, then the single-node execution time is going to explode.
I don't use delta or iceberg (because I haven't needed to; I'm describing what I do, not what you can do :)), but rather just iterate over the underlying parquet files using filename listing or wildcarding. I often run queries on BigQuery and suck down the results to a bunch of ~1GB local parquet files - way bigger than RAM - that I can then mine in duckdb using wildcarding. Works great!
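For anyone curious, the wildcard pattern is basically a one-liner; a sketch, with the path and column names made up:

    # Sketch of mining local parquet dumps via wildcarding in DuckDB.
    # The path and column names are hypothetical.
    import duckdb

    daily = duckdb.sql("""
        SELECT CAST(event_ts AS DATE) AS day, COUNT(*) AS posts
        FROM read_parquet('exports/results_*.parquet')
        GROUP BY day
        ORDER BY day
    """).df()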
I'm in a world where I get into the weeds of "this kind of aggregation works much faster on BigQuery than DuckDB, or vice versa, so I'll split my job into this part of the SQL running on BigQuery, then feeding into this part running in DuckDB". It's the fun end of data engineering.
1. The engines tested do column pruning, so the data you actually access would have been 2 columns + metadata from the parquet files, which probably fit in memory even without streaming.
2. Most of the processing time would be IO bound on S3, and the access patterns / simultaneous connection limits etc. would have more of an impact than any processing code.
Love that you went through the pain of trying the different systems, but I'd like to see an actual larger-than-memory query.
Even Trino runs circles around Spark, with some heavier jobs simply not completing in Spark at all (total data size up to a single PB, with about 10TB of RAM available for compute), and Trino isn't known for extreme performance. StarRocks is noticeably faster still, so I wouldn't write off distributed compute just yet, at least for some applications.
And even then, performance isn't the most important criterion when choosing an analytics tool; more probably depends on the integrations, access control, security, ease of extensibility, maintenance, scaling, and support from existing tooling. Boring enterprise stuff, sure, but for those older frameworks it's all either readily available or can be quickly added with little experience (writing a Java plugin for Trino is as easy as it gets).
With DuckDB or Polars (if used as the basis for a data lake/lakehouse etc.), it may degrade into an entire team of engineers wasting resources on implementing the tooling around the tooling, instead of providing something actually useful for the business.
They used a c5.4xlarge, which has a peak of 10Gbps of network bandwidth; at constant 100% saturation it would take in the ballpark of 9 minutes just to load those 650GB from S3, making those 9 minutes your best-case scenario for pulling the data (without even considering writing it back!).
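Rough math behind that ballpark (decimal units, ignoring protocol overhead):

    # Lower bound on transfer time at a sustained 10 Gbps line rate.
    dataset_gb = 650                      # GB to pull from S3
    nic_gbps = 10                         # c5.4xlarge peak network bandwidth
    seconds = dataset_gb * 8 / nic_gbps   # 520 seconds
    print(seconds / 60)                   # ~8.7 minutes, i.e. "ballpark of 9 minutes"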
Minute differences in how these query engines schedule IO would have drastic effects in the benchmark outcomes, and I doubt the query engine itself was constantly fed during this workload, especially when evaluating DuckDB and Polars.
The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.
The gist was - find your resource limits and saturate them and see what the best possible performance could be, then measure your system, and you can express it as a percentage of optimal. Or if you can't directly test/saturate your limits at least be aware of them.
S3 is an amazingly engineered product, operates at truly impressive scale, is quite reasonably priced if you think of it as warm-to-very-cold storage with excellent durability properties, and has performance that barely holds a candle to any decent modern local storage device.
It really is shocking how much you're paying given how little you get. I certainly don't want to run a data center and handle all the scaling and complexity of such an endeavour. But wow, the tax you pay to have someone manage all that is staggering.
cue Peppa Pig laughter sounds
Much easier obviously dealing with a dataset that doesn't change, but doing the same in the cloud would just be throwing money away.
You want to use Spark for a large dataset with multiple stages. In this case, their I/O bandwidth is 1GB/s from S3, while CPU memory bandwidth is 100-200GB/s, which is what matters for a multi-stage job. Spark is a way to pool memory across a cluster for that kind of dataset, and to use cluster-internal network bandwidth for shuffling instead of going back to storage.
Maybe when you have S3 as your backend, the storage bandwidth bottleneck doesn't show up in perf, but it sure does show up in the bill. A crude rule of thumb: network bandwidth is 20X storage, main memory bandwidth is 20X network bandwidth, accelerator/GPU memory is 10X CPU. It's great that single-node DuckDB/Polars are that good, but this is like racing a taxiing aircraft against motorbikes.
The query being tested wouldn't scan the full files; in most sane engines the query would be processing much less than 650GB of data (exploiting S3 byte-range reads), i.e. just one column, a timestamp, which is also correlated with the partition keys. Nowadays what I would mostly be worried about is the distribution of file sizes, due to API calls + skew, or whether the query is totally different from the common access patterns and skips the metadata/columnar nature of the underlying parquet (i.e. doing an effective "full scan" over all row groups and/or columns).
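As a sketch of what that column pruning looks like in practice (bucket, prefix, and column name are invented here), reading a single column via pyarrow only fetches that column's byte ranges:

    # Sketch: only the requested column's byte ranges get fetched from S3.
    # Bucket, prefix, and column name are made up; assumes AWS credentials
    # are configured in the environment.
    import pyarrow.dataset as ds

    dataset = ds.dataset("s3://my-bucket/comments/", format="parquet")
    table = dataset.to_table(columns=["created_ts"])  # one column, not the full 650GB
    print(table.num_rows)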
> The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.
That's absolutely right.
BUT the author did say this is the simple stupid naive take, in which case DuckDB and Polars really shined.
In the c6n and m6n and maybe the upper-end 5th gens you can get 100Gbps NICs, and if you look at the 8th gen instances like the c8gn family, you can even get instances with 600Gbps of bandwidth.
But I concur with what you say. It is also very cheap in both maintenance and running cost. It is just an amazing tool and you pay (RIP) pennies.
It seems like these single-node libraries can process a terabyte on a typical machine, and you'd need to have over 10TB before moving to Spark.
I'm surprised by how often people jump to Spark because "it's (highly) parallelizable!" and "you can throw more nodes at it easy-peasy!" And yet, there are so many cases where you can just do things with better tools.
Like the time a junior engineer asked for help processing 100s of ~5GB files of JSON data which turned out to be doing crazy amounts of string concatenation in Python (don't ask). It was taking something like 18 hours to run, IIRC, and writing a simple console tool to do the heavy lifting and letting Python's multiprocessing tackle it dropped the time to like 35 minutes.
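The fix was roughly this shape: fan the files out across worker processes and let each one hand a file to the console tool. "mytool" below is a placeholder for whatever does the actual heavy lifting:

    # Rough sketch of the pattern: parallelize per-file work across processes.
    # "mytool" is a hypothetical external command standing in for the real tool.
    import glob
    import subprocess
    from multiprocessing import Pool

    def process_file(path):
        # One external invocation per file; replace "mytool" with the real command.
        result = subprocess.run(["mytool", path], capture_output=True, text=True, check=True)
        return path, result.stdout

    if __name__ == "__main__":
        files = glob.glob("data/*.json")
        with Pool() as pool:
            for path, out in pool.imap_unordered(process_file, files):
                print(path, out.strip())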
Right tool for the right job, people.
That's also why ducklake performs better than others.
For many use cases this trade-off is worth it.
I think a lot of engineers expect 100 computers to be faster than 1, because of the size comparison. But we're really looking at a process here, and a process shifting data between machines will almost always have to do more stuff, and therefore be slower.
Where Spark/Daft are needed is if you have 1TB of data or something crazy where a single machine isn't viable. If I'm honest though, I've seen a lot of occasions where someone thinks they have that situation, and none so far where they actually do.
Architects want to build big impressive systems that justify their position, and managers want that too because success is judged by the size of systems and the number of staff under management, not efficiency; it's all about perverse incentives.
This is just a tax the scientists trying to use whatever the company settles on have to pay every time they wait for queries to run.
These days scientists can just suck down a copy of a bunch of data to their laptop or a cheap cloud VM and do their crunching 'locally' there. The company data swamp is just something they have to interface with occasionally.
Of course things go pear-shaped if they get detected, so don't tell anyone :D
32GB of parquet cannot fit in 32GB of RAM; parquet is compressed and encoded on disk, so it expands considerably once decoded.
Well yeah, I would have picked Polars as well. To be fair, I didn't know about some of these.
Beyond Spark (one shouldn't really be using vanilla Spark anyway; see Apache Comet or Databricks Photon), distributing my compute makes sense because if a job takes an hour to run (ignoring overnight jobs), there will be a bunch of people waiting for that data for an hour.
If I run a 6 node cluster that makes the data available in 10 minutes, then I save in waiting time. And if I have 10 of those jobs that need to run at the same time, then I need a burst of compute to handle that.
That 6 node cluster might not make sense on-prem unless I can use the compute for something else, which is where PAYG on some cloud vendor makes sense.
It's a bit of a tricky comparison because Snowflake, and a lot of other tools that get referred to as the "modern data stack", are very vendor based. If you're using Snowflake, you're probably using it on Snowflake-provided architecture with a whole load of proprietary stuff. You can't "swap in" Snowflake on the same hardware like you can with Spark, Daft, DuckDB, Polars etc.
That said, iirc benchmarks normally place it very similar to spark. It's distributed, so I'd be very surprised if it wasn't in the spark/daft ballpark rather than polars/duckdb.
The Delta format is Databricks' lakehouse table format; Snowflake uses Iceberg, I believe.
Both Snowflake and Databricks also provide a ton of other features like ML, orchestration, and governance. MotherDuck would be the direct competitor here.
That said, there are now extensions to query Snowflake or Databricks data from DuckDB for simple ad hoc querying.
DuckDB is fantastic and has saved me so many times; strongly recommended.
it has had a combined SQL and dataframe engine since March 2015...
How we work with data is simple: if SQL + a dashboard solves the problem, then we do it in Snowflake; if we need something more advanced, then code + a bunch of SQL.
Pretty sure ML engineers work in different ways, but I don't know that side well.
But the worst thing is that these are not even real screenshots; the author pasted the text into some terminal-window screenshot generator tool.
Motherduck have a few posts about how few people have “big data”. https://motherduck.com/blog/redshift-files-hunt-for-big-data...