650GB of Data (Delta Lake on S3). Polars vs. DuckDB vs. Daft vs. Spark
209 points
16 hours ago
| 26 comments
| dataengineeringcentral.substack.com
| HN
srilman
8 hours ago
[-]
Hey everyone, I'm a software engineer at Eventual, the team behind Daft! Huge thanks to the OP for the benchmark; we're huge fans of your blog posts and this gave us some really useful insights. For context, Daft is a high-performance data processing engine for AI workloads that works on both single-node and distributed setups.

We're actively looking into the results of the benchmark and hope to share some of our findings soon. From initial results, we found a lot of potential optimizations we could make to our Delta Lake reader to improve parallelism, and to our groupby operator to improve pipelining for count aggregations. We're hoping to roll out these improvements over the next couple of releases.

If you're interested in learning more about our findings, check out our GitHub (https://github.com/Eventual-Inc/Daft) or follow us on Twitter (https://x.com/daftengine) and LinkedIn (https://www.linkedin.com/showcase/daftengine) for updates. And if Daft sounds interesting to you, give it a try via pip install daft!
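
For anyone who wants to kick the tires, here's a minimal sketch of the kind of query in the benchmark using the Delta Lake reader (the s3://bucket/events path and event_date column are placeholders):

    import daft

    # Read the Delta table straight from object storage (placeholder path)
    df = daft.read_deltalake("s3://bucket/events")

    # Project down to the one column the query needs, then count rows
    df = df.select("event_date")
    print(df.count_rows())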

reply
faizshah
28 minutes ago
[-]
I had to do something like this for a few TB of JSON recently. The unique thing about this workload was that it was a ton of small 10-20MB files.

I found that ClickHouse was the fastest, but DuckDB was the simplest to work with; it usually just works. DuckDB was close enough to ClickHouse's maximum performance.

I tried Flink and PySpark, but they were way slower (like 3-5x) than ClickHouse and the code was kind of annoying. Dask and Ray were also way too slow; Dask's parallelism was easy to code, but it was just too slow. I also tried DataFusion and Polars, but ClickHouse ended up being faster.

These days I would recommend starting with DuckDB or ClickHouse for most workloads, just because they're the easiest to work with AND have good performance. Personally I switched to using DuckDB instead of Polars for most things where pandas is too slow.
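
For reference, the DuckDB route over a pile of small JSON files is roughly this (a rough sketch; the data/*.json glob and field names are made up):

    import duckdb

    # read_json_auto infers the schema and parallelizes across the matched files
    duckdb.sql("""
        SELECT event_type, count(*) AS n
        FROM read_json_auto('data/*.json')
        GROUP BY event_type
        ORDER BY n DESC
    """).show()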

reply
throwaway-aws9
8 hours ago
[-]
650GB? Your data is small; it fits on my phone. Dump the hyped tooling and just use GNU tools.

Here's an oldie on the topic: https://adamdrake.com/command-line-tools-can-be-235x-faster-...

reply
faizshah
19 minutes ago
[-]
This isn't true anymore; we are way beyond 2014 Hadoop (which is what that blog post is about) at this point.

Go try doing an aggregation of 650GB of JSON data using normal CLI tools vs DuckDB or ClickHouse. These tools are pipelining and parallelizing in a way that isn't easy to do with just GNU Parallel (trust me, I've tried).

reply
Demiurge
8 hours ago
[-]
What if it was 650TB? This article is obviously a microbenchmark. I work with much larger datasets, and neither awk nor DuckDB would make a difference to the overall architecture. You need a data catalog, and you need clusters of jobs at scale, regardless of the data format library, or libraries.
reply
CraigJPerry
5 hours ago
[-]
At 650TB it's not a memory-bound problem:

working memory requirements

    1. Assume date is 8 bytes
    2. Assume 64bit counters
So for each date in the dataset we need 16 bytes to accumulate the result.

That's roughly 180 years' worth of daily post counts per MB of RAM, and the dataset in the post covered just one year.
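
Quick back-of-the-envelope check of that figure (same assumptions as above):

    bytes_per_day = 8 + 8              # date key + 64-bit counter
    days_per_mb = (1 << 20) / bytes_per_day
    print(days_per_mb / 365)           # ~180 years of daily counts per MB of RAM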

This problem should be mostly network limited in the OP's context; decompressing Snappy-compressed Parquet should be circa 1 GB/s. The "work" of parsing a string to a date and accumulating isn't expensive compared to Snappy decompression.

I don't have a handle on the 33% runtime difference between DuckDB and Polars here, though.

reply
adammarples
45 minutes ago
[-]
I think the entire point of the article (reading forward a bit through the linked Redshift files posts) is that almost nobody in the world uses datasets bigger than 100TB, that when they do, they use a small subset anyway, and that 650GB is a pretty reasonable approximation of the entire dataset most companies are even working with. Certainly in my experience as a data engineer, they're not often in the many terabytes. It's good to know that OOTB DuckDB can replace Snowflake et al. in these situations, especially given how expensive they are.
reply
thinkharderdev
12 minutes ago
[-]
> It's good to know that OOTB DuckDB can replace Snowflake et al. in these situations, especially given how expensive they are.

Does this article demonstrate that, though? I get, and agree, that a lot of people are using "big data" tools for datasets that are way too small to require them. But this article consists of exactly one very simple aggregation query. And even then it takes 16 minutes to run (in the best case). As others have mentioned, the long execution time is almost certainly dominated by I/O because of limited network bandwidth, but network bandwidth is one of the resources you get more of in a distributed computing environment.

But my bigger issue is just that real analytical queries are often quite a bit more complicated than a simple count by timestamp. As soon as you start adding non-trivial compute to the query, or multiple joins (and g*d forbid you have a nested-loop join in there somewhere), or sorting, then the single-node execution time is going to explode.

reply
jgalt212
2 hours ago
[-]
"I've forgotten how to count that low"

https://www.youtube.com/watch?v=3t6L-FlfeaI

reply
willvarfar
7 hours ago
[-]
I often crunch 'biggish data' on a single node using duckdb (because I love using the modern style of painless and efficient SQL engines).

I don't use delta or iceberg (because I haven't needed to; I'm describing what I do, not what you can do :)), but rather just iterate over the underlying parquet files using filename listing or wildcarding. I often run queries on BigQuery and suck down the results to a bunch of ~1GB local parquet files - way bigger than RAM - that I can then mine in duckdb using wildcarding. Works great!
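
That wildcard pattern is a one-liner in DuckDB. A sketch, with a made-up exports/ directory and column name:

    import duckdb

    duckdb.sql("""
        SELECT user_id, count(*) AS n
        FROM read_parquet('exports/*.parquet')
        GROUP BY user_id
        ORDER BY n DESC
        LIMIT 10
    """).show()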

I'm in a world where I get into the weeds of 'this kind of aggregation works much faster on Bigquery than duckdb, or vice versa, so I'll split my job into this part of sql running on Bigquery then feeding into this part running in duckdb'. It's the fun end of data engineering.

reply
miohtama
5 hours ago
[-]
650GB is something one could handle using a local filesystem, no need for complex tooling.
reply
capitol_
2 hours ago
[-]
650GB fits in RAM: https://yourdatafitsinram.net
reply
KeplerBoy
2 hours ago
[-]
What a pointless website. It would be nice if it at least showed appropriately sized cloud instances instead of just the same list over and over again.
reply
l_c_m
2 hours ago
[-]
This article misrepresents things on two fronts:

1. It tested column pruning: the data you actually access would have been 2 columns plus metadata for the Parquet files, so it probably fit in memory even without streaming (see the sketch below).

2. Most of the processing time would be I/O bound on S3, and the access patterns, simultaneous connection limits, etc. would have more of an impact than any processing code.

Love that you went through the pain of trying the different systems, but I'd like to see an actual larger-than-memory query.
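
To the first point, projection pushdown means a lazy engine only touches the columns the query actually needs. A sketch of that, with placeholder paths and column names:

    import polars as pl

    lf = pl.scan_parquet("s3://bucket/table/**/*.parquet")   # placeholder location
    result = (
        lf.select("event_date")                 # only this column is read from S3
          .group_by("event_date")
          .agg(pl.len().alias("n"))
          .collect()
    )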

reply
DiskoHexyl
2 hours ago
[-]
Hardly a surprise, given the nature of Spark and the benchmark prerequisites. Comparing a positively ancient distributed JVM-based compute framework running on a single node with modern native tools like DuckDB or Polars, and all that on a select from a single table: does it tell us something new?

Even Trino runs circles around Spark, with some heavier jobs simply not completing in Spark at all (total data size up to a single PB, with about 10TB of RAM available for compute), and Trino isn't known for its extreme performance. StarRocks is noticeably faster still, so I wouldn't write off distributed compute just yet, at least for some applications.

And even then, performance isn't the most important criterion when choosing an analytics tool; more probably depends on the integrations, access control, security, ease of extensibility, maintenance, scaling, and support by existing instruments. Boring enterprise stuff, sure, but for those older frameworks it's all either readily available or can be quickly added with little experience (writing a Java plugin for Trino is as easy as it gets).

With DuckDB or Polars (if used as the basis for a data lake/lakehouse, etc.), it may degrade into an entire team of engineers wasting resources on implementing the tooling around the tooling instead of providing something actually useful for the business.

reply
luizfelberti
13 hours ago
[-]
Honestly this benchmark feels completely dominated by the instance's NIC capacity.

They used a c5.4xlarge that has peak 10Gbps bandwidth, which at a constant 100% saturation would take in the ballpark of 9 minutes to load those 650GB from S3, making those 9 minutes your best case scenario for pulling the data (without even considering writing it back!)
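
Back-of-the-envelope for that 9-minute figure:

    data_gbit = 650 * 8           # 650 GB expressed in gigabits
    seconds = data_gbit / 10      # sustained 10 Gbps, zero protocol overhead
    print(seconds / 60)           # ~8.7 minutes just to move the bytes once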

Minute differences in how these query engines schedule IO would have drastic effects in the benchmark outcomes, and I doubt the query engine itself was constantly fed during this workload, especially when evaluating DuckDB and Polars.

The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.

reply
Scubabear68
12 minutes ago
[-]
This is a really good observation, and matches something I had to learn painfully over 30 years ago. At a Wall Street bank, we were trying to really push the limits with some middleware, and my mentor at the time very quietly suggested "before you test your system's performance, understand the theoretical maximum of your setup first with no work".

The gist was - find your resource limits and saturate them and see what the best possible performance could be, then measure your system, and you can express it as a percentage of optimal. Or if you can't directly test/saturate your limits at least be aware of them.

reply
amluto
13 hours ago
[-]
It would be amusing to run this on a regular desktop computer or even a moderately nice laptop (with a fan - give it a chance!) and see how it does. 650GB will stream in quite quickly from any decent NVMe device, and those 8-16 cores might well be considerably faster than whatever cores the cloud machines are giving you.
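
Rough numbers, assuming a mid-range NVMe drive sustaining ~5 GB/s sequential reads:

    dataset_gb = 650
    nvme_gb_per_s = 5                  # assumed sustained read throughput
    print(dataset_gb / nvme_gb_per_s)  # ~130 seconds to stream the whole dataset locally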

S3 is an amazingly engineered product, operates at truly impressive scale, is quite reasonably priced if you think of it as warm-to-very-cold storage with excellent durability properties, and has performance that barely holds a candle to any decent modern local storage device.

reply
switchbak
12 hours ago
[-]
Absolutely. I recently reworked a bunch of tests and found my desktop to outcompete our (larger, custom) GitHub Actions runner by roughly 5x. And I expect this delta to increase a lot as you lean on the local I/O harder.

It really is shocking how much you're paying given how little you get. I certainly don't want to run a data center and handle all the scaling and complexity of such an endeavour. But wow, the tax you pay to have someone manage all that is staggering.

reply
tempest_
12 hours ago
[-]
Everyone wants a data lake when what they have is a data pond.
reply
baq
6 hours ago
[-]
I think you meant puddle.

cue Peppa Pig laughter sounds

reply
layoric
8 hours ago
[-]
Totally true. I have a trusty old (like 2016 era) X99 setup that I use for 1.2TB of time series data hosted in a timescaledb PostGIS database. I can fetch all the data I need quickly to crunch on another local machine, and max out my aging network gear to experiment with different model training scenarios. It cost me ~$500 to build the machine, and it stays off when I'm not using it.

Much easier obviously dealing with a dataset that doesn't change, but doing the same in the cloud would just be throwing money away.

reply
mrlongroots
10 hours ago
[-]
Yep I think the value of the experiment is not clear.

You want to use Spark for a large dataset with multiple stages. In this case, their I/O bandwidth from S3 is 1GB/s, while CPU memory bandwidth is 100-200GB/s, which is what matters for a multi-stage job. Spark is a way to pool memory for a large dataset across a cluster, and to use cluster-internal network bandwidth for shuffling instead of going back to storage.

Maybe when you have S3 as your backend, the storage bandwidth bottleneck doesn't show up in perf, but it sure does show up in the bill. A crude rule of thumb: network bandwidth is 20X storage, main memory bandwidth is 20X network bandwidth, accelerator/GPU memory is 10X CPU. It's great that single-node DuckDB/Polars are that good, but this is like racing a taxiing aircraft against motorbikes.

reply
justincormack
5 hours ago
[-]
Network bandwidth is not 20x storage any more. An SSD is around 10GB/s now, which is similar to 100Gb Ethernet.
reply
mrbungie
10 hours ago
[-]
> They used a c5.4xlarge that has peak 10Gbps bandwidth, which at a constant 100% saturation would take in the ballpark of 9 minutes to load those 650GB from S3, making those 9 minutes your best case scenario for pulling the data (without even considering writing it back!)

The query being tested wouldn't scan the full files; in most sane engines the query would be processing much less than 650GB of data (exploiting S3 byte-range reads), i.e. just one column: a timestamp, which is also correlated with the partition keys. Nowadays what I would mostly be worried about is the distribution of file sizes, due to API calls plus skew, or whether the query is totally different from the common query access patterns and skips the metadata/columnar nature of the underlying Parquet (i.e. doing an effective "full scan" over all row groups and/or columns).

> The irony of workloads like this is that it might be cheaper to pay for a gigantic instance to run the query and finish it quicker, than to pay for a cheaper instance taking several times longer.

That's absolutely right.

reply
kccqzy
12 hours ago
[-]
10Gbps only? At Google where this type of processing would automatically be distributed, machines had 400Gbps NICs, not to mention other innovations like better TCP congestion control algorithms. No wonder people are tired of distributed computing.
reply
otterley
7 hours ago
[-]
You can get a 600Gbps interface on an Amazon EC2 instance (c8gn.48xlarge), if you’re willing to pay for it.
reply
basilgohar
12 hours ago
[-]
"At Google" is doing all the heavy lifting in your comment here, with all due respect. There is but one Google but remain millions of us who are not "At Google".
reply
kccqzy
11 hours ago
[-]
I’m merely describing the infrastructure that at least partially led to the success of distributed data processing. Also 400Gbps NIC isn’t a Google exclusive. Other clouds and on-premise DCs could buy them from Broadcom or other vendors.
reply
degamad
11 hours ago
[-]
The infra might have a 400Gbps NIC, but if you're buying a small compute slice on that infra, you don't get all the capability.
reply
bushbaba
12 hours ago
[-]
I'm kind of surprised they didn't choose an EC2 instance with higher throughput. S3 can totally eke out 100s of Gbps with the right setup.

BUT the author did say this is the simple stupid naive take, in which case DuckDB and Polars really shined.

reply
dukodk
11 hours ago
[-]
c5 is such a bad instance type; m6a would be so much better and even cheaper. I would love to see this on an m8a.2xlarge (the 7th and 8th generations don't use SMT), which is even cheaper and has up to 15 Gbps.
reply
luizfelberti
10 hours ago
[-]
Actually for this kind of workload 15Gbps is still mediocre. What you actually want is the `n` variant of the instance types, which have higher NIC capacity.

In the c6n and m6n and maybe the upper-end 5th gens you can get 100Gbps NICs, and if you look at the 8th gen instances like the c8gn family, you can even get instances with 600Gbps of bandwidth.

reply
blmarket
13 hours ago
[-]
Presto (a.k.a. AWS Athena) might be a faster/better alternative? I would also like to see this if the 650GB of data were available locally.
reply
fifilura
6 hours ago
[-]
Presto has been renamed to Trino now.

But I concur with what you say. It is also very cheap in both maintenance and running cost. It is just an amazing tool and you pay (RIP) pennies.

reply
tbcj
1 hour ago
[-]
No, Presto (https://github.com/prestodb/presto) remains alive and well, just Trino gets more attention.
reply
esafak
14 hours ago
[-]
If I understand correctly, polars relies on delta-rs for Delta Lake support, and that is what does not support Deletion vectors: https://github.com/delta-io/delta-rs/issues/1094

It seems like these single-node libraries can process a terabyte on a typical machine, and you'd have to have over 10TB before moving to Spark.

reply
mynameisash
13 hours ago
[-]
> It seems like these single-node libraries can process a terabyte on a typical machine, and you'd have to have over 10TB before moving to Spark.

I'm surprised by how often people jump to Spark because "it's (highly) parallelizable!" and "you can throw more nodes at it easy-peasy!" And yet, there are so many cases where you can just do things with better tools.

Like the time a junior engineer asked for help processing 100s of ~5GB files of JSON data which turned out to be doing crazy amounts of string concatenation in Python (don't ask). It was taking something like 18 hours to run, IIRC, and writing a simple console tool to do the heavy lifting and letting Python's multiprocessing tackle it dropped the time to like 35 minutes.
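
That pattern is a few lines with the standard library. A minimal sketch (the data/*.json glob and per-line work are placeholders for whatever the real job did):

    from multiprocessing import Pool
    from pathlib import Path

    def process_file(path):
        # placeholder for the real per-file work; the point is one pass, no giant string concatenation
        with open(path) as f:
            return sum(1 for _ in f)

    if __name__ == "__main__":
        files = sorted(Path("data").glob("*.json"))
        with Pool() as pool:                      # one worker process per core by default
            counts = pool.map(process_file, files)
        print(sum(counts))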

Right tool for the right job, people.

reply
rgblambda
5 hours ago
[-]
I think Spark was the best tool out there when data engineering started taking off, and it just works (provided you don't have to deal with jar dependency hell) so there's not a huge incentive to move away from it.
reply
benrutter
2 hours ago
[-]
This is so true! Even a few years ago, these benchmarks would have been against pandas (instead of Polars and DuckDB) and would likely have looked very different.
reply
esafak
13 hours ago
[-]
I used PySpark some time ago, when it was introduced at my company, and I realized that it was slow when you used Python libraries in the UDFs rather than PySpark's own functions.
reply
rmnclmnt
33 minutes ago
[-]
Yes, using Python UDFs within Spark pipelines is a hog! That's because the entire Python context is serialized with cloudpickle and sent over the wire to the executor nodes! (It can represent a few GB of serialized data depending on the UDF and the driver process's Python context.)
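
A quick illustration of the gap (a sketch, assuming a running SparkSession and a DataFrame df with a numeric amount column):

    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    @F.udf(returnType=DoubleType())
    def add_tax(x):                  # rows get pickled out to Python workers and back
        return x * 1.2

    slow = df.withColumn("gross", add_tax("amount"))
    fast = df.withColumn("gross", F.col("amount") * 1.2)   # stays in the JVM, no serialization round trip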
reply
jdnier
13 hours ago
[-]
DuckDb has a new "DuckLake" catalog format that would be another candidate to test. https://ducklake.select/
reply
sukhavati
2 hours ago
[-]
For me the issue is that DuckLake's feature of flushing inlined data to Parquet is still in alpha. One of the main issues with Parquet is that when writing small batches you end up with a lot of Parquet files that are inefficient to work with using DuckDB. To solve this, DuckLake inlines these small writes into the DBMS you choose (e.g. Postgres), but for a while it couldn't write them back to Parquet. Last I checked this feature didn't yet exist, and now it seems to be in alpha, which is nice to see, but I'd like some better support before I consider switching some personal data projects over. https://ducklake.select/docs/stable/duckdb/advanced_features...
reply
garganzol
11 hours ago
[-]
The DuckLake format has an unresolved built-in chicken-and-egg conflict: it requires a SQL database to represent its catalog. But that is what some people are running away from when they choose the Parquet format in the first place. Parquet = easy, SQL = hard, and adding SQL to Parquet makes the resulting format hard. I would expect the catalog to be in Parquet format as well; then it becomes something self-bootstrapping and usable.
reply
datacynic
3 hours ago
[-]
DuckLake is more comparable to Iceberg and Delta than to raw parquet files. Iceberg requires a catalog layer too, a file system based one at its simplest. For DuckLake any RDBMS will do, including fs-based ones like DuckDB and SQLite. The difference is that DuckLake will use that database with all its ACID goodness for all metadata operations and there is no need to implement transactional semantics over a REST or object storage API.
reply
matt123456789
10 hours ago
[-]
It is not a chicken-and-egg problem; it is just a requirement to have an RDBMS available for systems like DuckLake and Hive to store their catalogs in. Metadata is relatively small and needs to provide ACID r/w => great RDBMS use case.
reply
dsp_person
9 hours ago
[-]
What about file-based catalogs with Iceberg? Found one that puts it in a single json file: https://github.com/boringdata/boring-catalog
reply
saxenaabhi
8 hours ago
[-]
Then concurrency suffers, since you have to take locks when you update files.

That's also why DuckLake performs better than the others.

For many use cases this trade-off is worth it.

reply
benrutter
5 hours ago
[-]
I love this article! But I think this insight shouldn't be surprising. Distribution always has overheads, so if you can do things on a single machine it will almost always be faster.

I think a lot of engineers expect 100 computers to be faster than 1, because of the size comparison. But we're really looking at a process here, and a process shifting data between machines will almost always have to do more stuff, and therefore be slower.

Where Spark/Daft are needed is if you have 1TB of data or something crazy where a single machine isn't viable. If I'm honest though, I've seen a lot of occasions where someone thinks they have that happening, and none so far where they actually do.

reply
pu_pe
6 hours ago
[-]
In places I have worked at that used Databricks, I feel they chose it for the same reasons big orgs use Microsoft: it comes out of a box and has a big company behind it. Technical benchmarks or even cost considerations would be a distant second.
reply
rmnclmnt
31 minutes ago
[-]
Until the product manager asks for the bill… then all of a sudden things get reconsidered.
reply
data_marsupial
2 hours ago
[-]
There are real advantages from having a managed data platform compared to managing everything yourself, especially if you have a large number of data teams that need to collaborate.
reply
hobs
1 hour ago
[-]
Yep, and Databricks will have you churning and changing everything in your stack every 18 months (if you want to keep up to date at all). It's not what I would choose as a data partner unless I was just picking what all the other kids at lunch were.
reply
zkmon
6 hours ago
[-]
There are other factors as well, that drive the decision makers to clusters and big-data tech, even when the benchmarks do not justify that. At the root, the reasons are organizational, not technical. Risk aversion seeks to avoid single point of failure, needs accountability, favors outsourcing to specialists etc. Performance alone is not going to beat all of that.
reply
willvarfar
6 hours ago
[-]
Often, at medium and large sized companies it's not 'risk aversion', it's resume padding.

Architects want to build big impressive systems that justify their position, and managers want that too, because success is judged by the size of systems and the number of staff under management, not efficiency; it's all about perverse incentives.

This is just a tax the scientists trying to use whatever the company settles on have to pay every time they wait for queries to run.

These days scientists can just suck down a copy of a bunch of data to their laptop or a cheap cloud VM and do their crunching 'locally' there. The company data swamp is just something they have to interface with occasionally.

Of course things go pear-shaped if they get detected, so don't tell anyone :D

reply
zkmon
6 hours ago
[-]
Quite true. There are hardly any technical justifications for this madness, other than seeking a bloat of work and team size at the expense of huge spend.
reply
co0lster
13 hours ago
[-]
650GB refers to the size of the Parquet files, which are compressed; in reality it's way more data.

32 GB of parquet cannot fit in 32GB of RAM
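
You can see the ratio directly in the Parquet footer metadata. A sketch with pyarrow (the file name is a placeholder):

    import pyarrow.parquet as pq

    md = pq.ParquetFile("part-00000.parquet").metadata
    col = md.row_group(0).column(0)
    # on-disk vs decoded bytes for one column chunk
    print(col.total_compressed_size, col.total_uncompressed_size)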

reply
m00x
8 hours ago
[-]
You don't need it to if you just need specific columns. This is the advantage of columnar storage.
reply
barrkel
12 hours ago
[-]
This would speed things up since it looks like the bottleneck here is I/O.
reply
mettamage
6 hours ago
[-]
> Truly, we have not been thinking outside the box with the Modern Lake House architecture. Just because Pandas failed us doesn’t mean distributed computing is our only option.

Well yeah, I would have picked Polars as well. To be fair, I didn't know about some of these.

reply
ayhanfuat
4 hours ago
[-]
What is the point of simulating 650GB data with ~40 columns if you are going to use a single column for testing? Is that even 16GB?
reply
KeplerBoy
2 hours ago
[-]
It's a strided array and slows down memory access.
reply
ayhanfuat
1 hour ago
[-]
It's a parquet file. Column data is stored in contiguous pages (and that's how duckdb and polars read them).
reply
KeplerBoy
1 hour ago
[-]
Okay, wasn't aware of that.
reply
nevi-me
7 hours ago
[-]
The main reason why clusters still make sense is because you'll have a bunch of people accessing subsets of much larger data regularly, or competing processes that need to have their output ready at around the same time. You distribute not only compute, but also I/O, which others are pointing out to likely dominate the runtime of the benchmarks.

Beyond Spark (one shouldn't really be using vanilla Spark anyway, see Apache Comet or Databricks Photon), distributing my compute makes sense because if a job takes an hour to run (ignoring overnight jobs), there will be a bunch of people waiting for that data for an hour.

If I run a 6 node cluster that makes the data available in 10 minutes, then I save in waiting time. And if I have 10 of those jobs that need to run at the same time, then I need a burst of compute to handle that.

That 6 node cluster might not make sense on-prem unless I can use the compute for something else, which is where PAYG on some cloud vendor makes sense.

reply
nikita2206
5 hours ago
[-]
I am not in data eng, but I do occasionally query the data lake at my company. Where does Snowflake stand in this? (Especially looking at that Modern Data Stack image.)
reply
benrutter
5 hours ago
[-]
I believe Snowflake has its own distributed query engine, similar to, say, BigQuery.

It's a bit of a tricky comparison, because Snowflake and a lot of other tools that get referred to as "modern data stack" are very vendor based. If you're using Snowflake, you're probably using it on Snowflake-provided architecture with a whole load of proprietary stuff. You can't "swap in" Snowflake on the same hardware like you can with Spark, Daft, DuckDB, Polars, etc.

That said, IIRC benchmarks normally place it very close to Spark. It's distributed, so I'd be very surprised if it wasn't in the Spark/Daft ballpark rather than Polars/DuckDB.

reply
roeja
4 hours ago
[-]
Snowflake has its own SQL engine and is more of a serverless option. Databricks started off with Spark but now also has a SQL engine (optionally serverless); they are using Spark in the article.

The Delta format is Databricks' lakehouse file format; Snowflake uses Iceberg, I believe.

Both Snowflake and Databricks also provide a ton of other features like ML, orchestration, and governance. MotherDuck would be the direct competitor here.

That said, there are now extensions to query Snowflake or Databricks data from DuckDB for simple ad hoc querying.

DuckDB is fantastic and has saved me so many times; strongly recommended.

reply
prpl
4 hours ago
[-]
"but now also has a sql engine"

it has had a combined SQL and dataframe engine since March 2015...

reply
throwaw12
5 hours ago
[-]
I am curious about this as well. We use Snowflake, but as a software engineer I want to understand how Spark/Databricks is different; what are we missing out on?

How we work with data is simple: if SQL + a dashboard solves the problem, we do it in Snowflake; if we need something more advanced, then code + a bunch of SQL.

Pretty sure ML engineers work in different ways, but I don't know that side well

reply
jiehong
7 hours ago
[-]
This is somewhat real-world, except the real world would probably index some Parquet columns to avoid a full scan like that.
reply
baq
6 hours ago
[-]
650GB would've fit in the RAM of a not-exotic-at-all, basically off-the-shelf server a decade ago.
reply
dogman123
11 hours ago
[-]
One thing that I never really see mentioned in these types of articles is that a lot of DuckDB's functionality does not work if you need to spill to disk. IIRC, percentiles/quartiles (among other aggregate functions) caused DuckDB to crash when it spilled to disk.
reply
jtbaker
11 hours ago
[-]
I’m pretty sure I’ve done this and not had any issues. Can you share a minimum reproducible example?
reply
abofh
11 hours ago
[-]
$6 of data does not a compelling story make. This is not 1998.
reply
zigzag312
4 hours ago
[-]
DataFusion is another option I would be interested to see in a comparison like this.
reply
gdevenyi
11 hours ago
[-]
I hate these screenshots of commands and outputs everywhere.
reply
tacker2000
5 hours ago
[-]
I also hate it. You can't read it properly on mobile and can't copy it if needed.

But the worst thing is that these are not even real screenshots; the author pasted the text into some terminal-window screenshot generator tool.

reply
hnidiots3
11 hours ago
[-]
650GB? We have 72PB in S3, and I know people who have multiple EB in S3.
reply
0cf8612b2e1e
11 hours ago
[-]
This is not a game of, “mine is bigger than yours”. Many many workloads in the wild are smaller than this.

Motherduck have a few posts about how few people have “big data”. https://motherduck.com/blog/redshift-files-hunt-for-big-data...

reply
esafak
11 hours ago
[-]
That has to be multiple data sets. How big are your individual nightly jobs, and what are you processing them with?
reply