We now have even more layers of abstraction (Airflow, dbt, Snowflake) applied to datasets that often fit entirely in RAM.
I've seen startups burning $5k/mo on distributed compute clusters to process <10GB of daily logs, purely because setting up a 'Modern Data Stack' is what gets you promoted, while writing a robust bash script is seen as 'unscalable' or 'hacky'. The incentives are misaligned with efficiency.
I think a lot of people don't realize machines come with TBs of RAM and hundreds of physical cores. One machine is fucking huge these days.
The problem we have is fucked up piles of shit, not that we don’t have Kubernetes and don’t have containers.
I can maybe make a case for running in containers if you need some specific security properties but .. mostly I think the proliferation of 'fucked up piles of shit' is the problem.
Naturally, that detaches all your containers. And there's no seamless reattach for control plane restart.
(Large EKS cluster)
It is not about what you are doing, it is always about how you do it.
This was the same with doing OCR analysis of assembly and production manuals. Quick and dirty, it would have taken over 24 hours of processing time; after moving to semaphores with parallelization, it took less than two hours to process all the information.
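For anyone curious what "semaphores with parallelization" can look like, here is a minimal sketch and not the setup from the comment above. It assumes the per-page work is a function you supply (`ocr_page` is a hypothetical name) that mostly waits on an external OCR engine, so threads are enough; for CPU-bound pure-Python work you would probably want processes instead.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def ocr_all(pages, ocr_page, max_in_flight=4):
    """Run ocr_page over pages in parallel, with at most max_in_flight calls active.

    ocr_page is a hypothetical callable standing in for the real OCR step.
    """
    gate = threading.BoundedSemaphore(max_in_flight)

    def guarded(page):
        # The semaphore caps how many OCR calls run at the same time,
        # independently of how many worker threads the pool has.
        with gate:
            return ocr_page(page)

    with ThreadPoolExecutor(max_workers=max_in_flight * 2) as pool:
        # pool.map returns results in the same order as the input pages.
        return list(pool.map(guarded, pages))


if __name__ == "__main__":
    print(ocr_all(["page one", "page two"], str.upper, max_in_flight=2))
```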
Unless, of course, you have multiple options and you don’t want to work for a company that’s looking for dumb stuff in interviews.
I optimize my answers for the companies I want to work for, and get rejected by the ones I don't. The hardest part of that strategy is coming to terms with the idea that I constantly get rejected by people that I think are mostly <derogatory_words_here>, but I've developed thick skin over the years.
I'd much rather spend a year unemployed (and do a ton of painful interviews) and find a company whose values align with mine, than work for a year on a team I disagree with constantly and quit out of frustration.
I also believe that running a broken interview process actively selects for qualities you actually don't want, so it's much more likely that teams conducting those interviews aren't teams I want to work on.
Edit: As credence for my claims, the best team I've ever worked on was a team I did 90%+ of the hiring for, and we didn't do any of the 'typical' interview bullshit most companies do.
What we did instead was sit people down and have deep technical conversations about systems they'd worked on in the past. The candidate would explain, in as much detail as they could muster, a system they'd worked on in the past, down to the lowest level details. Usually, they would talk to us for at least 20-30 minutes, then we (the interviewers) would pose questions, usually starting with the form 'if we changed X, what effect would it have'. Doing interviews in this style makes a few things immediately obvious:
1. Did the candidate have a deep, systemic understanding of the system they worked on?
2. Does the candidate have a good mental model for evaluating change in the system?
That's how I conduct interviews, and unsurprisingly, when I get interviewed like that, my success rate is 100%. I don't think I've ever done an interview like that which did not result in an offer.
Anyways, there's some rambling and unsolicited opinions for you :)
Demonstrating competency is always good.
You could have learned this if you were better about collecting requirements. You can tell the interviewer "I'd do it like this for this size data, but I'd do it like this for 100x data. Which size should I design this for?" If they're looking for one direction and you ask which one, interviewers will tell you.
Said another way, how do you have a meaningful conversation about scaling with a person who thinks their application is huge, but in reality only requires a tiny fraction of a single machine? Sometimes, there's such a massive gulf between perception and reality that the only thing to do is chuckle and move on.
It may or may not be related that the places that this happened were always very ethnically monotone with narrow age ranges (nothing against any particular ethnic group, they were all different ethnic monotones)
let’s see how they think and turn this into a paid interview
You're looking for your first DevOps person, so you want someone who has experience doing DevOps. They'll tell you about all the fancy frameworks and tooling they've used to do Serious Business™, and you'll be impressed and hire them. They'll then proceed to do exactly that for your company, and you'll feel good because you feel it sets you up for the future.
Nobody's against it. So you end up in that situation, which even a basic home desktop would be more than capable of handling.
Cost is usually not a huge problem beyond seed stage. Series A-B the biggest problem is growing the customer base so the fixed infra costs become a rounding error. We've built the product and we're usually focused on customer enablement and technical wins - proving that the product works 100% of the time to large enterprises so we can close deals. We can't afford weird flakiness in the middle of a POC.
Another factor I rarely see discussed is bus factor. I've been in the industry for over a decade, and I like to be able to go on vacation. It's nice to hand off the pager sometimes. Using established technologies makes it possible to delegate responsibility to the rest of the team, instead of me owning a little rat's nest fiefdom of my own design.
The fact is that if 5k/month infra cost for a core part of the service sinks your VC backed startup, you've got bigger problems. Investors gave you a big pile of money to go and get customers _now_. An extra month of runway isn't going to save you.
I once interviewed with a company that did some machine learning stuff, this was a while back when that typically meant "1 layer of weights from a regression we run overnight every night". The company asked how I had solved the complex problem of getting the weights to inference servers. I said we had a 30 line shell script that ssh'd them over and then mv'd them into place. Meanwhile the application reopened the file every so often. Zero problems with it ever. They thought I was a caveman.
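The reason the "mv into place" trick works is that a rename within one filesystem is atomic, so a reader that reopens the file sees either the old weights or the new ones, never a half-written file. A minimal Python sketch of the same idea follows; the original was a shell script, and `publish_atomically` and the file names are made up for illustration.

```python
import os
import tempfile


def publish_atomically(data: bytes, dest_path: str) -> None:
    """Write data next to dest_path, then rename it over dest_path in one step."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # The temp file must live on the same filesystem as the destination,
    # otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the bytes are on disk before the swap
        os.replace(tmp_path, dest_path)  # atomic rename, like `mv` on one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    publish_atomically(b"example weights", "weights.bin")
```

The consuming side then just reopens `dest_path` every so often, as described above.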
I have recently started using terraform/tofu and ansible to automate nearly all of the devops operations. We are at a point where Claude Code can use these tools and our existing configs to make configuration changes, debug issues by reviewing logs etc. It is much faster at debugging an issue than I am and I know our stuff inside and out.
I am beginning to think that AI will soon force people to rethink their cloud hosting strategy.
Basically, discoverability is where shell scripts fail.
Or python. The python3 standard library is pretty capable, and it's ubiquitous. You can do a lot in 50-100 lines (counting documentation) with no dependencies. In turn it's easy to plug into the other stuff.
No, it's lack of documentation and no amount of $$$$/m enterprise AI solutions (R)(TM) would help you if there is no documentation.
And then I got laid off. Now, I've got very few modern frameworks on my resume and I've been jobless for over a year.
I'm feeling a right fool now.
There is something wrong with the industry in chasing fads and group think. It has always been this way. Businesses chased Java in the late 90s, early 00s. They chased CORBA, WSDL, ESB, ERP and a host of other acronyms back in the day.
More recently, Data Lake, Big Data, Cloud Compute, AI.
Most of the executives I have met really have no clue. They just go with what is being promoted in the space because it offers a safety net. Look, we are "not behind the curve!". We are innovating along with the rest of the industry.
Interviews do not really test much for ability to think and reason. If you ran an entire ISP, if you figured out, on your own, without any help, how to shard databases, put in multiple layers of redundancy, caching... well, nobody cares now. You had to do it in AWS or Azure or whatever stack they have currently.
Sadly, I do not think it will ever be fixed. It is something intrinsic to human nature.
Need training and something to show? Contribute to some FOSS project.
If you're willing and able to promote yourself internally, you can make people give a shit, or at least publicly claim they do. That's 260k+ per year, and even big businesses are going to care about that at some level, especially if it's something that can be replicated. Find 10 systems you can do that with, and it's 2.6m+ per year.
But, if you don't want to play the self-promotion game, yeah someone else is going to benefit from your work.
Yep, and a lot more datasets fit entirely into RAM now. Ignoring the recent price spikes for a moment, 128GB of RAM in a laptop is entirely achievable and not even the limit of what is possible. That was a pipe dream in 2014 when computers with only 4GB were still common. And of course for servers the max RAM is much higher, and in a lot of scenarios streaming data off a fast local SSD may be almost as good.
This kind of practice is insidious because early on, they charge $20/month to get started on the first 100mb of log ingestion, and you can have it up and running in 30 seconds with a credit card. Who would turn that down?
Revisit that setup 2 years later and it’s turned into a 60k/y behemoth that no one can unwind.
It's just like the systemd people talking about sysvinit. "Eww, shell scripts! What a terrible hack!" says the guy with no clue and no skills.
It's like the whole ship is being steered by noobs.
That's funny. I used to have to clean up the messes caused by systemd's design limitations and flaws, until I built my own distro with a sane init system installed.
Many of the noobs groaning about the indignity of shell scripts don't even realize that they could write init 'scripts' in whatever language they want, including Python (the language these types usually love so much, if they do any programming at all.)
For example, I’ve been dealing with SysV since the early 90s and while it’s gotten better since we no longer have to support the really bizarre Unix variants, my problem with init scripts wasn’t “indignity” but the lack of consistency across distributions and versions, which affects anyone shipping software professionally (“can’t do this easily until $distro upgrades coreutils”), and from an operator’s perspective using Python doesn’t make that better because instead of supporting one consistent thing you’d end up with the subset of features each application team felt like implementing, consistent only to the extent that they care to follow other projects. One virtue of systemd is that having a single common way to specify dependencies, restarts, customization, etc. avoids the ops people having to learn dozens of different variations of the same ideas and especially how to deal with their gaps. A few years back, a data center power outage at one place I worked really highlighted that: the systemd-based servers recovered quickly because they actually had working retries; all of the older stuff using SysV had to be manually reviewed because there were all kinds of problems like races on dependencies like DNS or NFS, retry logic which failed hard after a short period of time, failures because a stale PID file wasn’t removed, or cases where a vendor had simply never implemented retries in their init scripts. While in theory you can handle all of those in SysV most people never did.
After a couple decades of that, a lot of us don’t want to spend time on problems Microsoft solved in Bill Clinton’s first term.
Nothing insurmountable but it meant init files were inevitably much longer than the corresponding Upstart or systemd files despite doing less, and anytime we shipped a new version you had more testing since you had to implement a lot of functionality which is built in to other things.
It's the same thing any corporation should be doing if they were smart, instead of outsourcing everything to RedHat, Microsoft, Google, etc.
Systemd unified and simplified administration across a lot of distributions. Before, it was a hodge podge, and there was a lot of knowledge lost going from rhel to Debian.
I honestly do not like systemd, either. It is okay for managing processes but I wish it didn't spread into everything else in the machine.
Or if it must, could it actually work cohesively across their concepts? Would be nice to have an obvious and easy way to run Quadlet as its own user to isolate further, would be nice to have systemd-sysusers present in /etc/subuid so they can run containers.
I like what they are doing with atomic distros. It would be great to have a single file declarative setup for something like running a containerized reverse HTTP proxy with an isolated user. Instead of "atomic" but you manually edit files in /etc after install.
I've been using this pattern (scripts or code that execute commands against DuckDB) to process data more recently, and the ability to do deep investigations on the data as you're designing the pipeline (or when things go wrong) is very useful. Doing it with a code-based solution (read data into objects in memory) is much more challenging to view the data. Debugging tools to inspect the objects on the heap is painful compared to being able to JOIN/WHERE/GROUP BY your data.
The bottleneck in the example was maxing out disk IO, which I don't think duckdb can help with.
On the other hand, unix sockets combined with socat can perform some real wizardry, but I never quite got the hang of that style.
If the tool of interest works with files (like the UNIX tools do) it fits very well.
If the tool doesn't work with single files I have had some success in using Makefiles for generic processing tasks by creating a marker file that a given task was complete as part of the target.
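The marker-file idea carries over outside of make as well. Here is a rough Python sketch of it, with the caveat that make also compares timestamps against prerequisites while this version only checks that the marker exists; `run_once` is a hypothetical helper name.

```python
from pathlib import Path


def run_once(marker: Path, task) -> bool:
    """Run task() unless marker already exists; create marker on success.

    Returns True if the task ran, False if it was skipped.
    """
    if marker.exists():
        return False
    task()  # if this raises, no marker is written and the task is retried next time
    marker.parent.mkdir(parents=True, exist_ok=True)
    marker.touch()
    return True


if __name__ == "__main__":
    ran = run_once(Path("markers/load_step.done"), lambda: print("loading..."))
    print("ran" if ran else "skipped")
```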
It’s the same story as always, just it used to be Oracle certified tech, now it’s the AWS tech certified to ensure you pay Amazon.
It looked good on someone’s resume and that was it. They are long gone.
I guarantee those rust projects have spent more time playing with rust and library design than the domain problem they are trying to solve.
The issue is you can run sub-TiB jobs on a few small/standard instances with better tooling. Spark and Hadoop are for when you need multiple machines.
Dbt and airflow let you represent your data as a DAG and operate on that, which is critical if you want to actually maintain and correct data issues and keep your data transforms timely.
edit: a little surprised at multiple downvotes. My point is, you can run airflow and dbt on small instances, and you can do all your data processing on small instances with tools like duckdb or polars.
But it is very useful to use a tool like dbt that allows you to re-build and manage your data in a clear way, or a tool like airflow which lets you specify dependencies for runs.
After say 30 jobs or so, you'll find that being able to re-run all downstreams of a model starts to pay off.
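To make "re-run all downstreams" concrete, here is a small sketch of the underlying graph operation. It is not how dbt or Airflow implement it; the job names are invented, `rerun_plan` is a hypothetical helper, and it assumes Python 3.9+ for the standard-library `graphlib`.

```python
from collections import defaultdict, deque
from graphlib import TopologicalSorter


def rerun_plan(deps, changed):
    """Return `changed` plus everything downstream of it, in a valid run order.

    deps maps each job name to the list of jobs it reads from (its upstreams).
    """
    # Invert the edges so we can walk from a job to the jobs that depend on it.
    children = defaultdict(set)
    for job, parents in deps.items():
        for parent in parents:
            children[parent].add(job)

    # Breadth-first walk to collect every job affected by the change.
    affected = {changed}
    queue = deque([changed])
    while queue:
        for child in children[queue.popleft()]:
            if child not in affected:
                affected.add(child)
                queue.append(child)

    # Order only the affected jobs; unaffected upstreams are treated as already built.
    subgraph = {
        job: {p for p in deps.get(job, ()) if p in affected} for job in affected
    }
    return list(TopologicalSorter(subgraph).static_order())


if __name__ == "__main__":
    deps = {
        "stg_orders": ["raw_orders"],
        "stg_users": ["raw_users"],
        "orders_enriched": ["stg_orders", "stg_users"],
        "daily_revenue": ["orders_enriched"],
    }
    print(rerun_plan(deps, "raw_orders"))
```

With a handful of jobs you can keep this in your head; the point of the comment above is that somewhere around a few dozen jobs you no longer can.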
These hardly exist in practice.
But I get what you mean.
datalake (DuckLake), pipelines (hubspot, stripe, postgres), and dashboards in a single app for $250/mo.
marketing/finance get dashboards, everyone else gets SQL + AI access. one abstraction instead of five, for a fraction of your Snowflake bill.
Also seen strange responses from HN commenters when it's mentioned that bash is large and slow compared to ash and bash is better suited for use as an interactive shell whereas ash is better suited for use as a non-interactive shell, i.e., a scripting shell
I also use ash (with tabcomplete) as an interactive shell for several reasons
It’s not just that, it’s that you better know their specific tech stack to even get hired. It’s a lot of dumb engineering leaders pretending that AWS, Azure and Snowflake are such wildly different ecosystems that not having direct experience in theirs is disqualifying (for pure DE roles, not talking broader sysadmin).
The entire data world is rife with people who don’t have the faintest clue what they’re doing, who really like buzzwords, and who have never thought about their problem space critically.
Yes it is an additional layer, but if your orchestration starts concerning itself with what it is doing then something is wrong. It is not a layer on top of other logic, it is a single layer where you define how to start your tasks, how to tell when something is wrong, and when to run them.
If you don't insist on doing heavy computations within the airflow worker it is dirt cheap. If it's something that can easily be done in bash or python you can do it within the worker as long as you're willing to throw a minimal amount of hardware at it.
It's great to see this post I wrote years ago still being useful for people.
I agree with many here that the situation is arguably worse in many ways. However, along similar lines, I've been pleased to see a move away from cargo culting microservices (another topic I addressed in a separate post on that site).
To all those helping companies and teams improve performance, keep it up! There is hope!
Thank you very much!
Been re-reading your post multiple times.
You inspired me to port Waters-Series (kind-of streams) to JavaScript to get pipelining for stream processing.
I think it's a similar pattern to how web dev influencers have convinced everyone to build huge hydrated-spa-framework-craziness where a static site would do.
My advice to get out of this mess:
- Managers, don't ask for specific solutions (spark, react). Ask for clever engineers to solve problems and optimise / track what you care about (cost, performance etc). You hired them to know best, and they probably do.
- Technical leads, if your manager is saying "what about hyperscale?" You don't have to say "our existing solution will scale forever". It's fine to say, "our pipelines handle datasets up to 20GB, we don't expect to see anything larger soon, and if we do we'll do x/y/z to meet that scale". Your manager probably just wants to know scaling isn't going to crash everything, not that you've optimised the hell out of everything for your excel spreadsheet processing pipeline.
And I immediately asked, "in what capacity?" And the answer was don't-know/doesn't-matter, it's just important that we can say we're using it. I really wish I understood where that was coming from (his manager resume-building? somebody getting a kickback?)
mrjob, the tool mentioned in the article, has a local mode that does not use Hadoop, but just runs on the local computer. That mode is primarily for developing jobs you'll later run on a Hadoop cluster over more data. But, for smaller datasets, that local mode can be significantly faster than running on a cluster with Hadoop. That's especially true for transient AWS EMR clusters — for smaller jobs, local mode often finishes before the cluster is up and ready to start working.
Even so, I bet the author's approach is still significantly faster than mrjob's local mode for that dataset. What MapReduce brought was a constrained computation model that made it easy to scale way up. That has trade-offs that typically aren't worth it if you don't need that scale. Scaling up here refers to data that wouldn't easily fit on disks of the day — the ability to seamlessly stream input/output data from/to S3 was powerful.
I used mrjob a lot in the early 2010s — jobs that I worked on cumulatively processed many petabytes of data. What it enabled you to do, and how easy it was to do it, was pretty amazing when it was first released in 2010. But it hasn't been very relevant for a while now.
By applying some trivial optimizations, like streaming the parsing, I essentially managed to get it to run at almost disk speed (1GB/s on an SSD back then).
Just how much data do you need when these sort of clustered approaches really start to make sense?
Hah, incredibly funny. I remember doing the complete opposite about 15 years ago: some beginner developer had set up a whole interconnected system with multiple processes and whatnot in order to process a bunch of JSON, and it took forever. It got replaced with a bash script + Python!
> Just how much data do you need when these sort of clustered approaches really start to make sense?
I dunno exactly what thresholds others use, but I usually say if it'd take longer than a day to process (efficiently), then you probably want to figure out a better way than just running a program on a single machine to do it.
Quick Python/bash to cleanup data is fine too I suppose and with LLMs, it's easier than ever to write the quick throwaway script.
I think most people used R. Free and great graphing. Though the interactivity of Excel is great for what ifs. I never got R till I took that class. Though RStudio makes R seem like scriptable excel.
R/Python are fast enough for most things though a lot of genomic stuff (Blast alignments etc..) are in compiled languages.
In practice most AWS instances are 10Gbps capped. I have seen ~5Gbps consistently read from GCS and S3. Nitro based images are in theory 100Gbps capable, in practice I've never seen that.
This has, at multiple companies for me, been the cause of surprise incidents, where people were unaware of this fact and were then surprised when the bandwidth suddenly plummeted by 50% or more after a sustained load.
I did not see your comment earlier, but to stay with Chess see https://news.ycombinator.com/item?id=46667287, with ~14TB uncompressed.
It's not humongous and it can certainly fit on disk(s), but not on a typical laptop.
You really need an enormous amount of data (or data processing) to justify a clustered setup. Single machines can scale up rather quite a lot.
It'll cost money, but you can order a 24x128GB ram, 24x30TB ssd system which will arrive in a few days and give you 3 TB ram, 720 TB (fast) disk. You can go bigger, but it'll be a little exotic and the ordering process might take longer.
If you need more storage/ram than around that, you need clustering. Or if the processing power you get in your single system storage isn't enough, you would need to cluster, but ~ 256 cores of cpu is enough for a lot of things.
The bulk of the data was in big JSON arrays, so you basically consumed the array start token, then used the parser to consume an entire object which could be turned into a C# object by the deserializer, then you consumed a comma or end array token until you ran out of tokens.
I had to do it like this because DS-es were running into the problem that some of the files didn't fit into memory. The previous approach took 1 hour, involved reading the whole file into memory and parsing it as JSON (when some of the files got over 10GB, even 64GB memory wasn't enough and the system started swapping).
It wasn't fast even before swapping (I learned just how slow Python can be), but then basically it took a day to run a single experiment. Then the data got turned into a dataframe.
I replaced that part of the Python code processing and outputted a CSV which Pandas could read without having to trip through Python code (I guess it has an internal optimized C implementation).
The preprocessor was able to run on the build machines and DSes consumed the CSV directly.
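The same consume-one-element-at-a-time idea can be sketched in Python with only the standard library. This is not the C# code from the comment above, just an illustration of the technique: `iter_array_objects` is a hypothetical helper, and it assumes the top-level array holds objects (a bare number split across two chunks could be decoded too early).

```python
import io
import json


def iter_array_objects(fp, chunk_size=65536):
    """Yield each element of a top-level JSON array of objects, one at a time.

    Only the current element plus one chunk is buffered, never the whole file.
    """
    decoder = json.JSONDecoder()
    buf = ""
    started = False
    eof = False
    while True:
        # Drop whitespace and element separators before the next token.
        buf = buf.lstrip(" \t\r\n,")
        if not started:
            if buf[:1] == "[":  # consume the array start token
                buf = buf[1:]
                started = True
                continue
            if buf:
                raise ValueError("expected a top-level JSON array")
        elif buf[:1] == "]":  # array end token: we are done
            return
        elif buf:
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                # Most likely an element cut off by the chunk boundary; read more.
                # (Truly malformed input keeps buffering until EOF, then raises.)
                if eof:
                    raise
            else:
                yield obj
                buf = buf[end:]
                continue
        if eof:
            raise ValueError("unexpected end of input")
        chunk = fp.read(chunk_size)
        if chunk:
            buf += chunk
        else:
            eof = True


if __name__ == "__main__":
    sample = io.StringIO('[{"id": 1}, {"id": 2}]')
    for record in iter_array_objects(sample, chunk_size=4):
        print(record)
```

Each yielded record can be written straight out as a CSV row, which is roughly the shape of the preprocessor described above.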
Any large XML document will clobber a program using the in-memory representations, and the solution is to move to XmlReader. System.Text.Json (.NET built-in parsing) has a similar token-based reader in addition to the standard (de)serialization to objects approach.
I've seen so many times that data processing quickly became a bottleneck and source of frustration with Python that stuff needed to be rewritten, that I came to not bother writing stuff in Python in the first place.
You can make Python fast by relying on NumPy and pandas with array programming, but it can be quite challenging to format and massage the data so that what you want can be expressed as array programming ops; usually it became too much of a burden for me.
I wish Python was at least as fast as Node (which also can have its own share of performance cliffs)
It's possible that nowadays Python has JITs that improve performance to Java levels while keeping compatibility with most existing code - I haven't used Python professionally in quite a few years.
> native code parsing speedups for most common platforms
Which is to say, roughly analogous to "relying on NumPy". (A well-designed system avoids repeatedly calling from Python to C and prefers to let loops live within the C code; that applies at least as much to tree-like data as array-like data.)
> I wish Python was at least as fast as Node (which also can have its own share of performance cliffs) It's possible that nowadays Python has JITs that improve performance to Java levels while keeping compatibility with most existing code - I haven't used Python professionally in quite a few years.
No guarantees, but have you tried PyPy? It's existed since 2007 and definitely improved over time.
I would say that "performance cliffs" are just endemic to programming. Even in C you find people writing bad algorithms because better ones seem (at least superficially) much harder to write — especially if the good algorithm requires, say, a hash table. (C++ standard library containers definitely ameliorate this effect, but you pay in code complexity, especially where templates are needed.) And on the other hand you sometimes see big improvements from dropping to assembly (cf. ffmpeg).
Anyway, you write a state machine that processes the string in chunks – as you would do with a regular parser – but the difference is that the parser is eager to spit out a stream of data that matches the query as soon as you find it.
The objective is to reduce the memory consumption as much as possible, so that your program can handle an unbounded JSON string and only keep track of where in the structure it currently is – like a jQuery selector.
You can buffer data, or yield as it becomes available before discarding, or use the visitor pattern, and others.
One Python library that handles pretty much all of them, as a place to start learning, would be: https://github.com/daggaz/json-stream
https://learn.microsoft.com/en-us/dotnet/standard/serializat...
I can however say that when I had a job at a major cloud provider optimizing Spark core for our customers, one of the key areas where we saw rapid improvement was simply using fewer machines: vertically scaled hardware almost always outperformed any sort of distributed system (albeit not always from a price-performance perspective).
The real value often comes from the ability to do retries, leverage leftover underutilized hardware (i.e. spot instances, or your own data center at times when scale is lower), handle hardware failures, etc., all with the ability for the full above suite of tools to work.
A simple equijoin with high cardinality and indexed columns will usually be extremely fast. The same join in a 1:M might be fast, or it might result in a massive fanout. In the case of the latter, if your RDBMS uses a clustering index, and if you’ve designed your schemata to exploit this fact (e.g. a table called UserPurchase that has a PK of (user_id, purchase_id)), it can still be quite fast.
Aggregations often imply large amounts of data being retrieved, though this is not necessarily true.
And many important datasets never make it into any kind of database like that. Very few people provide "index columns" in their CSV files. Or they use long variable length strings as their primary key.
OP pertains to that kind of data. Some stuff in text files.
The amount of Python jobs I've had which run fine for several hours and then break with runtime errors, whereas with C# you can be reliably sure that if it starts running it will finish running.
I'm not necessarily the biggest fan of python, but writing a data engineering tool in a non-data engineering focused language seems like a bad decision. Now when the OP leaves, the organization is in a much tougher position.
Consider the following table of medical surgeries: date, physician_name, surgery_name, success.
"What are the top 10 most common surgeries?" - easy in bash
"Who are the top physicians (% success) in the last year for those surgeries?" - still easy in bash
"Which surgeries are most affected by physician experience?" - very hard in bash, requires calculating for every surgery how many times that physician had performed that surgery on that day, then compare low and high experience outcomes.
A researcher might see a smooth continuum of increasingly complex questions, but there are huge jumps in computational complexity. A 50GB dataset might be 'bigger' than a 2TB one if you are asking tough questions.
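To show why the third question is a different kind of problem, here is a rough sketch of one way to compute it. It is only an illustration under stated assumptions: the column names come from the table above, but ISO-formatted dates, `success` encoded as "1"/"0", and the experience threshold are all assumptions, and `experience_effect` is a made-up name. The key point is that each row needs state carried over from all earlier rows for the same physician and surgery.

```python
from collections import defaultdict


def experience_effect(rows, threshold=2):
    """Per surgery, success rate of experienced minus inexperienced physicians.

    rows: dicts with keys date, physician_name, surgery_name, success.
    A physician counts as experienced for a surgery once they have already
    performed it `threshold` times before the row in question.
    """
    prior = defaultdict(int)  # (physician, surgery) -> operations seen so far
    # surgery -> bucket -> [successes, total]
    tally = defaultdict(lambda: {"low": [0, 0], "high": [0, 0]})

    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    for row in sorted(rows, key=lambda r: r["date"]):
        key = (row["physician_name"], row["surgery_name"])
        bucket = "high" if prior[key] >= threshold else "low"
        cell = tally[row["surgery_name"]][bucket]
        cell[0] += row["success"] == "1"
        cell[1] += 1
        prior[key] += 1  # running state that a one-pass grep/sort/uniq cannot carry

    effect = {}
    for surgery, buckets in tally.items():
        (low_ok, low_n), (high_ok, high_n) = buckets["low"], buckets["high"]
        if low_n and high_n:  # only compare when both groups have data
            effect[surgery] = high_ok / high_n - low_ok / low_n
    return effect


if __name__ == "__main__":
    rows = [
        {"date": "2020-01-01", "physician_name": "A", "surgery_name": "x", "success": "0"},
        {"date": "2020-01-02", "physician_name": "A", "surgery_name": "x", "success": "1"},
    ]
    print(experience_effect(rows, threshold=1))
```

Sorting the whole table by date is the expensive part, and it is what makes this awkward as a shell pipeline even though the per-row logic is simple.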
It's easier for a business to say "we use Spark for data processing", than "we build bespoke processing engines on a case by case basis".
(2018, 222 comments) https://news.ycombinator.com/item?id=17135841
(2022, 166 comments) https://news.ycombinator.com/item?id=30595026
(2024, 139 comments) https://news.ycombinator.com/item?id=39136472 - by the same submitter as this post.
1K rows: use excel
1M rows: use pandas/polars
1B rows: use shakti
1T rows: only shakti
Source: https://web.archive.org/web/20230331180931/https://shakti.co...
Plus, they require a bit of reading because they operate on a higher level of abstraction than loops and ifs. You get implicit loops, your fields get cut up automatically, and you can apply regexes simultaneously on all fields. So it's not obvious to the untrained eye.
But you get a lot of power and flexibility on the cli, which enable you to rapidly put together an ad hoc solution which can get the job done or at least serve as a baseline before you reach for the big guns.
It would be interesting to redo the benchmark but with a (much) larger database.
Nowadays the biggest open dataset for chess must come from Lichess https://database.lichess.org, with ~7B games and 2.34 TB compressed, ~14TB uncompressed.
Would Hadoop win here?
The "EMR over S3" paradigm is based on the assumption that the data isn't read all that frequently, 1-10x a day typically, so you want your cheap S3 storage but once in a while you'll want to crank up the parallelism to run a big report over longer time periods.
The compressed data can fit onto a local SSD. Decompression can definitely be streamed.
I could be tempted to do some fun on an NVL72 ;-)
If you want speed, just have your database stored in the same place as your application, locally, rather than hopping across the world to retrieve data that can be located next to the code.
That would probably be the easiest thing to do to get real, measured performance gains.
As other commentators pointed out, computers are extremely powerful. This isn't 1995, you can easily host everything in the same local area and get a very responsive application with very minimal needs to worry about resource constraints.
In which case it makes the analysis a bit less practical, since the main use case I have for fancy data processing tools is when I can’t load a whole big file into memory.
Unix shell pipelines are task-parallel. Every tool gets spun up as its own Unix process (think "program", via fork-exec). Standard input and standard output (stdin, stdout) get hooked up to pipes. Pipes are like temporary files managed by the kernel (hand-wave). Pipe buffer size is small, typically tens of KB (64 KiB by default on Linux). Grep does a blocking read on stdin. Cat writes to stdout. Both on a kernel I/O boundary. Here the kernel can context-switch the process when waiting for I/O.
In the past there was time-slicing. Now with multiple cores and hardware threads they actually run concurrently.
This is very similar to the old-school approach to something like multiple threads, but processes don’t share virtual address spaces in the CPU's memory management unit (MMU).
Further details: look up McIlroy's pipeline design.
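If you want to see the plumbing without a shell, the same wiring can be done by hand. A minimal sketch, assuming nothing beyond the standard library: two child Python processes stand in for the tools in a pipeline, connected by a kernel pipe, and `count_lines_ending_in_7` is just an illustrative name.

```python
import subprocess
import sys


def count_lines_ending_in_7(n):
    """Equivalent of `producer | consumer`: two processes joined by a kernel pipe."""
    producer_code = f"for i in range({n}): print(i)"
    consumer_code = (
        "import sys; "
        "print(sum(1 for line in sys.stdin if line.rstrip().endswith('7')))"
    )
    # First process writes to a pipe instead of the terminal.
    producer = subprocess.Popen(
        [sys.executable, "-c", producer_code], stdout=subprocess.PIPE
    )
    # Second process reads that pipe as its stdin; both now run concurrently.
    consumer = subprocess.Popen(
        [sys.executable, "-c", consumer_code],
        stdin=producer.stdout,
        stdout=subprocess.PIPE,
        text=True,
    )
    # Close our copy of the read end so the pipe's lifetime is tied to the children.
    producer.stdout.close()
    out, _ = consumer.communicate()
    producer.wait()
    return int(out)


if __name__ == "__main__":
    print(count_lines_ending_in_7(100000))
```

When the producer fills the pipe buffer it blocks until the consumer reads, which is the back-pressure that lets a pipeline stream data far larger than RAM.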
In that case to get best performance, you’d have to shard your data across a cluster and use mapreduce.
Even in the author's 2014 world of SSDs and multi-core consumer PCs, their aggregate pipeline would be around 2x faster if the work was split across two equivalent machines.
The limit of how much faster distributed computing is comes down to latency more than throughput. I’d not be surprised if this aggregate query could run in 10ms on pre-sharded data in a distributed cluster.
The comments here smell of "real engineers use command line". But I am not sure they ever actually worked with analysing data more than using it as a log parser.
Yes Hadoop is 2014.
These days you obviously don't set up a Hadoop cluster. You use the cloud provider service provided (BigQuery or AWS Athena for example).
Or map your data into DuckDB or use polars if it is small.
It really feels that way. Real data analysis involves a lot more than just grepping logs. And the reason to be wary of starting out unprepared for that kind of analysis is that migrating to a better solution later is a nightmare.
I like this one where they put a dataset on 80 machines, only for someone to then put the same dataset on 1 Intel NUC and outperform it in query time.
https://altinity.com/blog/2020-1-1-clickhouse-cost-efficienc...
Datasets never become big enough…
Just because you don't have experience of these situations, it doesn't mean they don't exist. There's a reason Hadoop and Spark became synonymous with "big data."
The solutions are well known even to many non-programmers who actually have that problem:
There are also sensor arrays that write 100,000 data points per millisecond. But again, that is a hardware problem not a software problem.
Having materialised views increases insert load for every view, so if you want to slice your data in a way that wasn't predicted, or that would have increased ingress load beyond what you've got to spare, say, find all devices with a specific model and year+month because there's a dodgy lot, you'll really wish you were on a DB that can actually run that query instead of only being able to return your _precalculated_ results.
Not only is this a contrived non-comparison, but the statement itself is readily disproven by the limitations basically _everyone_ using single-instance ClickHouse often runs into if they actually have a large dataset.
Spark and Hadoop have their place, maybe not in rinky dink startup land, but definitely in the world of petabyte and exabyte data processing.
Bane's rule: you don't understand a distributed computing problem until you can get it to fit on a single machine first.