It isn't that AMD has better AVX-512 support, which would be an impressive upset on its own. It's just that AMD has AVX-512 on consumer CPUs at all, because Intel walked away from its own investment.
They keep repeating the same mistakes, going all the way back to https://en.wikipedia.org/wiki/Intel_iAPX_432
It's hard to understand how they could have played that particular hand any worse. Even a few years on, I'm missing Optane drives because there is still no functional alternative. If they'd just held out a bit longer, they would have created a set of enterprise customers who would still be buying the things in 2040.
You need scratch space that's resilient to a power outage? An NVDIMM is faster and cheaper. You need fast storage? Flash keeps getting faster and cheaper. Optane was squeezed from both sides and could never hope to generate the volume needed to cut costs.
So now imagine that you are at Intel deciding what initiatives to fund. The company is in trouble and needs to show some movement out of the red, preferably quickly. It also lost momentum and lost ground to competitors, so it needs to focus. What do you do? You kill all the side projects that will never make much money. And of course you kill a lot of innovation in the process, but how would you justify the alternative?
First, NVMe is a protocol to access block storage. It can be used to access any kind of block device, Optane, SSD, NVDIMM, virtual storage on EC2, etc. So it's true that the protocol is the same (well, not quite - more on this in a bit), but that's like saying a server is the same as an iPhone because they can both speak TCP/IP.
What was the "more in a bit" bit? Persistent memory (PMEM) devices like NVDIMMs and Optane can usually speak two protocols. They can either act as storage, or as memory expansion. But this memory also happens to be non-volatile.
This was sold as a revolution, but it turned out that it's not easy for current operating systems and applications to deal with memory with vastly different latencies. Also it turns out that software is buggy, and being able to lose state by rebooting is useful. And so Optane in memory mode never really caught on, and these devices were mostly used as a storage tier. However: look up MemVerge.
So you are right that it turned out to be a faster SSD, but the original promise was a lot more. And here comes the big problem: because Optane was envisioned as a separate kind of product between RAM and SSD, the big price differential could be justified. If it's just a faster SSD - well, the market has spoken.
I guess many on HN are software developers looking at Optane.
In reality Optane was simply not cost effective. Optane came at a time when DRAM cost per GB was at its peak, and the idea that developers could have slower DRAM that is non-volatile sounds great, until they realise slower DRAM causes CPU performance regressions. Optane Memory, even in its roadmap for future products, would always effectively be another layer between DRAM and NAND (storage). And they could barely make a profit when DRAM was at its peak. I don't think people realise there is nearly a 4x price difference between the height of DRAM prices in ~2016 and ~2023.
In terms of Optane storage, it again arrived at NAND's cost-per-GB peak, and it was barely competitive or profitable. Most would immediately point out it had lower latency and better QD1 performance. But Samsung showed with Z-NAND, which is specially tuned SLC NAND, that you could get close-enough performance with far higher bandwidth and QD32 results, while using much less power, and with a roadmap that reliably tracked mainstream NAND development. Even Samsung stopped development of Z-NAND in 2023.
The truth is the market wasn't interested enough in Optane at the price, performance, and features it was offering. And as for Intel's execution on Optane: they either over-promised (as they did throughout that era) and failed to deliver on time, or they were basically lying about its potential. They also failed to bring down the cost of fabbing it, which they blamed on Micron, but in reality that is all on Intel.
The industry has also repeatedly stated it is not interested in a technology that is single-sourced from Intel or Micron, unlike NAND and DRAM.
Intel was giving Optane away and pushing it to Facebook and other hyperscalers. But even then they couldn't fill the minimum order with Micron and had to pay hundreds of millions per year for empty fabs.
Do we need a second "killed by google"?
To companies like Intel or Google anything below a few hundred million users is a failure. Had these projects been in a smaller company, or been spun out, they'd still be successful and would've created a whole new market.
Maybe I'm biased — a significant part of my career has been working for German Mittelstand "Hidden Champions" — but I believe you don't need a billion customers to change the world.
Which begs the question, why isn’t anyone else stepping into this gap? Is the technology heavily patented?
Case No. 12-43166 is what killed Optane.
Or, in a manner of speaking, Intel being Intel killed Optane.
Optane did not have a promising future. The $/GB gap between 3D XPoint memory and 3D NAND flash memory was only going to keep growing. Optane was doomed to only be appealing to the niche of workloads where flash memory is too slow and DRAM is too expensive. But even DRAM was increasing in density faster than 3D XPoint, and flash (especially the latency-optimized variants that are still cheaper than 3D XPoint) is fast enough for a lot of workloads. Optane needed a breakthrough improvement to secure a permanent place in the memory hierarchy, and Intel couldn't come up with one.
Except Intel deliberately made AVX-512 a feature exclusively available to Xeon and enterprise processors in future generations. This backward step artificially limits its availability, forcing enterprises to invest in more expensive hardware.
I wonder if Intel has taken a similar approach with Arc GPUs, which lack support for GPU virtualization (SR-IOV). They somewhat added vGPU support to all built-in 12th-14th Gen chips through the i915 driver on Linux. It’s a pleasure to have graphics-acceleration in multiple VMs simultaneously, through the same GPU.
Thing is, they could have killed it by 1998, without ever releasing anything, and it still would have killed off the other architectures it was trying to compete with. Instead they waited until 2020 to end support.
What the VLIW of Itanium needed and never really got was proper compiler support. Nvidia has this in spades with CUDA. It's easy to port to Nvidia where you do get serious speedups. AVX-512 never offered enough of a speedup from what I could tell, even though it was well supported by at least ICC (and numpy/scipy when properly compiled)
This is kinda under-selling it. The fundamental problem with statically-scheduled VLIW machines like Itanium is it puts all of the complexity in the compiler. Unfortunately it turns out it's just really hard to make a good static scheduler!
In contrast, dynamically-scheduled out-of-order superscalar machines work great but put all the complexity in silicon. The transistor overhead was expensive back in the day, so statically-scheduled VLIWs seemed like a good idea.
What happened was that static scheduling stayed really hard while the transistor overhead for dynamic scheduling became irrelevantly cheap. "Throw more hardware at it" won handily over "Make better software".
Is the latter part true? AFAIK most of modern CPU die area and power consumption goes towards overhead as opposed to the actual ALU operations.
So it's clearly true that the transistor overhead of dynamic scheduling is cheap compared to the (as-yet unsurmounted) cost of doing static scheduling for software that doesn't lend itself to that approach. But it's probably also true that dynamic scheduling is expensive compared to ALUs, or else we'd see more GPU-like architectures using dynamic scheduling to broaden the range of workloads they can run with competitive performance. Instead, it appears the most successful GPU company largely just keeps throwing ALUs at the problem.
Perhaps Intel really wanted it to work, and killing other architectures was only a side effect?
I would argue that it was bound to happen one way or another eventually, and Itanium just happened to be a catalyst for the extinction of nearly all alternatives.
High to very high performance CPU manufacturing (NB: the emphasis is on the manufacturing) is a very expensive business, and back in the 1990s no-one was able (or willing) to invest in the manufacturing and commit to the continuous investment in keeping the CPU manufacturing facilities up to date. For HP, SGI, Digital Equipment, Sun, and IBM, a high performance RISC CPU was the single most significant enabler, yet not their core business. It was a truly odd situation where they all had a critical dependency on CPUs, yet none of them could manufacture them themselves and were all reliant on a third party[0].
Even Motorola, which was in some very serious semiconductor business, could not meet the market demand[1].
Look at how much it costs Apple to get what they want out of TSMC – it is tens of billions of dollars almost every year, if not every year. We can see very well today how expensive it is to manufacture a bleeding-edge, high-performing CPU – look no further than Samsung, GlobalFoundries, the beloved Intel, and many others. Remember the days when Texas Instruments used to make CPUs? Nope, they don't make them anymore.
[0] Yes, HP and IBM used to produce their own CPUs in-house for a while, but then that ceased as well.
[1] The actual reason why Motorola could not meet the market demand was, of course, an entirely different one – the company management did not consider CPUs to be their core business, as they primarily focused on other semiconductor products and on defence, which left CPU production in an underinvested state. Motorola could have become a TSMC if they could have seen the future through a silicon dust shroud.
wdym
Case No. 12-43166 is what finally killed Optane.
Original: 18 GB/s
AVX2: 20 GB/s
AVX512: 21 GB/s
This is an AMD CPU, but it's clear that the AVX512 benefits are marginal over the AVX2 version. Note that Intel's consumer chips do support AVX2, even on the E-cores.
But there's more to the story: This is a single-threaded benchmark. Intel gave up AVX512 to free up die space for more cores. Intel's top of the line consumer part has 24 cores as a result, whereas AMD's top consumer part has 16. We'd have to look at actual Intel benchmarks to see, but if the AVX2 to AVX512 improvements are marginal, a multithreaded AVX2 version across more cores would likely outperform a multithreaded AVX512 version across fewer cores. Note that Intel's E-cores run AVX2 instructions slower than the P-cores, but again the AVX boost is marginal in this benchmark anyway.
I know people like to get angry at Intel for taking a feature away, but the real-world benefit of having AVX512 instead of only AVX2 is very minimal. In most cases, it's probably offset by having extra cores working on the problem. There are very specific workloads, often single-threaded, that benefit from AVX-512, but on a blended mix of applications and benchmarks I suspect Intel made an informed decision to do what they did.
So to force true AVX2 the benchmark would have to be run with `DOTNET_EnableAVX512F=0`, which I assume is not the case here.
[0]: https://devblogs.microsoft.com/dotnet/performance-improvemen...
[1]: https://devblogs.microsoft.com/dotnet/performance-improvemen...
Look at any existing heavily multithreaded benchmark like Blender rendering. The E-cores are so weak that it just about takes 2 of them to match the performance of an AMD core. If the only difference was AVX512 support then yeah, 24 AVX2 cores would beat 16 AVX-512 cores. But that's not the only difference, not even close.
That's not to say a 24 core Core 9 Ultra Whatever would be slower than a 16 core 9950X in this workload. Just that the E-cores are kinda shit, especially in the wonky counts Intel is using (too many to just be about power efficiency, too few to really offset how slow they are)
That's not "weak". If you look at available die-shot analyses, the E-cores are tiny compared to the P-cores, they take up a lot less than half in area and even less in power. P-cores are really only useful for the rare pure single-threaded workload, but E-cores will win otherwise.
AMD has already been shipping AVX-512 in their consumer processors for longer than Intel did.
> A bit surprisingly the AVX2 parser on 9950X hit ~20GB/s! That is, it was better than the AVX-512 based parser by ~10%, which is pretty significant for Sep.
They fixed it, that's the whole point, but I think there's evidence that AVX-512 doesn't actually benefit consumers that much. I would be willing to settle for a laptop that can only parse 20GB/s and not 21GB/s of CSV. I think vector assembly nerds care about support much more than users.
Edit: they do make use of ternary logic to avoid one OR operation, which is nice. Basically (a | b | c) | d is computed using `vpternlogd` and `vpor` respectively.
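For the curious, a minimal sketch of that trick with AVX-512 intrinsics (my own illustration, not the library's actual code): 0xFE is the three-input truth-table immediate for a | b | c, so four inputs need only one `vpternlogd` plus one `vpor` instead of three `vpor`s.

```c
#include <immintrin.h>

/* Illustration only: OR four 512-bit vectors using two instructions.
   0xFE is the truth-table immediate for "a | b | c". */
static inline __m512i or4(__m512i a, __m512i b, __m512i c, __m512i d) {
    __m512i abc = _mm512_ternarylogic_epi32(a, b, c, 0xFE); /* vpternlogd */
    return _mm512_or_si512(abc, d);                         /* vpor */
}
```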
You can't claim this when you also do a huge hardware jump
Then if we take 0.9.0 on previous hardware (13088) and add the 17%, it's 15375. Version 0.1.0 was 7335.
So... 15375/7335 -> a staggering 2.1x improvement in just under 2 years
Higher x -> lower y -> more CPU for my actual workload.
Like, sure, I can give you an application server with faster disks and more memory and you or me are certainly capable of implementing an application server that could load the data from disk faster than all of that. And then we build caching to keep the hot data in memory, because that's faster.
But then we've spent very advanced development resources to build a relational database with some application code at the edge.
This can make sense in some high frequency trading situations, but in many more mundane web-backends, a chunky database and someone capable of optimizing stupid queries enable and simplify the work of a much bigger number of developers.
I did once use a system where the network bandwidth was in the same ballpark as the memory bandwidth, which might not be surprising for some of the real HPC-heads here but it surprised me!
Multiple cores decompressing LZ4 compressed data can achieve crazy bandwidth. More than 5 GB/s per core.
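If anyone wants to sanity-check the per-core figure, a rough single-threaded measurement against the reference liblz4 C API could look like this (the buffer size and the highly compressible test data are arbitrary choices of mine, so treat the printed number accordingly):

```c
#include <lz4.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    /* Arbitrary, fairly compressible input: 64 MiB of repeated text. */
    const size_t src_size = 64u << 20;
    char *src = malloc(src_size);
    for (size_t i = 0; i < src_size; i++) src[i] = "abcabcabd"[i % 9];

    int bound = LZ4_compressBound((int)src_size);
    char *comp = malloc((size_t)bound);
    int comp_size = LZ4_compress_default(src, comp, (int)src_size, bound);

    char *dst = malloc(src_size);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    int n = LZ4_decompress_safe(comp, dst, comp_size, (int)src_size);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("decompressed %d bytes at %.2f GB/s on one core\n",
           n, n / secs / 1e9);
    free(src); free(comp); free(dst);
    return 0;
}
```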
Well, they did. Personally, I find it an interesting way of looking at it, it's a lens for the "real performance" one could get using this software year over year. (Not saying it isn't a misleading or fallacious claim though.)
Straight to the trash with this post.
Folks should check out https://github.com/dathere/qsv if they need an actually fast CSV parser.
5950x is Zen 3
9950x is Zen 5
- Someone decides on CSV because it's easy to produce and you don't have that much data. Plus it's easier for the <non-software people> to read so they quit asking you to give them Excel sheets. Here <non-software people> is anyone who has a legit need to see your data and knows Excel really well. It can range from business types to lab scientists.
- Your internal processes start to consume CSV because it's what you produce. You build out key pipelines where one or more steps consume CSV.
- Suddenly your data increases by 10x or 100x or more because something started working: you got some customers, your sensor throughput improved, the science part started working, etc.
Then it starts to make sense to optimize ingesting millions or billions of lines of CSV. It buys you time so you can start moving your internal processes (and maybe some other teams' stuff) to a format more suited for this kind of data.
Sometimes using something standardized is just worth it though.
I really don't get, though, why people can't just use protocol buffers instead. Is protobuf really that hard?
For better or worse, CSV is easy to produce via printf. Easy to read by breaking lines and splitting by the delimiter. Escaping delimiters that appear in the content is not hard, though it's often added as an afterthought.
Protobuf requires you to install a library. Understand how it works. Write a schema file. Share the schema with others. The API is cumbersome.
Finally, to offer this mutable-struct-with-setters-and-getters abstraction, with variable-length encoded numbers, variable-length strings, etc., the library ends up quite slow.
In my experience protobuf is slow and memory hungry. The generated code is also quite bloated, which is not helping.
See https://capnproto.org/ for details from the original creator of protobuf.
Is CSV faster than protobuf? I don't know, and I haven't tested. But I wouldn't be surprised if it is.
Based on the amount of software I've seen that produces broken CSV or can't parse (more-or-less) valid CSV, I don't think that is true.
It seems easy, because it's just printf("%s,%d,%d\n", ...), but it is full of edge cases most programmers don't think about.
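To make those edge cases concrete, here's a small RFC 4180-style sketch of my own showing what a correct field writer has to do beyond a bare %s: quote any field containing a comma, quote, or line break, and double the embedded quotes.

```c
#include <stdio.h>
#include <string.h>

/* Sketch of RFC 4180-style field writing: the part naive printf skips. */
static void write_csv_field(FILE *out, const char *s) {
    if (strpbrk(s, ",\"\r\n") == NULL) {  /* plain field: emit as-is */
        fputs(s, out);
        return;
    }
    fputc('"', out);
    for (const char *p = s; *p; p++) {
        if (*p == '"') fputc('"', out);   /* embedded quote becomes "" */
        fputc(*p, out);
    }
    fputc('"', out);
}
```

A value like `He said "hi", twice` has to come out as `"He said ""hi"", twice"`, which is exactly the part the printf one-liner forgets.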
I’d love to pass parquet data around, or SQLite dbs, or something else, but that requires dedicated support from other teams upstream/downstream.
Everyone and everything supports CSV, and when they don’t they can hack a simple parser quickly. I know that getting a CSV parser right for all the edge cases is very hard, but they don’t need to. They just need to support the features we use. That’s simple and quick and everyone quickly moves on to the actual work of processing the data.
data = (uint32_t *)read(f);
Or
data = struct.unpack...
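Spelled out slightly (my own sketch, with a hypothetical dump of little-endian uint32 values): with a fixed binary layout the whole "parser" is a read plus a cast.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Hypothetical file containing raw little-endian uint32 values. */
    FILE *f = fopen("values.bin", "rb");
    if (!f) return 1;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);

    uint32_t *data = malloc((size_t)size);
    if (fread(data, 1, (size_t)size, f) != (size_t)size) return 1;
    fclose(f);

    size_t n = (size_t)size / sizeof(uint32_t);  /* no parsing loop at all */
    printf("read %zu values, first = %u\n", n, (unsigned)(n ? data[0] : 0));
    free(data);
    return 0;
}
```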
Sounds like you're dealing with more heavily formatted or variably formatted data that benefits from more structure to it
e.g. I know in the .NET space, MessagePack is usually faster than proto, and I think similar is true for the JVM. The main disadvantage is there's no good schema-based tooling around it.
If your data is coming from a source you don’t own, it’s likely to include data you don’t need. Maybe there’s 30 columns and you only need 3 - or 200 columns and you only need 1.
Enterprise ETL is full of such cases.
Developers: hey, let's hack everything XML had back onto JSON except worse and non-standardized. Because it turns out you need those things sometimes!
So 21 GB/s would be solely algos talking to algos... Given all the investment in the algos, surely they don't need to be exchanging CSV around?
Imagine you want to replace CSV for this purpose. From a purely technical view, this makes total sense. So you investigate, come up with a better standard, make sure it has all the capabilities everyone needs from the existing stuff, write a reference implementation, and go off to get it adopted.
First place you talk to asks you two questions: "Which of my partner institutions accept this?" "What are the practical benefits of switching to this?"
Your answer to the first is going to be "none of them" and the answer to the second is going to be vague hand-wavey stuff around maintainability and making programmers happier, with maybe a little bit of "this properly handles it when your clients' names have accent marks."
Next place asks the same questions, and since the first place wasn't interested, you have the same answers....
Replacing existing standards that are Good Enough is really, really hard.
Depends on the distribution of numbers in the dataset. It's quite common to have small numbers, and for those, text is a more efficient representation than binary, especially compared to 64-bit or larger binary encodings: the value 7 takes two bytes as text (the digit plus a delimiter) but eight bytes as a fixed int64.
CSV wouldn't even be considered.
Yes, but the consequences of these decisions are worth much more. You attach an ID to the user, and an ID to the transaction. You store the location and time where it was made. Etc.
The speed of human decisions plays basically no role here, just as it doesn't with messaging generally; there is way more to companies than a direct keyboard-to-output link.
And non coders use proprietary software, which usually has an export into CSV or XLS to be compatible with Microsoft Office.
I do not think there is an actual explanation besides ignorance, laziness or "it works".
Nice work!
https://learn.microsoft.com/en-us/dotnet/standard/simd
Tanner Gooding at Microsoft is responsible for a lot of the developments in this area and has some decent blogposts on it, e.g.
https://devblogs.microsoft.com/dotnet/dotnet-8-hardware-intr...
- What format exactly is it parsing? (eg. does the dialect of CSV support quoted commas, or is the parser merely looking for commas and newlines)?
- What is the parser doing with the result (ie. populating a data structure, etc)?
Now someone might counter and say that I should just read the README.MD, but then that suspicion simply turns out to be true: They don't actually do any escaping or quoting by default, making the quoted numbers an example of heavily misleading advertising.
Otherwise I agree: if you don't do escaping (a.k.a. "quoting", the same thing for CSV), you are not implementing it correctly. For example, if a line break appears inside quotes, under RFC 4180 that line break is part of the quoted string; if you don't need to handle that, you can implement CSV parsing much faster (properly handling line breaks inside quoted strings requires a 2-pass approach if you want to use many cores, while not handling them at all can be done in a single pass). I discussed this detail in https://liuliu.me/eyes/loading-csv-file-at-the-speed-limit-o...
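For anyone who hasn't hit this: the awkward part is that a newline only ends a record when the scanner is outside quotes, and that quote state is exactly what you'd have to reconstruct before splitting the input across cores. A minimal sketch of the sequential version (mine, not the code from the linked post):

```c
#include <stdbool.h>
#include <stddef.h>

/* Count RFC 4180 records: '\n' ends a record only outside quotes,
   and "" inside a quoted field is an escaped quote, not a close. */
static size_t count_records(const char *buf, size_t len) {
    bool in_quotes = false;
    size_t records = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '"') {
            if (in_quotes && i + 1 < len && buf[i + 1] == '"')
                i++;                       /* escaped quote, stay in field */
            else
                in_quotes = !in_quotes;
        } else if (buf[i] == '\n' && !in_quotes) {
            records++;                     /* record boundary */
        }
    }
    return records;
}
```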
As an example of how not to do it: XML can be assumed to be a standard, but I cannot afford to read it. DIN/ISO is great for manufacturing in theory, but bad for a field like IT that expects a zero-cost initial investment.
The HDF5 format is very good and allows far more structure in your files, as well as metadata and different types of lossless and lossy compression.
HDF5 gives you a great way to store such data.
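For a flavour of what that looks like from C (a minimal sketch against the HDF5 C API; the file name, dataset name, and shape are made up for illustration):

```c
#include <hdf5.h>

int main(void) {
    /* Made-up 2x3 integer dataset, written to a new file. */
    int data[2][3] = {{1, 2, 3}, {4, 5, 6}};
    hsize_t dims[2] = {2, 3};

    hid_t file  = H5Fcreate("example.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dset  = H5Dcreate2(file, "/measurements", H5T_NATIVE_INT, space,
                             H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    H5Dwrite(dset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

    H5Dclose(dset);
    H5Sclose(space);
    H5Fclose(file);
    return 0;
}
```

Attributes, chunking, and compression hang off the same handles via property lists, which is where the format really earns its keep over CSV.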
heh, do it again with mawk.
Excel .xls files are limited to 65,536 rows and 256 columns.
Putting something out so manager stops asking you 20 questions about the data is a double edged sword though. Those people can hallucinate more than a pre-Covid AI engine. Grafana is just weird enough that people would rather consume a chart than try to make one, then you have some control over the acid trip.
It is an interesting benchmark anyway.