> Optimizing ClickHouse for Intel's ultra-high core count processors
Which is pretty unambiguous.
> Memory optimization on ultra-high core count systems differs a lot from single-threaded memory management. Memory allocators themselves become contention points, memory bandwidth is divided across more cores, and allocation patterns that work fine on small systems can create cascading performance problems at scale. It is crucial to be mindful of how much memory is allocated and how memory is used.
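To make the allocator-contention point concrete, here's a minimal sketch (not ClickHouse's actual code, and the workload is hypothetical) of the usual mitigation: give each thread its own reusable arena so hundreds of threads aren't hammering the global allocator's shared state on every call:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical worker: processes one chunk using a per-thread arena that is
// grown once and then reused, so the global allocator is only touched when
// the arena needs to grow, not on every call.
void process_chunk(std::size_t chunk_bytes) {
    thread_local std::vector<char> arena;
    if (arena.size() < chunk_bytes)
        arena.resize(chunk_bytes);
    // ... fill and consume arena.data() here ...
}

int main() {
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < std::thread::hardware_concurrency(); ++i)
        workers.emplace_back([] {
            for (int iter = 0; iter < 1000; ++iter)
                process_chunk(1 << 20);  // 1 MiB per call, allocation amortized
        });
    for (auto& t : workers) t.join();
}
```

Allocators like jemalloc and tcmalloc do a generalized version of this with per-thread caches, which is one reason they scale better than a naive malloc at high thread counts.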
In bioinformatics, one of the most popular alignment algorithms is roughly bottlenecked on random RAM access (the FM-index on the BWT of the genome), so I always wonder how these algorithms are going to perform on these beasts. It's been a decade since I spent any time optimizing large system performance for it though. NUMA was already challenging enough! I wonder how many memory channels these new chips have access to.
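If you want to simulate that access pattern without a genome handy, a pointer chase is a reasonable stand-in: like FM-index backward search, every step is a dependent load to an effectively random cache line, so the loop runs at memory latency rather than bandwidth. A rough sketch:

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
    const std::size_t n = std::size_t(1) << 25;  // 32M slots = 256 MiB, well past LLC
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t(0));
    std::mt19937_64 rng{42};
    // Sattolo's shuffle: turns the identity permutation into one big cycle,
    // so the chase below visits every slot in random order.
    for (std::size_t i = n - 1; i > 0; --i)
        std::swap(next[i], next[rng() % i]);
    std::size_t p = 0;
    const long steps = 10'000'000;
    auto t0 = std::chrono::steady_clock::now();
    for (long s = 0; s < steps; ++s)
        p = next[p];                             // each load depends on the last
    double dt = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - t0).count();
    std::printf("%.1f ns per dependent load (p=%zu)\n", dt / steps * 1e9, p);
}
```

More cores only help a workload like this if the memory system can keep many such independent chains in flight at once, which is exactly the question for these huge parts.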
Core-to-core communication across Infinity Fabric is on the order of 50-100x slower than an L1 access. Figuring out how to arrange your problem to meet this reality is the quickest path to success if you intend to leverage this kind of hardware. Recognizing that your problem is incompatible can also save you a lot of frustration. If your working sets must be massive monoliths that are hierarchical in nature, it's unlikely you will be able to use a 256+ core monster part very effectively.
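The standard shape for these parts is "shard, then combine": keep each core's working set private so cache lines never ping-pong across the fabric, and pay the cross-core cost exactly once in a final reduction. A toy sketch of the pattern:

```cpp
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<std::uint64_t> data(1 << 24, 1);
    unsigned n = std::thread::hardware_concurrency();
    std::vector<std::uint64_t> partial(n, 0);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < n; ++t)
        workers.emplace_back([&, t] {
            std::size_t begin = data.size() * t / n;
            std::size_t end   = data.size() * (t + 1) / n;
            std::uint64_t local = 0;   // lives in a register / private L1
            for (std::size_t i = begin; i < end; ++i)
                local += data[i];
            partial[t] = local;        // one cross-visible write per thread
        });
    for (auto& w : workers) w.join();
    // The only cross-core communication is this final single-threaded pass.
    std::uint64_t total = std::accumulate(partial.begin(), partial.end(), 0ull);
    std::printf("%llu\n", (unsigned long long)total);
}
```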
One of the most interesting and poorly exploited features of these new Intel chips is that four cores share an L2 cache, so cooperation among 4 threads can have excellent efficiency.
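You can exploit that explicitly with affinity. A Linux-specific sketch that pins four cooperating threads onto what is assumed to be one four-core module; the assumption that logical CPUs 0-3 share an L2 is topology-dependent, so verify it against /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list before trusting it:

```cpp
#include <pthread.h>
#include <sched.h>
#include <thread>
#include <vector>

// Pin a std::thread to a single logical CPU (glibc/Linux only).
void pin_to_cpu(std::thread& t, int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

int main() {
    std::vector<std::thread> group;
    for (int cpu = 0; cpu < 4; ++cpu) {
        group.emplace_back([] { /* cooperate on a shared tile here */ });
        pin_to_cpu(group.back(), cpu);  // all four land on one L2 module
    }
    for (auto& t : group) t.join();
}
```

With all four threads on one module, data they share stays in the common L2 instead of bouncing through L3 or the fabric.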
They also have user-mode address monitoring, which should be awesome for certain tricks, but unfortunately, like so many other ISA extensions, it doesn't work. https://www.intel.com/content/www/us/en/developer/articles/t...
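For reference, this is how the WAITPKG pair is supposed to be used, going by the intrinsics guide (a sketch only; compile with -mwaitpkg, and as noted above, behavior on shipping parts has been disappointing):

```cpp
#include <immintrin.h>  // _umonitor, _umwait (requires -mwaitpkg)
#include <x86intrin.h>  // __rdtsc
#include <atomic>

std::atomic<int> flag{0};

// Spin-free wait: arm a monitor on flag's cache line, then sleep in the
// lightweight C0.1 state until either the line is written or the TSC
// deadline passes (_umwait returns nonzero if the deadline fired first).
void wait_for_flag() {
    while (flag.load(std::memory_order_acquire) == 0) {
        _umonitor((void*)&flag);                        // arm the monitor
        if (flag.load(std::memory_order_acquire) != 0)  // re-check: no lost wakeup
            break;
        _umwait(1 /* C0.1 */, __rdtsc() + 100000);      // arbitrary deadline
    }
}

int main() { flag.store(1); wait_for_flag(); }
```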
For BioInformatics specifically, I’ve just finished benchmarking Intel SPR 16-core UMA slices against Nvidia H100, and will try to extend them soon: https://github.com/ashvardanian/StringWa.rs
So you could replay the entire history of the book just by stepping through the rows.
If I were forced at gunpoint to choose one of the two, type or name, "obviously" I would also choose type.
Do these things have AVX512? It looks like some of the Sierra Forest chips do have AVX512 with 2xFMA…
That’s pretty wide. Wonder if they should put that thing on a card and sell it as a GPU (a totally original idea that has never been tried, sure…).
Intel split their server product line in two:
* Processors that have only P-cores (currently, Granite Rapids), which do have AVX512.
* Processors that have only E-cores (currently, Sierra Forest), which do not have AVX512.
On the other hand, AMD's high-core-count, area-optimized offerings, like Zen 4c (Bergamo), do support AVX512, which IMO makes things easier.
On Zen 4 and Zen 4c the registers are 512 bits wide. Internally, however, many of the datapaths behind the AVX-512 functional units (execution units, floating-point units, vector ALUs, etc.) are only 256 bits wide…
Zen 5 is supposed to be different. Again, I wrote the kernels for Zen 5 last year, but I still have no hardware to profile the impact of this implementation difference on practical systems :(
On Zen 4 and Zen 4c, for most vector instructions the vector datapaths have the same width as in Intel's best Xeons, i.e. they can do two 512-bit instructions per clock cycle.
The exceptions where AMD has half throughput are the vector load and store instructions from the first level cache memory and the FMUL and FMA instructions, where the most expensive Intel Xeons can do two FMUL/FMA per clock cycle while Zen 4/4c can do only 1 FMUL/FMA + 1 FADD per clock cycle.
So only the link between the L1 cache and the vector registers and also the floating-point multiplier have half-width on Zen 4/4c, while the rest of the datapaths have the same width (2 x 512-bit) on both Zen 4/4c and Intel's Xeons.
The server and desktop variants of Zen 5/5c (and also the laptop Fire Range and Strix Halo CPUs) double the width of all vector datapaths, exceeding the throughput of all past or current Intel CPUs. Only the server CPUs expected to be launched in 2026 by Intel (Diamond Rapids) are likely to be faster than Zen 5, but by then AMD might also launch Zen 6, so it remains to be seen which will be better by the end of 2026.
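If anyone wants to verify where their own part lands, a dependency-chain microbenchmark makes the FMA throughput difference visible without vendor docs. A sketch (assumes AVX-512F; compile with -O2 -mavx512f): on a true 2x512-bit FMA core you'd expect roughly 32 double-precision FLOPs per cycle, and about half that on a 1x512-bit design.

```cpp
#include <immintrin.h>
#include <chrono>
#include <cstdio>

int main() {
    const long iters = 100'000'000;
    const __m512d a = _mm512_set1_pd(1.0000001);
    const __m512d b = _mm512_set1_pd(0.9999999);
    // Eight independent chains hide FMA latency, so throughput is the limit.
    __m512d c0 = b, c1 = b, c2 = b, c3 = b, c4 = b, c5 = b, c6 = b, c7 = b;
    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i) {
        c0 = _mm512_fmadd_pd(c0, a, b);
        c1 = _mm512_fmadd_pd(c1, a, b);
        c2 = _mm512_fmadd_pd(c2, a, b);
        c3 = _mm512_fmadd_pd(c3, a, b);
        c4 = _mm512_fmadd_pd(c4, a, b);
        c5 = _mm512_fmadd_pd(c5, a, b);
        c6 = _mm512_fmadd_pd(c6, a, b);
        c7 = _mm512_fmadd_pd(c7, a, b);
    }
    double dt = std::chrono::duration<double>(
        std::chrono::steady_clock::now() - t0).count();
    // 8 chains x 8 lanes x 2 FLOPs (mul + add) per FMA.
    double flops = 2.0 * 8 * 8 * double(iters);
    // Fold the accumulators into the output so the loop isn't optimized away.
    __m512d s = _mm512_add_pd(_mm512_add_pd(c0, c1), _mm512_add_pd(c2, c3));
    s = _mm512_add_pd(s, _mm512_add_pd(_mm512_add_pd(c4, c5),
                                       _mm512_add_pd(c6, c7)));
    double lanes[8];
    _mm512_storeu_pd(lanes, s);
    double sink = 0;
    for (double v : lanes) sink += v;
    std::printf("%.1f GFLOP/s (sink=%g)\n", flops / dt / 1e9, sink);
}
```

Divide the GFLOP/s figure by core clock to get FLOPs/cycle; the 256-bit vs 512-bit FMA datapath difference shows up directly there.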
SimSIMD (inside USearch (inside ClickHouse)) already has those SIMD kernels, but I don’t yet have the hardware to benchmark :(
Way back in the day, I built and ran the platform for a business on Pentium-grade web and database servers, which gave me 1 "core" per 2 rack units.
That's 24 cores per 48U rack, so 288 cores would be a dozen racks, pretty much an entire aisle of a typical data center.
I guess all of Palo Alto Internet eXchange (where two of my boxen lived) didn't have much more than a couple of thousand cores back in 98/99. I'm guessing there are homelabs with more cores than that entire PAIX data center had back then.
If you're doing a lot of loading and storing, these E-core chips will probably outperform the chips with huge cores, since big cores would spend most of that time idle waiting on memory anyway. For CPU-bound tasks, the P-cores will win hands down.
A CPU on a PCIe card sounds a lot like the Intel Xeon Phi... I've wondered if that could boost something like an Erlang mesh cluster...
So is 2 GB of storage.
And 2K of years.
https://www.titancomputers.com/Titan-A900-Octane-Dual-AMD-EP...
320 cores starts at $28,000; $34k with 1 TB of memory.
I like DuckDB, but ClickHouse seems more focused on large-scale performance.
I just noticed that the article is written from the point of view of a single person but has multiple authors, which is a bit weird. Did I misunderstand something?
It seems today's Intel CPU can replace yesteryear's data center.
Maybe someone can try, for fun, running 1000 Red Hat Linux 6.2 instances in parallel on one CPU, like it's the year 2000 again.