AmpereOne: Cores Are the New MHz
144 points
20 days ago
| 17 comments
| jeffgeerling.com
| HN
cbmuser
20 days ago
[-]
Really a pity that Oracle killed off SPARC. They already had 32-core CPUs almost a decade ago but Oracle never really understood the value that SPARC and Solaris brought to the table.
reply
icedchai
20 days ago
[-]
I was a huge Sun fan for most of the 90's and had several Sun systems at home. By the early 2000's, Sun was no longer competitive. Between Intel and Linux, the writing was on the wall. I haven't seen a Sun system used in production since 2003.
reply
shrubble
20 days ago
[-]
They are still being used in telecom. Not the new SPARC M7 series, but what was shipping in 2004-2005 timeframe is still in the rack and doing stuff (which is pretty crazy).
reply
elcritch
20 days ago
[-]
Bit of a pity, now that the most advanced process nodes would be available to them via TSMC.

Solaris seemed to be much more solid than linux overall.

reply
acdha
19 days ago
[-]
Solaris was more stable in the mid-90s but that advantage had flipped by the turn of the century. It was an order of magnitude more expensive to support, especially at scale because Linux was far ahead on packaging and configuration management by 1998 or so, which in practice meant that no two Solaris boxes were the same unless your organization was willing to spend a lot of money on staffing.

Solaris 10 was intended as a catch-up release but the package manager took years to mature. I remember getting racks of systems in the mid-2000s where simply running the updater on brand new servers rendered them unbootable. By then the writing was on the wall.

SPARC was similar: the hardware had some neat aspects, but what they really needed was the equivalent of a first-class LLVM backend. Statistically nobody was hand-tuning assembly code to try to make it competitive with x86, especially since that meant tuning not only your own code but most other libraries too. The reason ARM is doing so well now is that two decades in which the most popular computing devices are phones means you can just assume that anything popular will compile and run well on ARM.

reply
simtel20
19 days ago
[-]
Yep, packaging was a nightmare. The fix for a bug in cron could require you to apply a kernel patch and reboot. But not if you had already applied a different kernel patch beforehand, one that was only available for particular hardware. And if you just installed the patch it might appear to succeed! But it didn't really apply, because the solver that figures out your patch set ran separately and wasn't fast enough to be part of the patch process, while the patch had its own script doing its own checks to decide that. Oh, and if you waited a week your process might break, because the -20 revision of the patch no longer fixed your problem but superseded the -12 revision you were using anyway. And whichever one you applied, you'd have to reboot.

Like, to update cron. Even though, across 3 revisions of the same patch ID, it both does and doesn't fix cron.

Pure insanity with no product focus on user experience.

reply
mrbluecoat
20 days ago
[-]
Ironically, Oracle seems to be the only cloud currently offering Ampere compute.
reply
geerlingguy
20 days ago
[-]
Azure seems to have Ampere-based offerings[1], as does Hetzner[2].

[1] https://azure.microsoft.com/en-us/blog/azure-virtual-machine...

[2] https://www.hetzner.com/press-release/arm64-cloud/

reply
mrbluecoat
20 days ago
[-]
Oh cool, I thought they had discontinued them.
reply
UltraSane
20 days ago
[-]
SPARC was getting quite slow compared to other architectures and it would have cost a fortune to keep it competitive.
reply
Tuna-Fish
20 days ago
[-]
... SPARC was terrible. Really just utter excrement. Having a lot of cores is not a good thing when the cores are too slow. This is an interesting CPU because the cores are actually usably fast.

Nearly all programs have a constant memory usage regardless of how fast the core running them is. If you deliberately make your cores half the speed and double their number, you have approximately doubled the cost of the memory you need. In aggregate, memory already costs so much more than CPUs that this is rarely if ever useful, even if the CPU were free.

The starting point of a design that tiles a lot of cores just needs to be one of the fastest cores available, or it is not commercially viable.

reply
loudmax
20 days ago
[-]
I found this part particularly interesting:

> Also, with 512 gigs of RAM and a massive CPU, it can run a 405 billion parameter Large Language Model. It's not fast, but it did run, giving me just under a token per second.

If you're serious about running LLMs and you can afford it, you'll of course want GPUs. But this might be a relatively affordable way to run really huge models like Llama 405B on your own hardware. This could be even more plausible on Ampere's upcoming 512-core CPU, though RAM bandwidth might be more of a bottleneck than CPU cores. Probably a niche use case, but intriguing.
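
A rough back-of-envelope (my numbers, not from the article) suggests why: token generation is mostly memory-bandwidth bound, since every weight gets read once per generated token. Assuming a ~4-bit quant and ballpark eight-channel DDR5 bandwidth:

  # All figures are assumptions for illustration, not measurements.
  params = 405e9                    # Llama 3.1 405B
  bytes_per_param = 0.56            # ~4.5 bits/param for a Q4-style quant (assumed)
  weight_bytes = params * bytes_per_param       # ~227 GB of weights

  mem_bandwidth = 330e9             # ~330 GB/s assumed for 8-channel DDR5
  print(round(mem_bandwidth / weight_bytes, 2), "tok/s upper bound")   # ~1.46

Which lines up with the "just under a token per second" result, and is why the 512-core part won't help decode speed much unless memory bandwidth grows with it.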

reply
revnode
20 days ago
[-]
It's really slow. Like, unusably slow. For those interested in self-hosting, this is a really good resource: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...
reply
johnklos
20 days ago
[-]
You know, there's nothing wrong with running a slow LLM.

For some people, they lack the resources to run an LLM on a GPU. For others, they want to try certain models without buying thousands of dollars of equipment just to try things out.

Either way, I see too many people putting the proverbial cart before the horse: they buy a video card, then try to fit LLMs into the limited VRAM they have, instead of playing around, even if at 1/10th the speed, and figuring out which models they want to run before deciding where they want to invest their money.

One token a second is worlds better than running nothing at all because someone told you that you shouldn't or can't because you don't have a fancy, expensive GPU.

reply
zozbot234
20 days ago
[-]
> For some people, they lack the resources to run an LLM on a GPU.

Most people have a usable iGPU; it's going to run most models significantly slower than the CPU (because of less available memory throughput, and/or more of it being wasted on padding) but a lot cooler. NPUs will likely be a similar story.

It would be nice if there was an easy way to only run the initial prompt+context processing (which is generally compute bound) on iGPU+NPU, but move to CPU for the token generation stage.

reply
elcritch
20 days ago
[-]
Change your interface to the LLM to email. Then you're just sending emails and get your answer back in 15 min. For many cases that'd be useful.
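
As a minimal sketch of the idea (host names, credentials, model name and the Ollama-style local endpoint are all placeholders/assumptions, and the mail parsing is deliberately naive):

  import email, imaplib, json, smtplib, time, urllib.request
  from email.message import EmailMessage

  IMAP_HOST = SMTP_HOST = "mail.example.com"          # placeholder
  USER, PASSWORD = "llm@example.com", "app-password"  # placeholder

  def ask_local_llm(prompt):
      # Assumes an Ollama-style server on localhost; adjust for whatever you run.
      req = urllib.request.Request(
          "http://localhost:11434/api/generate",
          data=json.dumps({"model": "llama3.1:405b", "prompt": prompt,
                           "stream": False}).encode(),
          headers={"Content-Type": "application/json"})
      with urllib.request.urlopen(req, timeout=3600) as resp:  # an hour is fine here
          return json.loads(resp.read())["response"]

  def answer_unread_mail():
      imap = imaplib.IMAP4_SSL(IMAP_HOST)
      imap.login(USER, PASSWORD)
      imap.select("INBOX")
      _, ids = imap.search(None, "UNSEEN")
      for num in ids[0].split():
          _, data = imap.fetch(num, "(RFC822)")
          msg = email.message_from_bytes(data[0][1])
          part = msg.get_payload(0) if msg.is_multipart() else msg  # naive: plain-text mail only
          body = part.get_payload(decode=True).decode(errors="replace")
          reply = EmailMessage()
          reply["From"], reply["To"] = USER, msg["From"]
          reply["Subject"] = "Re: " + (msg["Subject"] or "")
          reply.set_content(ask_local_llm(body))
          with smtplib.SMTP_SSL(SMTP_HOST) as smtp:
              smtp.login(USER, PASSWORD)
              smtp.send_message(reply)
      imap.logout()

  while True:
      answer_unread_mail()
      time.sleep(60)   # nobody emailing a 405B model is in a hurry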
reply
talldayo
20 days ago
[-]
> One token a second is worlds better than running nothing at all because someone told you that you shouldn't or can't because you don't have a fancy, expensive GPU.

It's a minuscule pittance, on hardware that costs as much as an AmpereOne.

reply
cat5e
20 days ago
[-]
Great point
reply
zozbot234
20 days ago
[-]
It's not "really slow" at all, 1 tok/sec is absolutely par for the course given the overall model size. The 405B model was never actually intended for production use, so the fact that it can even kinda run at speeds that are almost usable is itself noteworthy.
reply
geerlingguy
20 days ago
[-]
It's a little under 1 token/sec using ollama, but that was with stock llama.cpp — apparently Ampere has their own optimized version that runs a little better on the AmpereOne. I haven't tested it yet with 405b.
reply
lostmsu
19 days ago
[-]
This resource looks very bad to me as they don't check batched inference at all. This might make sense now, when most people are just running a single query at a time, but pretty soon almost everything will be running queries in parallel to take advantage of the compute.
reply
menaerus
19 days ago
[-]
How do you run multiple queries from multiple clients simultaneously on the same HW without affecting each other's context?
reply
lostmsu
18 days ago
[-]
It depends on the framework. Here's a LlamaSharp example: https://github.com/SciSharp/LLamaSharp/blob/master/LLama.Exa...
reply
menaerus
18 days ago
[-]
My question wasn't about how to run multiple queries against the LLM but rather how is it even possible from transformer architecture PoV to have a single LLM hosting multiple and different end clients. I'm probably missing something but can't figure that out yet.
reply
lostmsu
18 days ago
[-]
If you have a branchless program, you can execute the same step of the program on multiple different inputs. https://en.wikipedia.org/wiki/SIMD
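
Concretely: the weights are shared, read-only, across every request, and each request only owns its own KV cache, so the contexts never mix. A toy single-layer sketch in numpy (shapes and the "model" are purely illustrative):

  import numpy as np

  d = 64                                            # model width (made up)
  Wq, Wk, Wv = (np.random.randn(d, d) * 0.02 for _ in range(3))   # shared weights

  class Seq:                                        # one per client
      def __init__(self):
          self.K = np.zeros((0, d))                 # private KV cache
          self.V = np.zeros((0, d))

  def decode_step(seqs, xs):                        # xs: (batch, d), one token each
      q, k, v = xs @ Wq, xs @ Wk, xs @ Wv           # batched over all clients at once
      outs = []
      for s, qi, ki, vi in zip(seqs, q, k, v):
          s.K = np.vstack([s.K, ki]); s.V = np.vstack([s.V, vi])
          a = s.K @ qi / np.sqrt(d)                 # attention only over *this* client's cache
          w = np.exp(a - a.max()); w /= w.sum()
          outs.append(w @ s.V)
      return np.stack(outs)

  clients = [Seq() for _ in range(4)]
  for _ in range(3):                                # 3 decode steps for 4 clients
      out = decode_step(clients, np.random.randn(4, d))
  print(out.shape)                                  # (4, 64)

Real serving stacks add smarter batching and paged KV caches on top, but the isolation comes from exactly this: shared weights, separate caches.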
reply
worik
20 days ago
[-]
> this is a really good resource: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...

Yes, it is.

But it has not been updated for seven months. Do things change so slowly?

reply
EVa5I7bHFq9mnYK
20 days ago
[-]
I would never chatgpt my code because I don't want to send it to Microsoft. Slow is better than nothing.
reply
MrDrMcCoy
20 days ago
[-]
Bummer that they have no stats for AMD, Intel, Qualcomm, etc (C|G|N|X)PUs.
reply
aldonius
20 days ago
[-]
> I'm also happy to see systems like this that run without exotic water cooling and 240V power.

Always funny to see 240V power described as exotic (or at least exotic-adjacent). It's the standard across practically the entire world except North and Central America!

reply
remram
20 days ago
[-]
They actually distribute 240V in USAmerica, one neutral wire and two opposite-phase 120V wires. Outlets in homes are 120V but wiring 240V is not difficult.

https://en.wikipedia.org/wiki/Split-phase_electric_power

This is different from Europe where we usually distribute 3 phases 130V.

reply
just6979
19 days ago
[-]
> but wiring 240V is not difficult.

Most homes don't have more than a few 240v circuits, and they are usually always dedicated to a single device. My house has 4: range, oven, drier, water heater, all with only a single outlet. The concept of wiring split-phase 240V is not difficult, but getting 4-conductor 12-gauge wiring run to a new location in an existing building is not exactly easy.

So, the power delivery type/format _isn't_ exotic, but the ability to use it in arbitrary locations in a [residential] building kind of is, compared to 120V being in every room (though perhaps not 120V/20A in every room).

reply
BizarroLand
16 days ago
[-]
I have 4 as well, but in my case it's range/oven, dryer, water heater, and a 240V outlet in my garage that would be great for either recharging an electric vehicle or powering high-voltage tools like a welder/plasma cutter.
reply
ssl-3
19 days ago
[-]
Sure, but wiring 240v is still not difficult -- even in the States. The physical act of wiring 240v is on par with that of wiring 120v.
reply
ta988
20 days ago
[-]
3 phases 230v you mean?
reply
lioeters
19 days ago
[-]
> In Europe, three-phase 230/400 V is most commonly used. However, 130/225 V, three-wire, two-phase electric power discontinued systems called B1 are used to run old installations in small groups of houses when only two of the three-phase high-voltage conductors are used.
reply
ta988
18 days ago
[-]
I know how to wikipedia too! They said usually, not "in some really rare cases"
reply
geerlingguy
20 days ago
[-]
To be clear the water cooling part is what I meant to be 'exotic'. 240V is pretty prevalent in datacenters or network closets when specced, but is rarely seen outside of large appliance installs in US homes and small offices.
reply
UltraSane
20 days ago
[-]
Every home in the US with an electric stove or water heater is using 240v electricity. It isn't really very exotic. 3 phase power to homes IS very exotic though and I'm not sure why it would be useful in homes.
reply
kiney
20 days ago
[-]
3 phase power is near universal in central and northern Europe. And it's useful for powering saunas, electric water heating, or bigger tools.
reply
UltraSane
20 days ago
[-]
why? The benefits of 3-phase power are really only used by factories with lots of large electric motors. 3 phase has no benefit for resistance heating.
reply
korhojoa
20 days ago
[-]
You can move more power per conductor with three conductors than with two.

As a sidenote, my apartment has a 3x63A main fuse. Three-phase is everywhere here. Really convenient for ev charging too.

reply
UltraSane
19 days ago
[-]
Delivering 100 amps over 240 volt is not an issue.
reply
worthless-trash
20 days ago
[-]
And welding .. nice when you have to use it.
reply
ssl-3
19 days ago
[-]
In Norway, I understand that electric ranges are commonly cabled up with 3-phase wire that's about the size of a not-special extension cord here in the States. They accomplish this by using 400v.

Meanwhile, also here in the States: The wiring for my very normal (240v) electric range is the size of a baby's arm and might be even stiffer. (That last claim may be a bit hyperbolic: I've bent at least my share of 6/3 wire, but I don't think I've ever bent a baby's arm.)

reply
UltraSane
18 days ago
[-]
The 400v is the important part. Not common at all in US homes but is very common in US factories.
reply
methyl
20 days ago
[-]
Charging electric vehicles.
reply
unscaled
20 days ago
[-]
That would actually be 230V in most countries, to be nitpicky. And the split-phase nature of 240V in the US (or 200V here in Japan) does make it somewhat more exotic. I guess wiring to code may be more expensive when you've got two hot wires?
reply
drmpeg
20 days ago
[-]
Hot Chips 2024 talk on AmpereOne.

https://www.youtube.com/watch?v=kCXcZf4glcM

reply
ksec
20 days ago
[-]
I worry that AmpereOne isn't competitive with Arm's own core designs. Its 512-core CPU is only equivalent to a 256-core x86-64 / Zen 6 part with 2 threads per core, which means AmpereOne may still not be competitive against AMD's Zen 6c unless they massively improve their core IPC.

But other than that, we are only 2 years away from getting this into our hands: a single server / node with 512 vCPUs, or maybe 1024 vCPUs if we go dual socket, along with PCIe 6.0 SSDs. We could rent one or two of these servers and forget about scaling issues for 95%+ of use cases.

Not to mention nearly all dynamic languages have gotten a lot faster with JITs.

reply
znpy
20 days ago
[-]
Weird, but this makes me think x86-64 might actually be better?

It isn't mentioned anywhere in __this__ article, but the power draw of that chip is 276W. I got this from Phoronix [1]:

> The AmpereOne A192-32X boasts 192 AmpereOne cores, a 3.2GHz clock frequency, a rated 276 Watt usage power

Which is interesting because it's almost half of AMD's 192-core offering [2].

Why is this interesting? the AMD offering draws a little less than double the wattage but has hyper-threading (!!!) meaning you get 384 threads... So this means that AMD is essentially on par with ARM cpus (at least with Ampere's ARM cpus) in terms of power efficiency... Maybe a little better.

I'd be more inclined to think that AMD is the king/queen of power efficiency in the datacenter rather than ARM/Ampere [3].

notes:

[1]: https://www.phoronix.com/review/ampereone-a192-32x

[2]: https://www.youtube.com/watch?v=S-NbCPEgP1A

[3]: regarding Graviton/Axion or "alleged" server-class Apple silicon... their power draw is essentially irrelevant as they're all claims that cannot be tested and evaluated independently, so they don't count in my opinion.

reply
jsheard
20 days ago
[-]
Hyperthreading doesn't even get close to doubling actual performance. It depends on the workload, but AMD's Zen 5 gains about 15% from HT on average according to Phoronix's benchmarks.

https://www.phoronix.com/review/amd-ryzen-zen5-smt/8
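
So, roughly (using that ~15% figure and ignoring per-core IPC differences entirely):

  physical_cores = 192
  smt_uplift = 1.15                            # ~15% average gain per Phoronix
  print(round(physical_cores * smt_uplift))    # ~221 "core-equivalents", not 384

384 hardware threads behave more like ~221 cores' worth of throughput, not 384.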

reply
phkahler
20 days ago
[-]
In practice on Zen1 and Zen3 I found HT to provide 20% to 25%.

It seems some benchmarks are as high as 50% with zen5.

AMD really improved HT with Zen 5 by adding a second instruction decoder per core. I expect even better per-thread performance from Zen 6.

reply
Twirrim
20 days ago
[-]
There's also the added fun that using hyperthreading/SMT cores impacts your workload on the corresponding primary cores as well. Depending on your workload SMT can end up being a throughput/latency trade off you might not have been aware of.

Ampere doesn't do SMT, so you get comparable performance across the board.

reply
geerlingguy
20 days ago
[-]
I mentioned it a couple times in the article, but here's a bullet point right in the conclusion:

> "Arm is more efficient": that's not always true—AMD just built the most efficient 192 core server this year, beating Ampere

I also talk about idle power draw being very high in comparison to AMD.

Since I don't have a Turin system to test I didn't do direct wattage comparisons but recommended the linked Phoronix comparison.

reply
loudmax
20 days ago
[-]
Jeff says as much. The AMD EPYC core does offer better performance per watt, which says more about AMD than it does about Ampere.

But also this: "The big difference is the AmpereOne A192-32X is $5,555, while the EPYC 9965 is almost $15,000!"

So you have to look at TCO over a period of years, but that's still quite significant.

reply
magicalhippo
20 days ago
[-]
Dad of a friend ran a store selling designer underwear for men. He sold boxer shorts for $70 over 25 years ago.

He once told me that he also had a rack of thongs. Now, the thongs didn't have a lot of material, obviously, and were inexpensive to manufacture, so if he'd do his regular markup on them they'd end up selling for $15.

"However, notice the price tag is $25. I add an extra $10, because if a guy's looking to buy a thong, he's going to buy a thong".

I think about what he said when I see the chip prices on these high-end server CPUs.

reply
touisteur
20 days ago
[-]
Wondering who actually pays public price on those AMD chips in big-OEM servers. Not saying the AmpereOne isn't also discount-able too, but these public prices always feel like signalling, to me, more than a reference to the actual price. Or maybe I'm lucky...
reply
znpy
18 days ago
[-]
> $5k vs $15k

> So you have to look at TCO over a period of years, but that's still quite significant.

Your point is valid. However, if we're talking TCO, we should also factor in rack space, power (and power efficiency), cooling, cabling, ports on the ToR switch, and so on.

The AMD solution halves pretty much everything around the CPU, and in order to get the same thread count you should really compare 2x$5k vs $15k.

I get that SMT does have performance hits, but such performance penalties mostly show up when you're doing number crunching all day every day. In real world scenarios, where software spends a lot of time waiting either for disk i/o or for network i/o I would expect that to be a non-issue.

AMD's offering is still competitive, in my opinion.

reply
incrudible
20 days ago
[-]
The 9965 is a much faster chip than the Ampere, but it is also at the point of diminishing returns. The Ampere struggles to match much cheaper AMD offerings in the Phoronix suite; it is not better value at all.
reply
burnte
20 days ago
[-]
In the video Jeff mentions that the EPYC CPU wins in performance per watt but not per dollar because it's 3x the cost.
reply
monlockandkey
20 days ago
[-]
$15,000 CPU is better than a $5000 CPU?

You can do all the benchmarks you want, but if you don't factor in price, then of course the more expensive product (in general) is going to be better.

It is the same thing with the Snapdragon Elite and Mediatek Dimensity 9400. The SDE is the faster processor and more expensive.

reply
klelatti
20 days ago
[-]
> but has hyper-threading (!!!) meaning you get 384 threads... So this means that AMD is essentially on par with ARM cpus

So you’re saying ‘essentially’ an AMD thread is the same as an ampere core?!

reply
wmf
20 days ago
[-]
Always has been. Most ARM cores are closer to x86 E-cores.
reply
klelatti
20 days ago
[-]
Ampere cores are not ‘most Arm cores’
reply
alfiedotwtf
20 days ago
[-]
Last time I looked at Ampere it was way less power hungry than the competition. EPYC is over 250W at idle!

Off the top of my head, the Ampere 128 core was 250W at max load. Big difference.

Only issue I had was cache size, and you could only buy them from NewEgg.

Oh… and the dual socket Gigabyte board had the sockets so close together that if you put a big-ass heat sink and fan on each, the outlet of one fan will go straight into the inlet of the other!

reply
zamadatix
20 days ago
[-]
I think you might be comparing your memories of Ampere's CPU power draw to either a full Epyc server's wall power draw or stated TDP. The 128 core EPYC 9754 in the chart linked above has a min of 10 W and a max draw of 397 W even though it outperforms the new 192 core variant AmpereOne A192-32X which had a min of 102 W and a max of 401 W.

The min on the A192 does seem anomalously high though. That said, even the M128-30 has a min of 21 W or about twice that of the AmpereOne.

reply
alfiedotwtf
20 days ago
[-]
Hmmm. Weird, because I had definitely written off EPYC because of idle draw. Ok thanks, that's back on the table now!!
reply
nagisa
20 days ago
[-]
I was never able to get my 2nd gen EPYC 7642 to idle under 100W. Most of that power goes to the I/O die. The cores themselves manage their power very well. While I have all memory slots as well as a number of PCIe lanes occupied (more than the 24 available from consumer chips, but probably less than 48...), I routinely think about side-grading to something else just to reduce the idle power consumption.

I would be hard pressed to believe that another EPYC would be able to reduce I/O die power consumption significantly, given that AMD hasn't really made many major changes to it over the generations.

reply
alfiedotwtf
17 days ago
[-]
Sorry… another question - what would you recommend these days for a home machine in an office (i.e. not a loud and hot rack, but something for a quiet yet hugely beefy workstation that's not $9k per CPU)? Preferably the more cores the better, since my workload scales horizontally quite well (that's why I looked at Ampere in the first place a while ago and almost did pull the trigger).
reply
alfiedotwtf
19 days ago
[-]
Thanks, and damn I guess…

Maybe I should wait another 18 month cycle

reply
ChocolateGod
20 days ago
[-]
> has hyper-threading (!!!) meaning you get 384 threads

Isn't SMT turned off in shared-tenant scenarios now due to security concerns?

reply
wmf
20 days ago
[-]
No, you can use core scheduling to assign both threads to the same tenant.
reply
cainxinth
19 days ago
[-]
> High-core-count servers are the cutting edge in datacenters, and they're so insane, most software doesn't even know how to handle it.

Why not? We hit the single thread wall more than a decade ago. Why isn’t more software parallelized by now?

reply
zifpanachr23
18 days ago
[-]
Not all software that people want to run can be efficiently parallelized. There's still a place for larger caches / fatter cores instead of blowing up core count.

In traditional web serving tasks and particularly for commercial cloud providers (usually renting out VM time), this isn't the case, and so I'm sure Ampere will be successful in that space.

reply
nocsi
19 days ago
[-]
There's overhead with parallelization: security issues, race conditions and resource contention. I think the big one is that it's an ass to debug.
reply
bzzzt
19 days ago
[-]
Almost all service middleware is capable of running tasks in multiple threads so when serving lots of users that's a very easy way to get parallelism. If you're just copying data around (like in an HTTP API implementation) you're mostly I/O bound anyway and adding more compute won't help that much so there's no return on investment.
reply
NotSammyHagar
20 days ago
[-]
Why do these new machines have 192 cores, not a power of 2? (3* 2^6) = 192. The article mentions 256 and 512 cores coming. I get that various components can limit scaling like say bus speed or capacity. But 192, that's wacky.
reply
polynomial
20 days ago
[-]
Don't ask GPT:

> Architectural design: The physical layout of the chip may favor certain core arrangements. For example, a 48-core processor could be designed as a 6x8 grid, which efficiently utilizes the square shape of the silicon die6.[1]

Major Deep Thought vibes.

[1] https://www.perplexity.ai/search/why-would-a-cpu-have-48-cor...

reply
bluedino
20 days ago
[-]
> Using the 6 U.2 slots, you could cache a hundred terabytes of data at the edge, all at ridiculously high speeds.

Who's going to try this? Not Netflix but maybe Cloudflare or a CDN?

reply
phendrenad2
19 days ago
[-]
> Cores are the new megahertz, at least for enterprise servers

I agree. It's too bad the MHz race ended though. Us personal PC users benefit from raw performance more.

reply
BizarroLand
16 days ago
[-]
I wonder if one day we will figure out a method by which the kernel can analyze the software it is running and then automatically distribute the workload in an ideal manner based on resources.

Probably a moot point, but most computers leave so much performance on the table due to poor software optimization and not taking advantage of the system's full capabilities, because software has to work on essentially all computers rather than laser-focusing on a specific hardware set.

reply
ThinkBeat
20 days ago
[-]
How much does one of these servers cost?

Is the estimate that it is cheaper based on a comparable server with the same or similar core count?

reply
voxadam
20 days ago
[-]
From the article:

> the AmpereOne A192-32X is $5,555, while the EPYC 9965 is almost $15,000!

reply
incrudible
20 days ago
[-]
That is 192 EPYC cores though. The EPYC 9654 costs half as much as the A192-32x, has half as many cores, but still beats the Ampere in the geometric mean of the Phoronix suite.

https://www.phoronix.com/review/ampereone-a192-32x/12

reply
menaerus
19 days ago
[-]
> The EPYC 9654 costs half as much as the A192-32x

Where do you see EPYC 9654 being sold for 3k$? Only price I can find online is about the same as A192-32x, around 5k.

reply
cbmuser
20 days ago
[-]
Sounds like x86-64 is going to start losing some of its market share very soon.
reply
wmf
20 days ago
[-]
Half of all CPUs in AWS are ARM already.

Update: "Over the last two years, more than 50 percent of all the CPU capacity landed in our datacenters was on AWS Graviton." So not half of all but getting there. https://www.nextplatform.com/2024/12/03/aws-reaps-the-benefi...

reply
droideqa
20 days ago
[-]
I just looked it up - that is a mistaken statistic. 50% of their CPUs are not Arm, but AWS has 50% of all server-side Arm CPUs.

"But that total is beaten by just one company – Amazon – which has slightly above 50 percent of all Arm server CPUs in the world deployed in its Amazon Web Services (AWS) datacenters, said the analyst."[0]

[0]: https://www.theregister.com/2023/08/08/amazon_arm_servers/

reply
Twirrim
20 days ago
[-]
In part that'll be because they mandated every service team migrate to ARM. Service teams had to have extensive justification to avoid it. With good reason, too: the driver for the effort was the significant cost savings.
reply
everfrustrated
20 days ago
[-]
And one presumes also gave them incredible leverage on negotiating future chip buys from AMD/Intel. There must be so much fear of not getting their chips into AWS by now that I can only imagine they're selling near cost.
reply
ksec
20 days ago
[-]
Not to mention no one pays AMD or Intel the suggested retail price for those CPUs. You could expect any larger order from Dell or M$ / Google to be 50% off those prices. Of course Ampere would offer some discount as well, but when you put it all together with the performance, the difference isn't as big as most claim.
reply
SahAssar
20 days ago
[-]
That quote says that half of all new CPUs are Graviton. Very different.
reply
hkchad
20 days ago
[-]
reply
penguin_booze
19 days ago
[-]
Traditionally, one of the salient features of ARM systems was that they didn't have fans. Here we are now.
reply
BobbyTables2
19 days ago
[-]
True in more ways than one…
reply
penguin_booze
18 days ago
[-]
Now only fans, huh?
reply
amelius
20 days ago
[-]
> Cores are great, but it's all about how you slice them. Don't think of this as a single 192-core server. Think of it more like 48 dedicated 4-core servers in one box. And each of those servers has 10 gigs of high-speed RAM and consistent performance.

Suddenly it sounds less impressive ...

reply
kevingadd
20 days ago
[-]
To be fair, utilizing 192 cores for a single process or operation is often exceedingly difficult. Scheduling and coordination and resource sharing are all really hard with thread counts that high, so you're probably best operating in terms of smaller clusters of 4-16 threads instead. Lots of algorithms stop scaling well around the 8-32 range.
reply
WorkerBee28474
20 days ago
[-]
> Lots of algorithms stop scaling well around the 8-32 range.

For the curious: https://en.wikipedia.org/wiki/Amdahl's_law
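
A quick worked example, assuming a workload that is 95% parallelizable:

  def speedup(p, n):            # Amdahl: 1 / ((1 - p) + p / n)
      return 1 / ((1 - p) + p / n)

  for n in (8, 16, 32, 192):
      print(n, round(speedup(0.95, n), 1))   # 5.9, 9.1, 12.5, 18.2

Past a few dozen cores the serial 5% dominates, which is why carving the box into many small slices works better than one 192-wide job.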

reply
bangaladore
20 days ago
[-]
I think a simple way to think about this is with more cores comes more scheduling complexity. That's without even considering how well you can parallelize your actual tasks.

If instead we break 128 cores into 4-core systems, scheduling becomes much less complex.

reply
chasil
20 days ago
[-]
I use xargs -P every weekend to back up my (Oracle) database.

It has 510 files that are around 2gb, and the parallel script uses rman to make a datafile copy on each one, lzip it, then scp it to my backup server.

I have xargs set to run 10 at once. Could I increase to 192? Yes.
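
For comparison, the same shape as a bounded pool in Python (paths, hosts and the rman wrapper are placeholders; the real thing stays in xargs):

  import subprocess
  from concurrent.futures import ThreadPoolExecutor
  from pathlib import Path

  def backup_one(datafile: Path):
      staged = Path("/backup/staging") / datafile.name
      subprocess.run(["rman_copy", str(datafile), str(staged)], check=True)  # hypothetical wrapper around rman
      subprocess.run(["lzip", "-9", str(staged)], check=True)
      subprocess.run(["scp", f"{staged}.lz", "backup-host:/backups/"], check=True)

  files = sorted(Path("/u01/oradata").glob("*.dbf"))   # the ~510 datafiles
  with ThreadPoolExecutor(max_workers=10) as pool:     # bump toward 192 and only the lzip step really gains
      list(pool.map(backup_one, files))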

reply
bee_rider
20 days ago
[-]
So the part that would actually benefit from the 192 cores would just be the lzip, right?
reply
chasil
19 days ago
[-]
Oracle enterprise database actually has a retail license cost of $47,500 per cpu core. On x86, there is a two-for-one license discount.

In my setup, I have an nfs automount set up from a 12-core machine to my 8-core database. The rman backup and the scp happen locally, but yes, the lzip runs on the compute server.

reply
griomnib
20 days ago
[-]
You can saturate your network link as well. I do like these improvements, but I’m old enough to know it’s always a game of “move the bottleneck”!
reply
stonemetal12
20 days ago
[-]
When was the last time you saw a server with 10GB of RAM, no matter the number of cores/threads?
reply
com2kid
20 days ago
[-]
I've run plenty of microservices with 256 or 512 GB of RAM, and they were handling large loads. So long as each request is short lived, and using a runtime with low per request overhead (e.g. Node), memory is not really a problem for many types of workloads.
reply
spott
20 days ago
[-]
Just to be clear, you meant MB of RAM, right?
reply
com2kid
20 days ago
[-]
Oops, yeah. MB.
reply
worthless-trash
20 days ago
[-]
I know it's a typo, but you have accidentally coined the term megaservice.
reply
Suppafly
20 days ago
[-]
>When was the last time you saw a server with 10GB Ram no matter the number of cores\threads?

Is that a lot or a little? I have a bunch that only have 8gb, it just depends on what they are being used for.

reply
geerlingguy
20 days ago
[-]
My two primary webservers are running on 2GB of RAM still... it depends on the needs of your application :)
reply
Dylan16807
20 days ago
[-]
If you use the 1U model you can put about 1800 of those basic servers into a single rack. I think that's pretty impressive.

And they have a dual processor 1U too...

reply
sixothree
20 days ago
[-]
I'm not understanding how the 10 GB of RAM gets assigned to a core. Is this just his way to describe memory channels corresponding to the cores?
reply
wmf
20 days ago
[-]
It's not assigned at the hardware level (although Ampere has memory QoS) but you can assign RAM at the VM level which is what these CPUs are intended for.
reply
sixothree
7 days ago
[-]
Thank you.
reply
astrodust
20 days ago
[-]
The article states $5,550USD or so.
reply
torginus
20 days ago
[-]
Now, what percentage of companies could get away with sticking a pair of these (for redundancy) and run their entire operation off of it?
reply
alecco
20 days ago
[-]
The problem is most current systems are not designed for these new architectures. It's like distributed programming but in local silicon.
reply
zackmorris
15 days ago
[-]
I'm late to the party due to work and holiday gatherings, but just wanted to say that this is the first glimmer of hope on the horizon in 25 years.

The slowing down of Moore's law was well-understood by the time I was graduating from college with my Electrical and Computer Engineering (ECE) degree in 1999. The DEC Alpha was 700 MHz and they had GHz chips in labs, but most of us assumed that getting past 3-4 GHz was going to be difficult or impossible due to the switching limit of silicon with respect to pipeline stages. Also, on-chip interconnect had grown from a square profile to a ribbon shape that was taller than wide, which requires waveguide analysis. And features weren't shrinking anymore, they were just adding more layers, which isn't N^2 scaling. The end arrived just after 2007 when smartphones arrived, and the world chose affordability and efficiency over performance. Everyone hopped on the GPU bandwagon and desktop computing was largely abandoned for the next 20 years.

But I've been hearing more about High-Performance Computing (HPC) recently. It's multicore computing with better languages, which is the missing link between CPU and GPU. What happened was, back in the 90s, game companies pushed for OpenGL on GPUs without doing the real work of solving auto-parallelization of code (with no pragmas/intrinsics/compiler hints etc) for multicore computers first. We should have had parallel multicore, then implemented matrix and graphics libraries over that, with OpenGL as a thin wrapper over that DSP framework. But we never got it, so everyone has been manually transpiling their general solutions to domain-specific languages (DSLs) like GLSL, HLSL and CUDA, at exorbitant effort and cost with arbitrary limitations. Established players like Nvidia don't care about this, because the status quo presents barriers to entry to competitors who might disrupt their business model.

Anyway, since HPC is a general computing approach, it can be the foundation for the proprietary and esoteric libraries like Vulkan, Metal, etc. Which will democratize programming and let hobbyists more easily dabble in the dozen or so alternatives to Neural Nets. Especially Genetic Algorithms (GAs), which can automatically derive the complex hand-rolled implementations like Large Language Models (LLMs) and Stable Diffusion. We'll also be able to try multiple approaches in simulation and it will be easier to run a high number of agents simultaneously so we can study their interactions and learning models in a transparent and repeatable fashion akin to using reproducible builds and declarative programming.

That's how far John Koza and others got with GAs in the mid-2000s before they had to abandon them due to cost. Now, this AmpereOne costs about an order of magnitude more than would be required for a hobbyist 256 core computer. But with inflation, $5-10,000 isn't too far off from the price of desktop computers in the 80s. So I expect that prices will come down over 3-5 years as more cores and memory are moved on-chip. Note that we never really got a System on a Chip (SoC) either, so that's some more real work that remains. We also never got reconfigurable hardware with on-chip FPGA units that could solve problems with a divide and conquer strategy, a bit like Just in Time (JIT) compilers.

I had detailed plans in my head for all of this by around 2010. Especially on the software side, with ideas for better languages that combine the power of Octave/MATLAB with the approachability of Go/Erlang. The hardware's pretty easy other than needing a fab, which presents a $1-10 billion barrier. That's why it's so important that AmpereOne exists as a model to copy.

The real lesson for me is that our imaginations can manifest anything, but the bigger the dream, the longer it takes for the universe to deliver it. On another timeline, I might have managed it in 10 years if I had won the internet lottery, but instead I had to wait 25 as I ran the rat race to make rent. Now I'm too old and busted so it's probably too late for me. But if HPC takes off, I might find renewed passion for tech. Since this is the way we deliver Universal Basic Income (UBI) and moneyless resources through 3D printing etc outside of subscription-based AI schemes.

reply
AnimalMuppet
15 days ago
[-]
1. What languages do you think are good ones for HPC?

2. Re GAs, Koza, and cost:

In Koza's book on GAs, in one of the editions, he mentions that they scaled the performance of their research system by five orders of magnitude in a decade. What he didn't mention was that they did it by going from one Lisp Machine to 1000 PCs. They only got two orders of magnitude from per-unit performance; the rest came from growing the number of units.

Of course they couldn't keep scaling that way for cost reasons. Cost, and space, and power, and several other limiting factors. They weren't going to have a million machines a decade later, or a billion machines a decade after that. To the degree that GAs (or at least their approach to them) required that in order to keep working, to that degree their approach was not workable.

reply
zackmorris
13 days ago
[-]
> 2. Re GAs, Koza, and cost:

Ya that's a really good point about the linear scaling limits of genetic algorithms, so let me provide some context:

Where I disagree with the chip industry is that I grew up on a Mac Plus with a 1979 Motorola 68000 processor with 68,000 transistors (not a typo) that ran at 8 MHz and could get real work done - as far as spreadsheets and desktop publishing - arguably more easily than today. So to me, computers waste most of their potential now:

https://en.wikipedia.org/wiki/Transistor_count

As of 2023, Apple's M2 Max had 67 billion transistors running at 3.7 GHz and it's not even the biggest on the list. Although it's bigger than Nvidia's A100 (GA100 Ampere) at 54 billion, which is actually pretty impressive.

If we assume this is all roughly linear, we have:

  year  count   speed
  1979  6.8e4   8e6
  2023  6.7e10  3.7e9
So the M2 should have (~1 million) * (~500) = 500 million times the computing power of the 68000 over 44 years, or Moore's law applied 29 times (44 years/18 months ~= 29).

Are computers today 500 million times faster than a Mac Plus? It's a ridiculous question and the answer is self-evidently no. Without the video card, they are more like 500 times faster in practice, and almost no faster at all in real-world tasks like surfing the web. This leads to some inconvenient questions:

  - Where did the factor of 1 million+ speedup go?
  - Why have CPUs seemingly not even tried to keep up with GPUs?
  - If GPUs are so much better able to recruit their transistor counts, why can't they even run general purpose C-style code and mainstream software like Docker yet?
Like you said and Koza found, the per-unit speed only increased 100 times in the decade of the 2000s by Moore's law because 2^(10 years/18 months) ~= 100. If it went up that much again in the 2010s (it didn't), then that would be a 10,000 times speedup by 2020, and we could extrapolate that each unit today would run about 2^(24 years/18 months) ~= 100,000 times faster than that original lisp machine. A 1000 unit cluster would run 100 million or 10^8 times faster, so research on genetic algorithms could have continued.

But say we wanted to get back into genetic algorithms and the industry gave computer engineers access to real resources. We would design a 100 billion transistor CPU with around 1 million cores having 100,000 transistors each, costing $1000. Or if you like, 10,000 Pentium Pro or PowerPC 604 cores on a chip at 10 million transistors each.

It would also have a content-addressable memory so core-local memory appears as a contiguous address space. It would share data between cores like how BitTorrent works. So no changes to code would be needed to process data in-cluster or distributed around the web.

So that's my answer: computers today run thousands or even millions of times slower than they should for the price, but nobody cares because there's no market for real multicore. Computers were "good enough" by 2007 when smartphones arrived, so that's when their evolution ended. Had their evolution continued, then the linear scaling limits would have fallen away under the exponential growth of Moore's law, and we wouldn't be having this conversation.

> 1. What languages do you think are good ones for HPC?

That's still an open question. IMHO little or no research has happened there, because we didn't have the real multicore CPUs mentioned above. For a little context:

Here are some aspects of good programming languages;

https://www.linkedin.com/pulse/key-qualities-good-programmin...

https://www.chakray.com/programming-languages-types-and-feat...

  Abstractability
  Approachability
  Brevity
  Capability
  Consistency
  Correctness
  Efficiency
  Interactivity
  Locality
  Maintainability
  Performance
  Portability
  Productivity
  Readability
  Reliability
  Reusability
  Scalability
  Security
  Simplicity
  Testability
Most languages only shine for a handful of these. And some don't even seem to try. For example:

OpenCL is trying to address parallelization in all of the wrong ways, by copying the wrong approach (CUDA). Here's a particularly poor example, the top hit on Google:

Calculate X[i] = pow(X[i],2) in 200-ish lines of code:

https://github.com/rsnemmen/OpenCL-examples/blob/master/Hell...

Octave/MATLAB addresses parallelization in all of the right ways, through the language of spreadsheets and matrices:

Calculate X[i] = pow(X[i],2) in 1 line of code as x .^ y or power(x, y):

https://docs.octave.org/v4.4.0/Arithmetic-Ops.html#XREFpower

But unfortunately neither OpenCL nor Octave/MATLAB currently recruit multicore or cluster computers effectively. I think there was research in the 80s and 90s on that, but the languages were esoteric or hand-rolled. Basically hand out a bunch of shell scripts to run and aggregate the results. It all died out around the same time as Beowulf cluster jokes did:

https://en.wikipedia.org/wiki/Computer_cluster

Here's one called Emerald:

https://medium.com/@mwendakelvinblog/emerald-the-language-of...

http://www.emeraldprogramminglanguage.org

https://emeraldlang.github.io/emerald/

After 35 years of programming experience, here's what I want:

  - Multi-element operations by a single operator like in Octave/MATLAB or shader languages like HLSL
  - All variables const (no mutability) with get/set between one-shot executions like the suspend-the-world io of ClojureScript and REDUX state
  - No monads/futures/async or borrow checker like in Rust (const negates their need, just use 2x memory rather than in-place mutation)
  - Pass-by-value copy-on-write semantics for all arguments like with PHP arrays (pass-by-reference classes broke PHP 5+)
  - Auto-parallelization of flow control and loops via static analysis of intermediate code (no pragmas/intrinsics, const avoids side effects)
  - Functional and focused on higher-order methods like map/reduce/filter, de-emphasize Java-style object-oriented classes
  - Smart collections like JavaScript classes "x.y <=> x[y]" and PHP arrays "x[y] <=> x[n]", instead of "pure" set/map/array
  - No ban on multiple inheritance, no final keyword, let the compiler solve inheritance constraints
  - No imports, give us everything and the kitchen sink like PHP, let the compiler strip unused code
  - Parse infix/prefix/postfix notation equally with a converter like goformat, with import/export to spreadsheet and (graph) database
It would be a pure functional language (impurity negates the benefits of cryptic languages like Haskell), easier to read and more consistent than JavaScript, with simple fork/join like Go/Erlang and single-operator math on array/matrix elements like Octave/Matlab. Kind of like standalone spreadsheets connected by shell pipes over the web, but as code in a file. Akin to Jupyter notebooks I guess, or Wolfram Alpha, or literate programming.

Note that this list flies in the face of many best practices today. That's because I'm coming from using languages like HyperTalk that cater to the user rather than the computer.

And honestly I'm trying to get away from languages. I mostly use #nocode spreadsheets and SQL now. I wish I could make cross-platform apps with spreadsheets (a bit like Airtable and Zapier) and had a functional or algebra of sets interface for databases.

It would take me at least a year or two and $100,000-250,000 minimum to write an MVP of this language. It's simply too ambitious to make in my spare time.

Sorry about length and delay on this!

reply
zackmorris
12 days ago
[-]
After sleeping on this, I realized that I forgot to connect why the aspects of good programming languages are important for parallel programming. It's because it's already so difficult, why would we spend unnecessary time working around friction in the language? If we have 1000 or 1 million times the performance, let the language be higher level so the compiler can worry about optimization. I-code can be simplified using math approaches like invariance and equivalence. Basically turning long sequences of instructions into a result and reusing that with memoization. That's how functional programming lazily evaluates code on demand. By treating the unknown result as unsolved and working up the tree algebraically, out of order even. Dependent steps can be farmed out to cores to wait until knowns are solved, then substitute those and solve further. So even non-embarrassingly parallel code can be parallelized in a divide and conquer strategy, limited by Amdahl's Law of course.

I'm concerned that this solver is not being researched enough before machine learning and AI arrive. We'll just gloss over it like we did by jumping from CPU to GPU without the HPC step between them.

At this point I've all but given up on real resources being dedicated to this. I'm basically living on another timeline that never happened, because I'm seeing countless obvious missed steps that nobody seems to be talking about. So it makes living with mediocre tools painful. I spend at least 90% of my time working around friction that doesn't need to be there. In a very real way, it negatively impacts my life, causing me to lose years writing code manually that would have been point and click back in the Microsoft Access and FileMaker days, that people don't even think about when they get real work done with spreadsheets.

TL;DR: I want a human-readable language like HyperTalk that's automagically fully optimized across potentially infinite cores, between where we are now with C-style languages and the AI-generated code that will come from robotic assistants like J.A.R.V.I.S.

reply
binary132
20 days ago
[-]
not when your throughput is dictated by how fast you can process a single task
reply