Solaris seemed to be much more solid than Linux overall.
Solaris 10 was intended as a catch-up release but the package manager took years to mature. I remember getting racks of systems in the mid-2000s where simply running the updater on brand new servers rendered them unbootable. By then the writing was on the wall.
SPARC was similar: the hardware had some neat aspects, but what it really needed was the equivalent of a first-class LLVM backend. Statistically, nobody was hand-tuning assembly to try to keep it competitive with x86, especially since that meant tuning not just your own code but most of the libraries you depend on. The reason ARM is doing so well now is that after two decades in which the most popular computing devices are phones, you can just assume that anything popular will compile and run well on ARM.
Like, to update cron. Even across three versions of the same patch ID, it alternately does and does not fix cron.
Pure insanity with no product focus on user experience.
[1] https://azure.microsoft.com/en-us/blog/azure-virtual-machine...
Nearly all programs have a constant memory usage regardless of how fast the core running them is. If you deliberately make your cores half the speed, and double the amount of them, you have approximately doubled the cost of the memory you need. In aggregate, memory already costs so much more than CPUs that this is rarely if ever useful, even if it meant your cpu is free.
The starting point of a design that tiles a lot of cores just needs to be one of the fastest cores available, or it is not commercially viable.
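A back-of-the-envelope sketch of that argument, with made-up but plausible numbers (the per-worker memory and the prices are my assumptions, not figures from anywhere in this thread):

    # Toy model: fixed total throughput, one worker pinned per core, and each
    # worker needs the same RAM no matter how fast its core is.
    GB_PER_WORKER = 4            # assumed per-process working set
    PRICE_PER_GB = 4.0           # assumed $/GB of server DRAM
    PRICE_PER_FAST_CORE = 30.0   # assumed $/core for the fast design
    PRICE_PER_SLOW_CORE = 15.0   # assumed half price for a half-speed core

    def system_cost(cores, price_per_core):
        ram_gb = cores * GB_PER_WORKER
        return cores * price_per_core + ram_gb * PRICE_PER_GB

    fast = system_cost(192, PRICE_PER_FAST_CORE)  # 192 fast cores
    slow = system_cost(384, PRICE_PER_SLOW_CORE)  # 384 half-speed cores, same throughput
    print(f"fast-core box: ${fast:,.0f}   slow-core box: ${slow:,.0f}")

Even with the slow cores priced at half, the doubled DRAM dominates and the slow-core box comes out more expensive, which is the point being made above.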
> Also, with 512 gigs of RAM and a massive CPU, it can run a 405 billion parameter Large Language Model. It's not fast, but it did run, giving me just under a token per second.
If you're serious about running LLMs and you can afford it, you'll of course want GPUs. But this might be a relatively affordable way to run really huge models like Llama 405B on your own hardware. This could be even more plausible on Ampere's upcoming 512-core CPU, though RAM bandwidth might be more of a bottleneck than CPU cores. Probably a niche use case, but intriguing.
For some people, they lack the resources to run an LLM on a GPU. For others, they want to try certain models without buying thousands of dollars of equipment just to try things out.
Either way, I see too many people putting the proverbial cart before the horse: they buy a video card, then try to fit LLMs into the limited VRAM they have, instead of playing around, even if at 1/10th the speed, and figuring out which models they want to run before deciding where they want to invest their money.
One token a second is worlds better than running nothing at all because someone told you that you shouldn't or can't because you don't have a fancy, expensive GPU.
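For anyone who wants to try exactly this kind of CPU-only experimentation, a minimal sketch using the llama-cpp-python bindings looks something like this (the model path and thread count are placeholders, not anything from the article):

    from llama_cpp import Llama

    # CPU-only inference: no GPU required, just RAM and patience.
    llm = Llama(
        model_path="models/llama-3.1-70b-instruct.Q4_K_M.gguf",  # hypothetical local GGUF file
        n_ctx=4096,     # context window
        n_threads=96,   # tune to your core count; more isn't always faster
    )

    out = llm("Explain why CPU-only inference can still be useful:", max_tokens=128)
    print(out["choices"][0]["text"])

It will be slow on big models, but it's enough to find out whether a given model is even worth spending GPU money on.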
Most people have a usable iGPU; it's going to run most models significantly slower (because of less available memory throughput, and/or more of it being wasted on padding, compared to the CPU) but a lot cooler than the CPU. NPUs will likely be a similar story.
It would be nice if there was an easy way to only run the initial prompt+context processing (which is generally compute bound) on iGPU+NPU, but move to CPU for the token generation stage.
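As far as I know there's no turnkey way to do that split today; the closest existing knob in llama.cpp (via the Python bindings) is partial layer offload, sketched below. It's not the prefill-on-iGPU / decode-on-CPU scheme described above, and the path and layer count are placeholders:

    from llama_cpp import Llama

    # Offload some transformer layers to the GPU backend (e.g. a Vulkan build
    # targeting the iGPU); the remaining layers run on the CPU.
    llm = Llama(
        model_path="models/some-model.Q4_K_M.gguf",  # hypothetical file
        n_gpu_layers=20,   # layers pushed to the iGPU; 0 = pure CPU
        n_threads=16,
    )
    print(llm("Hello", max_tokens=16)["choices"][0]["text"])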
It's a minuscule pittance, on hardware that costs as much as an AmpereOne.
Yes, it is.
But it has not been updated for seven months. Do things change so slowly?
Always funny to see 240V power described as exotic (or at least exotic-adjacent). It's the standard across practically the entire world except North and Central America!
https://en.wikipedia.org/wiki/Split-phase_electric_power
This is different from Europe where we usually distribute 3 phases 130V.
Most homes don't have more than a few 240V circuits, and they are almost always dedicated to a single device. My house has 4: range, oven, dryer, water heater, all with only a single outlet. The concept of wiring split-phase 240V is not difficult, but getting 4-conductor 12-gauge wiring run to a new location in an existing building is not exactly easy.
So the power delivery type/format _isn't_ exotic, but the ability to use it in arbitrary locations in a [residential] building is, compared to 120V being available in every room (though perhaps not 120V/20A in every room).
As a sidenote, my apartment has a 3x63A main fuse. Three-phase is everywhere here. Really convenient for EV charging too.
Meanwhile, also here in the States: The wiring for my very normal (240v) electric range is the size of a baby's arm and might be even stiffer. (That last claim may be a bit hyperbolic: I've bent at least my share of 6/3 wire, but I don't think I've ever bent a baby's arm.)
But other than that, we are only two years away from getting this into our hands: a single server/node with 512 vCPUs, or maybe 1024 vCPUs if we go dual-socket, along with PCIe 6.0 SSDs. We could rent one or two of these servers and forget about scaling issues for 95%+ of use cases.
Not to mention nearly all dynamic languages have gotten a lot faster with JITs.
It isn't mentioned anywhere in __this__ article, but the power draw of that chip is 276W. I got this from Phoronix [1]:
> The AmpereOne A192-32X boasts 192 AmpereOne cores, a 3.2GHz clock frequency, a rated 276 Watt usage power
Which is interesting because it's almost half of AMD's 192-core offering [2].
Why is this interesting? The AMD offering draws a little less than double the wattage but has hyper-threading (!!!), meaning you get 384 threads... So AMD is essentially on par with ARM CPUs (at least with Ampere's ARM CPUs) in terms of power efficiency... maybe a little better.
I'd be more inclined to think that AMD is the king/queen of power efficiency in the datacenter rather than ARM/Ampere [3].
notes:
[1]: https://www.phoronix.com/review/ampereone-a192-32x
[2]: https://www.youtube.com/watch?v=S-NbCPEgP1A
[3]: regarding Graviton/Axiom or "alleged" server-class Apple silicon... their power draw is essentially irrelevant, as these are all claims that cannot be tested and evaluated independently, so they don't count in my opinion.
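Rough per-thread math behind that claim, using the 276 W Phoronix figure and assuming roughly 500 W for the EPYC ("a little less than double"), which is my own rounding rather than a measured number:

    # Naive watts-per-hardware-thread comparison; ignores real per-workload efficiency.
    ampere_watts, ampere_threads = 276, 192   # AmpereOne A192-32X: no SMT
    epyc_watts, epyc_threads = 500, 384       # EPYC 9965: 192 cores x 2 SMT threads (assumed ~500 W)

    print(f"AmpereOne: {ampere_watts / ampere_threads:.2f} W/thread")  # ~1.44
    print(f"EPYC 9965: {epyc_watts / epyc_threads:.2f} W/thread")      # ~1.30

Of course an SMT sibling thread isn't worth a full core, so this only says the two are in the same ballpark, not that AMD strictly wins.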
It seems the SMT uplift in some benchmarks is as high as 50% with Zen 5.
AMD really improved HT with Zen 5 by adding a second instruction decoder per core. I expect even better per-thread performance from Zen 6.
Ampere doesn't do SMT, so you get consistent per-core performance across the board.
> "Arm is more efficient": that's not always true—AMD just built the most efficient 192 core server this year, beating Ampere
I also talk about idle power draw being very high in comparison to AMD.
Since I don't have a Turin system to test I didn't do direct wattage comparisons but recommended the linked Phoronix comparison.
But also this: "The big difference is the AmpereOne A192-32X is $5,555, while the EPYC 9965 is almost $15,000!"
So you have to look at TCO over a period of years, but that's still quite significant.
He once told me that he also had a rack of thongs. Now, the thongs didn't have a lot of material, obviously, and were inexpensive to manufacture, so if he'd do his regular markup on them they'd end up selling for $15.
"However, notice the price tag is $25. I add an extra $10, because if a guy's looking to buy a thong, he's going to buy a thong".
I think about what he said when I see the chip prices on these high-end server CPUs.
> So you have to look at TCO over a period of years, but that's still quite significant.
Your point is valid. However, if we're talking TCO, we should also factor in rack space, power (and power efficiency), cooling, cabling, ports on the ToR switch, and so on.
The AMD solution halves pretty much everything around the CPU, and in order to get the same thread count you should really compare 2x$5k vs $15k.
I get that SMT does have performance hits, but such penalties mostly show up when you're doing number crunching all day, every day. In real-world scenarios, where software spends a lot of time waiting either for disk I/O or network I/O, I would expect that to be a non-issue.
AMD's offering is still competitive, in my opinion.
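To make that concrete, here's a toy TCO sketch. Every number except the two list prices quoted upthread is a placeholder (electricity price, wall power, rack/port cost), so treat it as showing the shape of the comparison rather than a conclusion:

    # Compare 2x AmpereOne boxes (to match thread count) against 1x EPYC 9965 box over 5 years.
    YEARS = 5
    KWH_PRICE = 0.12                  # assumed $/kWh
    HOURS = 24 * 365 * YEARS

    def tco(list_price, wall_watts, boxes):
        capex = list_price * boxes
        energy = (wall_watts / 1000) * HOURS * KWH_PRICE * boxes
        rack = 500 * YEARS * boxes    # assumed $/server/year for rack space, port, cabling
        return capex + energy + rack

    ampere = tco(5_555, 400, boxes=2)   # assumed ~400 W at the wall per whole server
    epyc = tco(15_000, 700, boxes=1)    # assumed ~700 W at the wall for the bigger box
    print(f"2x AmpereOne: ${ampere:,.0f}   1x EPYC 9965: ${epyc:,.0f}")

With these made-up operating costs the two land within a few percent of each other, which is why the surrounding costs matter as much as the CPU list price.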
You can do all the benchmarks you want, but if you don't factor in price, then of course the more expensive product (in general) is going to be better.
It is the same thing with the Snapdragon Elite and Mediatek Dimensity 9400. The SDE is the faster processor and more expensive.
So you’re saying ‘essentially’ an AMD thread is the same as an Ampere core?!
Off the top of my head, the Ampere 128 core was 250W at max load. Big difference.
Only issue I had was cache size, and you could only buy them from NewEgg.
Oh… and the dual-socket Gigabyte board had the sockets so close together that if you put a big-ass heat sink and fan on each, the outlet of one fan goes straight into the inlet of the other!
The min on the A192 does seem anomalously high though. That said, even the M128-30 has a min of 21 W, or about twice that of the AmpereOne.
I would be hard pressed to believe that another EPYC would be able to significantly reduce I/O die power consumption, given that AMD hasn't really made many major changes to it over the generations.
Maybe I should wait another 18-month cycle.
Isn't SMT turned off in shared-tenant scenarios now due to security concerns?
Why not? We hit the single thread wall more than a decade ago. Why isn’t more software parallelized by now?
In traditional web serving tasks and particularly for commercial cloud providers (usually renting out VM time), this isn't the case, and so I'm sure Ampere will be successful in that space.
> Architectural design: The physical layout of the chip may favor certain core arrangements. For example, a 48-core processor could be designed as a 6x8 grid, which efficiently utilizes the square shape of the silicon die.[1]
Major Deep Thought vibes.
[1] https://www.perplexity.ai/search/why-would-a-cpu-have-48-cor...
Who's going to try this? Not Netflix but maybe Cloudflare or a CDN?
I agree. It's too bad the MHz race ended though. We personal PC users benefit more from raw performance.
Probably a moot point, but most computers leave a lot of performance on the table due to poor software optimization and not taking advantage of the system's full capabilities, because software has to work on essentially all computers rather than being laser-focused on a specific hardware set.
Is the estimate that it is cheaper based on a comparable server with the same or similar core count?
> the AmpereOne A192-32X is $5,555, while the EPYC 9965 is almost $15,000!
Where do you see the EPYC 9654 being sold for $3k? The only price I can find online is about the same as the A192-32X, around $5k.
Update: "Over the last two years, more than 50 percent of all the CPU capacity landed in our datacenters was on AWS Graviton." So not half of all but getting there. https://www.nextplatform.com/2024/12/03/aws-reaps-the-benefi...
"But that total is beaten by just one company – Amazon – which has slightly above 50 percent of all Arm server CPUs in the world deployed in its Amazon Web Services (AWS) datacenters, said the analyst."[0]
[0]: https://www.theregister.com/2023/08/08/amazon_arm_servers/
Suddenly it sounds less impressive ...
For the curious: https://en.wikipedia.org/wiki/Amdahl's_law
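For reference, a minimal illustration of Amdahl's law applied to a 192-core part (the parallel fractions are arbitrary examples):

    def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
        # Ideal speedup when only parallel_fraction of the work can use all cores.
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)

    for p in (0.50, 0.90, 0.99):
        print(f"{p:.0%} parallel on 192 cores -> {amdahl_speedup(p, 192):.1f}x")
    # 50% parallel -> ~2x, 90% -> ~9.5x, 99% -> ~66x (nowhere near 192x)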
If instead we break 128 cores into 4-core systems, scheduling becomes much less complex.
It has 510 files that are around 2GB each, and the parallel script uses RMAN to make a datafile copy of each one, lzip it, then scp it to my backup server.
I have xargs set to run 10 at once. Could I increase to 192? Yes.
In my setup, I have an nfs automount set up from a 12-core machine to my 8-core database. The rman backup and the scp happen locally, but yes, the lzip runs on the compute server.
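The same fan-out pattern, sketched in Python rather than the actual rman + xargs setup, where MAX_PARALLEL plays the role of xargs -P 10 and the paths/commands are placeholders:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor
    from pathlib import Path

    MAX_PARALLEL = 10  # same idea as `xargs -P 10`; could be raised toward the core count

    def backup_one(datafile: Path) -> str:
        # Placeholder pipeline: compress the RMAN copy, then ship it to the backup host.
        subprocess.run(["lzip", "-k", str(datafile)], check=True)
        subprocess.run(["scp", f"{datafile}.lz", "backup-server:/backups/"], check=True)
        return datafile.name

    datafiles = sorted(Path("/backups/staging").glob("*.dbf"))  # hypothetical copies made by RMAN
    with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
        for name in pool.map(backup_one, datafiles):
            print("done:", name)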
Is that a lot or a little? I have a bunch that only have 8gb, it just depends on what they are being used for.
And they have a dual processor 1U too...
The slowing down of Moore's law was well-understood by the time I was graduating from college with my Electrical and Computer Engineering (ECE) degree in 1999. The DEC Alpha was 700 MHz and they had GHz chips in labs, but most of us assumed that getting past 3-4 GHz was going to be difficult or impossible due to the switching limit of silicon with respect to pipeline stages. Also, on-chip interconnect had grown from a square profile to a ribbon shape that was taller than wide, which requires waveguide analysis. And features weren't shrinking anymore, they were just adding more layers, which isn't N^2 scaling. The end arrived just after 2007 when smartphones arrived, and the world chose affordability and efficiency over performance. Everyone hopped on the GPU bandwagon and desktop computing was largely abandoned for the next 20 years.
But I've been hearing more about High-Performance Computing (HPC) recently. It's multicore computing with better languages, which is the missing link between CPU and GPU. What happened was, back in the 90s, game companies pushed for OpenGL on GPUs without doing the real work of solving auto-parallelization of code (with no pragmas/intrinsics/compiler hints etc) for multicore computers first. We should have had parallel multicore, then implemented matrix and graphics libraries over that, with OpenGL as a thin wrapper over that DSP framework. But we never got it, so everyone has been manually transpiling their general solutions to domain-specific languages (DSLs) like GLSL, HLSL and CUDA, at exorbitant effort and cost with arbitrary limitations. Established players like Nvidia don't care about this, because the status quo presents barriers to entry to competitors who might disrupt their business model.
Anyway, since HPC is a general computing approach, it can be the foundation for the proprietary and esoteric libraries like Vulkan, Metal, etc. Which will democratize programming and let hobbyists more easily dabble in the dozen or so alternatives to Neural Nets. Especially Genetic Algorithms (GAs), which can automatically derive the complex hand-rolled implementations like Large Language Models (LLMs) and Stable Diffusion. We'll also be able to try multiple approaches in simulation, and it will be easier to run a high number of agents simultaneously so we can study their interactions and learning models in a transparent and repeatable fashion, akin to using reproducible builds and declarative programming.
That's how far John Koza and others got with GAs in the mid-2000s before they had to abandon them due to cost. Now, this AmpereOne costs about an order of magnitude more than would be required for a hobbyist 256-core computer. But with inflation, $5-10,000 isn't too far off from the price of desktop computers in the 80s. So I expect that prices will come down over 3-5 years as more cores and memory are moved on-chip. Note that we never really got a System on a Chip (SoC) either, so that's some more real work that remains. We also never got reconfigurable hardware with on-chip FPGA units that could solve problems with a divide-and-conquer strategy, a bit like Just in Time (JIT) compilers.
I had detailed plans in my head for all of this by around 2010. Especially on the software side, with ideas for better languages that combine the power of Octave/MATLAB with the approachability of Go/Erlang. The hardware's pretty easy other than needing a fab, which presents a $1-10 billion barrier. That's why it's so important that AmpereOne exists as a model to copy.
The real lesson for me is that our imaginations can manifest anything, but the bigger the dream, the longer it takes for the universe to deliver it. On another timeline, I might have managed it in 10 years if I had won the internet lottery, but instead I had to wait 25 as I ran the rat race to make rent. Now I'm too old and busted so it's probably too late for me. But if HPC takes off, I might find renewed passion for tech. Since this is the way we deliver Universal Basic Income (UBI) and moneyless resources through 3D printing etc outside of subscription-based AI schemes.
2. Re GAs, Koza, and cost:
In Koza's book on GAs, in one of the editions, he mentions that they scaled the performance of their research system by five orders of magnitude in a decade. What he didn't mention was that they did it by going from one Lisp Machine to 1000 PCs. They only got two orders of magnitude from per-unit performance; the rest came from growing the number of units.
Of course they couldn't keep scaling that way for cost reasons. Cost, and space, and power, and several other limiting factors. They weren't going to have a million machines a decade later, or a billion machines a decade after that. To the degree that GAs (or at least their approach to them) required that in order to keep working, to that degree their approach was not workable.
Ya that's a really good point about the linear scaling limits of genetic algorithms, so let me provide some context:
Where I disagree with the chip industry is that I grew up on a Mac Plus with a 1979 Motorola 68000 processor with 68,000 transistors (not a typo) that ran at 8 MHz and could get real work done - as far as spreadsheets and desktop publishing - arguably more easily than today. So to me, computers waste most of their potential now:
https://en.wikipedia.org/wiki/Transistor_count
As of 2023, Apple's M2 Max had 67 billion transistors running at 3.7 GHz and it's not even the biggest on the list. Although it's bigger than Nvidia's A100 (GA100 Ampere) at 54 billion, which is actually pretty impressive.
If we assume this is all roughly linear, we have:
year count speed
1979 6.8e4 8e6
2023 6.7e10 3.7e9
So the M2 should have (~1 million) * (~500) = 500 million times the computing power of the 68000 over 44 years, or Moore's law applied 29 times (44 years/18 months ~= 29).

Are computers today 500 million times faster than a Mac Plus? It's a ridiculous question and the answer is self-evidently no. Without the video card, they are more like 500 times faster in practice, and almost no faster at all in real-world tasks like surfing the web. This leads to some inconvenient questions:
- Where did the factor of 1 million+ speedup go?
- Why have CPUs seemingly not even tried to keep up with GPUs?
- If GPUs are so much better able to recruit their transistor counts, why can't they even run general purpose C-style code and mainstream software like Docker yet?
Like you said and Koza found, the per-unit speed only increased 100 times in the decade of the 2000s by Moore's law, because 2^(10 years/18 months) ~= 100. If it went up that much again in the 2010s (it didn't), then that would be a 10,000 times speedup by 2020, and we could extrapolate that each unit today would run about 2^(24 years/18 months) ~= 100,000 times faster than that original Lisp machine. A 1000-unit cluster would run 100 million or 10^8 times faster, so research on genetic algorithms could have continued.

But say we wanted to get back into genetic algorithms and the industry gave computer engineers access to real resources. We would design a 100 billion transistor CPU with around 1 million cores having 100,000 transistors each, costing $1000. Or if you like, 10,000 Pentium Pro or PowerPC 604 cores on a chip at 10 million transistors each.
It would also have a content-addressable memory so core-local memory appears as a contiguous address space. It would share data between cores like how BitTorrent works. So no changes to code would be needed to process data in-cluster or distributed around the web.
So that's my answer: computers today run thousands or even millions of times slower than they should for the price, but nobody cares because there's no market for real multicore. Computers were "good enough" by 2007 when smartphones arrived, so that's when their evolution ended. Had their evolution continued, then the linear scaling limits would have fallen away under the exponential growth of Moore's law, and we wouldn't be having this conversation.
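Spelling out the arithmetic from the Mac Plus comparison above (just reproducing the numbers already quoted, nothing new):

    transistors_1979, clock_1979 = 6.8e4, 8e6      # Motorola 68000
    transistors_2023, clock_2023 = 6.7e10, 3.7e9   # Apple M2 Max

    raw_ratio = (transistors_2023 / transistors_1979) * (clock_2023 / clock_1979)
    print(f"naive transistor*clock ratio: {raw_ratio:.2e}")   # ~4.6e8, i.e. ~500 million

    doublings = 44 / 1.5    # 44 years at one doubling every 18 months
    print(f"Moore's-law doublings: {doublings:.0f} -> {2 ** doublings:.2e}x")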
> 1. What languages do you think are good ones for HPC?
That's still an open question. IMHO little or no research has happened there, because we didn't have the real multicore CPUs mentioned above. For a little context:
Here are some aspects of good programming languages:
https://www.linkedin.com/pulse/key-qualities-good-programmin...
https://www.chakray.com/programming-languages-types-and-feat...
Abstractability
Approachability
Brevity
Capability
Consistency
Correctness
Efficiency
Interactivity
Locality
Maintainability
Performance
Portability
Productivity
Readability
Reliability
Reusability
Scalability
Security
Simplicity
Testability
Most languages only shine for a handful of these. And some don't even seem to try.

For example: OpenCL is trying to address parallelization in all of the wrong ways, by copying the wrong approach (CUDA). Here's a particularly poor example, the top hit on Google:
Calculate X[i] = pow(X[i],2) in 200-ish lines of code:
https://github.com/rsnemmen/OpenCL-examples/blob/master/Hell...
Octave/MATLAB addresses parallelization in all of the right ways, through the language of spreadsheets and matrices:
Calculate X[i] = pow(X[i],2) in 1 line of code as x .^ y or power(x, y):
https://docs.octave.org/v4.4.0/Arithmetic-Ops.html#XREFpower
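For comparison in a more mainstream language, the NumPy equivalent is similarly terse (my illustration, not something from the linked docs):

    import numpy as np

    x = np.arange(1_000_000, dtype=np.float64)
    y = x ** 2   # elementwise square, like Octave's x .^ 2: one line, no hand-written kernel

The vectorized loop is dispatched to optimized native code, which is the property the OpenCL example above spends ~200 lines to get.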
But unfortunately neither OpenCL nor Octave/MATLAB currently recruit multicore or cluster computers effectively. I think there was research in the 80s and 90s on that, but the languages were esoteric or hand-rolled. Basically hand out a bunch of shell scripts to run and aggregate the results. It all died out around the same time as Beowulf cluster jokes did:
https://en.wikipedia.org/wiki/Computer_cluster
Here's one called Emerald:
https://medium.com/@mwendakelvinblog/emerald-the-language-of...
http://www.emeraldprogramminglanguage.org
https://emeraldlang.github.io/emerald/
After 35 years of programming experience, here's what I want:
- Multi-element operations by a single operator like in Octave/MATLAB or shader languages like HLSL
- All variables const (no mutability) with get/set between one-shot executions like the suspend-the-world io of ClojureScript and REDUX state
- No monads/futures/async or borrow checker like in Rust (const negates their need, just use 2x memory rather than in-place mutation)
- Pass-by-value copy-on-write semantics for all arguments like with PHP arrays (pass-by-reference classes broke PHP 5+)
- Auto-parallelization of flow control and loops via static analysis of intermediate code (no pragmas/intrinsics, const avoids side effects)
- Functional and focused on higher-order methods like map/reduce/filter, de-emphasize Java-style object-oriented classes
- Smart collections like JavaScript classes "x.y <=> x[y]" and PHP arrays "x[y] <=> x[n]", instead of "pure" set/map/array
- No ban on multiple inheritance, no final keyword, let the compiler solve inheritance constraints
- No imports, give us everything and the kitchen sink like PHP, let the compiler strip unused code
- Parse infix/prefix/postfix notation equally with a converter like goformat, with import/export to spreadsheet and (graph) database
It would be a pure functional language (impurity negates the benefits of cryptic languages like Haskell), easier to read and more consistent than JavaScript, with simple fork/join like Go/Erlang and single-operator math on array/matrix elements like Octave/MATLAB. Kind of like standalone spreadsheets connected by shell pipes over the web, but as code in a file. Akin to Jupyter notebooks I guess, or Wolfram Alpha, or literate programming.

Note that this list flies in the face of many best practices today. That's because I'm coming from using languages like HyperTalk that cater to the user rather than the computer.
And honestly I'm trying to get away from languages. I mostly use #nocode spreadsheets and SQL now. I wish I could make cross-platform apps with spreadsheets (a bit like Airtable and Zapier) and had a functional or algebra of sets interface for databases.
It would take me at least a year or two and $100,000-250,000 minimum to write an MVP of this language. It's simply too ambitious to make in my spare time.
Sorry about length and delay on this!
I'm concerned that this solver is not being researched enough before machine learning and AI arrive. We'll just gloss over it like we did by jumping from CPU to GPU without the HPC step between them.
At this point I've all but given up on real resources being dedicated to this. I'm basically living on another timeline that never happened, because I'm seeing countless obvious missed steps that nobody seems to be talking about. So it makes living with mediocre tools painful. I spend at least 90% of my time working around friction that doesn't need to be there. In a very real way, it negatively impacts my life, causing me to lose years writing code manually that would have been point and click back in the Microsoft Access and FileMaker days, that people don't even think about when they get real work done with spreadsheets.
TL;DR: I want a human-readable language like HyperTalk that's automagically fully optimized across potentially infinite cores, between where we are now with C-style languages and the AI-generated code that will come from robotic assistants like J.A.R.V.I.S.