A CPU that runs entirely on GPU
131 points
8 hours ago
| 21 comments
| github.com
| HN
jagged-chisel
2 hours ago
[-]
“A CPU that runs entirely on the GPU”

I imagine a carefully crafted set of programming primitives used to build up the abstraction of a CPU…

“Every ALU operation is a trained neural network.”

Oh… oh. Fun. Just not the type of “interesting” I was hoping for.

reply
koolala
1 hour ago
[-]
Isn't it interesting that it doesn't instantly crash from a precision error? That sounds carefully crafted to me.
reply
robertcprice1
7 minutes ago
[-]
Please tell me what you had in mind so I can try something different!
reply
robertcprice1
5 minutes ago
[-]
Hey everyone, thank you for taking a look at my project. This was purely a “can I do it” type deal, but ultimately my goal is to make an OS that runs purely on the GPU, or one composed of learned systems.
reply
user____name
45 minutes ago
[-]
Someone needs to implement llvmpipe to target this ISA; then one can run software OpenGL emulation and call it "hardware accelerated".
reply
bob1029
3 hours ago
[-]
A fun experiment but I wonder how many out there seriously think we could ever completely rid ourselves of the CPU. It seems to be a rising sentiment.

The cost of communicating information through space is dealt with in fundamentally different ways here. On the CPU it is addressed directly. The actual latency is minimized as much as possible, usually by predicting the future in various ways and keeping the spatial extent of each device (core complex) as small as possible. The GPU hides latency with massive parallelism. That's why we can put them across relatively slow networks and still see excellent performance.

Latency hiding cannot deal well with workloads that are branchy and serialized, because you can only have one logical thread throughout. The CPU dominates this area because it doesn't cheat. It directly targets the objective. Making efficient, accurate control flow decisions tends to be more valuable than being able to process data in large volumes. It just happens that there are a few exceptions to this rule that are incredibly popular.

reply
fc417fc802
2 hours ago
[-]
> I wonder how many out there seriously think we could ever completely rid ourselves of the CPU.

How do you class systems like the PS5 that have an APU plugged into GDDR instead of regular RAM? The primary remaining issue is the limited memory capacity.

I wonder if we might see a system with GPU class HBM on the package in lieu of VRAM coupled with regular RAM on the board for the CPU portion?

reply
chris_money202
1 hour ago
[-]
I don’t think the remaining issue is memory capacity. CPUs are designed to handle nonlinear memory access, and that is how all modern software targeting a CPU is written. GPUs are designed for linear memory access. These are fundamentally different access patterns; the optimal solution is to have two distinct processing units.
reply
zozbot234
43 minutes ago
[-]
If anything, GPUs combine large private per-compute-unit address spaces with a separate shared/global memory, which doesn't mesh very well with linear memory access, just high locality. You can kinda get the same arrangement on a CPU by pushing NUMA (Non-Uniform Memory Access: only the "global" memory is truly unified on a GPU!) to the extreme, but that's quite uncommon. "Compute-in-memory" is a related idea that points to the same constraint: these days you want to maximize spatial locality, because moving data in bulk is an expensive operation that burns power.
reply
volemo
2 hours ago
[-]
I see us not getting rid of the CPU, but the CPU and GPU eventually being consolidated into one system of heterogeneous computing units.
reply
nine_k
1 hour ago
[-]
CPUs and GPUs have very different ways of scheduling instructions, requiring somewhat different interfaces and programming models. I'd hazard to say that a GPU and CPU with unified memory access (like Apple's M series, and most mobile chips) is already such a consolidated system.
reply
jagged-chisel
1 hour ago
[-]
Agreed. Much like “RISC is gonna replace everything” - it didn’t. Because the CPU makers incorporated lessons from RISC into their designs.

I can see the same happening to the CPU. It will just take on the appropriate functionality to keep all the compute in the same chip.

It’s gonna take a while because Nvidia et al. like their moats.

reply
zozbot234
1 hour ago
[-]
> It will just take on the appropriate functionality to keep all the compute in the same chip.

So, an iGPU/APU? Those exist already. Regardless, the most GPU-like CPU architecture in common use today is probably SPARC, with its 8-way SMT. Add per-thread vector SIMD compute to something like that, and you end up with something that has broadly similar performance constraints to an iGPU.

reply
jleyank
32 minutes ago
[-]
How is this different from the (various?) efforts back then to build a machine based on the Intel i860? Didn’t work, although people gave it a good try.
reply
nomercy400
3 hours ago
[-]
I was taught years ago that MUL and ADD can be implemented in one or a few cycles. They can be the same complexity. What am I missing here?

Also, is it possible to use the GPU's ADD/MUL implementation? It is what a GPU does best.

reply
volemo
2 hours ago
[-]
To multiply two arbitrary numbers in a single cycle, you need to include dedicated hardware in your ALU; without it, you have to combine several additions and logical shifts.

As to why not use the ADD/MUL capabilities of the GPU itself, I guess it wasn’t in the spirit of the challenge. ;)

reply
andrewdb
3 hours ago
[-]
Why do we call them GPUs these days?

Most GPUs, sitting in racks in datacenters, aren't "processing graphics" anyhow.

reply
xeonmc
3 hours ago
[-]
General Processing Units

Gross-Parallelization Units

Generative Procedure Units

Gratuitously Profiteering Unscrupulously

reply
incognito124
2 hours ago
[-]
Greed Processing Units
reply
wartywhoa23
8 minutes ago
[-]
This is just brilliant!
reply
jgtrosh
3 hours ago
[-]
The dedicated term GPGPU [0] didn't catch on.

[0]: https://en.wikipedia.org/wiki/General-purpose_computing_on_g...

reply
CompuHacker
2 hours ago
[-]

  CPU = Compute
  GPU =  Impute
reply
deep1283
5 hours ago
[-]
This is a fun idea. What surprised me is the inversion where MUL ends up faster than ADD because the neural LUT removes sequential dependency while the adder still needs prefix stages.
reply
wartywhoa23
10 minutes ago
[-]
Oh these brave new ways to paraphrase the good old "fuck fuel economy"...

Thank you, Mr. Do-because-I-can!

Yours truly,

- GPU company CEO,

- Electric company CEO.

reply
DonThomasitos
1 hour ago
[-]
I don’t understand why you would train a NN for an operation like sqrt that the GPU supports in silicon.
reply
nine_k
57 minutes ago
[-]
I see it as a practical joke or a fun hack, like CPUs implemented in the Game of Life, or in Minecraft.
reply
koolala
1 hour ago
[-]
Exciting if an AI that is helping with its own improvement finds this and incorporates it into its own architecture. Then it starts reading and running all the world's binaries and gains intelligence as a fully actualized "computer", finally becoming both a master of language and of binary bits, thinking in poetry and in pure, precise numerical calculations.
reply
lorenzohess
6 hours ago
[-]
Out of curiosity, how much slower is this than an actual CPU?
reply
bastawhiz
6 hours ago
[-]
Based on addition and subtraction, around 625,000x slower than a 2.5 GHz CPU.
reply
medi8r
4 hours ago
[-]
So it could run Doom?
reply
medi8r
2 hours ago
[-]
Oh I forgot to Doom scroll.
reply
binsquare
3 hours ago
[-]
Can we run doom inside of doom yet?
reply
PowerElectronix
2 hours ago
[-]
What a time to be alive
reply
anthk
1 hour ago
[-]
Doom is easy. Better the Z-machine with an interpreter based on DFrotz, or another port. Then the game can even run on a Game Boy.

For a similar case, check eForth+SUBLEQ. If this guy can emulate a SUBLEQ CPU on a GPU (the core is something like 5 lines of C; the rest is headers and the file-opening function), it can run eForth and maybe Sokoban.

reply
artemonster
2 hours ago
[-]
Every clueless person who suggests that we move to GPUs entirely has zero idea how things work, and is basically suggesting using Lambos to plow fields and tractors to race in NASCAR.
reply
madwolf
15 minutes ago
[-]
Bad comparison. Lambos are regularly plowing fields and they're quite good at it. https://www.lamborghini-tractors.com/en-eu/
reply
sudo_cowsay
6 hours ago
[-]
"Multiplication is 12x faster than addition..."

Wow. That's cool but what happens to the regular CPU?

reply
adrian_b
5 hours ago
[-]
This CPU simulator does not attempt to achieve the maximum speed that could be obtained when simulating a CPU on a GPU.

For that, a completely different approach would be needed, e.g. implementing something akin to QEMU, where each CPU instruction would be translated into a graphics shader program. On many older GPUs it is impossible or difficult to launch a graphics program from inside another graphics program (instead of from the CPU), but where this is possible, one could obtain a CPU emulation many orders of magnitude faster than what is demonstrated here.

Instead of going for speed, the project demonstrates a simpler self-contained implementation based on the same kind of neural networks used for ML/AI, which might work even on an NPU, not only on a GPU.

Because it uses inappropriate hardware execution units, the speed is modest and the speed ratios between different kinds of instructions are weird, but this is nonetheless an impressive achievement: simulating the complete AArch64 ISA by such means.

reply
5o1ecist
4 hours ago
[-]
> where each CPU instruction would be translated into a graphic shader program

You really think having a shader per CPU-instruction is going to get you closer to the highest possible speed one can achieve?

reply
koolala
2 hours ago
[-]
If it's bindless and pre-compiled, why not? What's a faster way?
reply
throawayonthe
2 hours ago
[-]
very tangentially related is whatever vectorware et al are doing: https://www.vectorware.com/blog/
reply
RagnarD
6 hours ago
[-]
Being able to perform precise math in an LLM is important, glad to see this.
reply
koolala
2 hours ago
[-]
That would be cool: a way to read CPU assembly/bytecode and then think in it.

It's slower than real CPU code, obviously, but still crazy fast for 'thinking' about it. They wouldn't need to actually simulate an entire program in a never-ending hot loop like a real computer; just a few loops would explain a lot about a process and calculate a lot of precise information.

reply
jdjdndnzn
6 hours ago
[-]
Just want to point out that this comment is highly ironic.

This is all a computer does :P

We need LLMs to be able to tap into that, not add the same functionality a layer above, MUCH less efficiently.

reply
Nuzzerino
6 hours ago
[-]
> We need LLMs to be able to tap into that, not add the same functionality a layer above, MUCH less efficiently.

Agents, tool-integrated reasoning, even chain of thought (limited, for some math) can address this.

reply
RagnarD
4 hours ago
[-]
You're both completely missing the point. It's important that an LLM be able to perform exact arithmetic reliably without a tool call. Of course the underlying hardware does so extremely rapidly; that's not the point.
reply
kruffalon
52 minutes ago
[-]
Could you explain why that is?
reply
koolala
37 minutes ago
[-]
A tool call is like 100,000,000x slower, isn't it?
reply
jdjdndnzn
2 hours ago
[-]
The computer ALREADY does do math reliably. You are missing the point.
reply
5o1ecist
4 hours ago
[-]
Why?
reply
nicman23
6 hours ago
[-]
can i run linux on a nvidia card though?
reply
micw
5 hours ago
[-]
Linux runs everywhere
reply
volemo
2 hours ago
[-]
Except on my stupid iPad “Pro”. :(
reply
mrlonglong
5 hours ago
[-]
Now I've seen it all. Time to die.. (meant humourously)
reply
Surac
6 hours ago
[-]
Well, GPUs are just special-purpose CPUs.
reply
MadnessASAP
5 hours ago
[-]
Ya know, just today I was thinking about a way to compile a neural network down to assembly: matching and replacing neural network structures with their closest machine-code equivalents.

This is way cooler though! Instead of efficiently running a neural network on a CPU, I can inefficiently run my CPU on neural network! With the work being done to make more powerful GPUs and ASICs I bet in a few years I'll be able to run a 486 at 100MHz(!!) with power consumption just under a megawatt! The mind boggles at the sort of computations this will unlock!

Few more years and I'll even be able to realise the dream of self-hosting ChatGPT on my own neural network simulated CPU!

reply