I imagine a carefully crafted set of programming primitives used to build up the abstraction of a CPU…
“Every ALU operation is a trained neural network.”
Oh… oh. Fun. Just not the type of “interesting” I was hoping for.
[1]: https://breandan.net/2020/06/30/graph-computation#roadmap
The cost of communicating information through space is dealt with in fundamentally different ways here. On the CPU it is addressed directly. The actual latency is minimized as much as possible, usually by predicting the future in various ways and keeping the spatial extent of each device (core complex) as small as possible. The GPU hides latency with massive parallelism. That's why we can put them across relatively slow networks and still see excellent performance.
Latency hiding cannot deal well in workloads that are branchy and serialized because you can only have one logical thread throughout. The CPU dominates this area because it doesn't cheat. It directly targets the objective. Making efficient, accurate control flow decisions tends to be more valuable than being able to process data in large volumes. It just happens that there are a few exceptions to this rule that are incredibly popular.
How do you class systems like the PS5 that have an APU plugged into GDDR instead of regular RAM? The primary remaining issue is the limited memory capacity.
I wonder if we might see a system with GPU class HBM on the package in lieu of VRAM coupled with regular RAM on the board for the CPU portion?
I can see the same happening to the CPU. It will just take on the appropriate functionality to keep all the compute in the same chip.
It’s gonna take awhile because Nvidia et al like their moats.
So, an iGPU/APU? Those exist already. Regardless, the most GPU-like CPU architecture in common use today is probably SPARC, with its 8-way SMT. Add per-thread vector SIMD compute to something like that, and you end up with something that has broadly similar performance constraints to an iGPU.
Also, is it possible to use the GPU's ADD/MUL implementation? It is what a GPU does best.
As to why not use the ADD/MUL capabilities of the GPU itself, I guess it wasn’t in the spirit of the challenge. ;)
Most GPUs, sitting in racks in datacenters, aren't "processing graphics" anyhow.
Gross-Parallelization Units
Generative Procedure Units
Gratuitously Profiteering Unscrupulously
[0]: https://en.wikipedia.org/wiki/General-purpose_computing_on_g...
CPU = Compute
GPU = ImputeThank you, Mr. Do-because-I-can!
Yours truly,
- GPU company CEO,
- Electric company CEO.
For a similar case, check Eforth+Subleq. If this guy can emulate subleq CPU under a GPU (something like 5 lines under C for the implementation, the rest it's C headers and the file opening function), it can run Eforth and maybe Sokoban.
Wow. That's cool but what happens to the regular CPU?
For that a completely different approach would be needed, e.g. by implementing something akin to qemu, where each CPU instruction would be translated into a graphic shader program. On many older GPUs, it is impossible or difficult to launch a graphic program from inside a graphic program (instead of from the CPU), but where this is possible one could obtain a CPU emulation that would be many orders of magnitude faster than what is demonstrated here.
Instead of going for speed, the project demonstrates a simpler self-contained implementation based on the same kind of neural networks used for ML/AI, which might work even on an NPU, not only on a GPU.
Because it uses inappropriate hardware execution units, the speed is modest and the speed ratios between different kinds of instructions are weird, but nonetheless this is an impressive achievement, i.e. simulating the complete Aarch64 ISA with such means.
You really think having a shader per CPU-instruction is going to get you closer to the highest possible speed one can achieve?
It's slower than real cpu code obviously but still crazy fast for 'thinking' about it. They wouldn't need to actually simulate an entire program in a never ending hot loop like a real computer. Just a few loops would explain a lot about a process and calculate a lot of precise information.
This is all a computer does :P
We need llms to be able to tap that not add the same functionality a layer above and MUCH less efficiently.
Agents, tool-integrated reasoning, even chain of thought (limited, for some math) can address this.
This is way cooler though! Instead of efficiently running a neural network on a CPU, I can inefficiently run my CPU on neural network! With the work being done to make more powerful GPUs and ASICs I bet in a few years I'll be able to run a 486 at 100MHz(!!) with power consumption just under a megawatt! The mind boggles at the sort of computations this will unlock!
Few more years and I'll even be able to realise the dream of self-hosting ChatGPT on my own neural network simulated CPU!