I see a lot of comments here saying how underwhelming it was. And that’s probably true.
But one thing always surprised me: the games released at the platform's end of life were gorgeous. Developers became so good at extracting all the power from the platform (and it had plenty, in its difficult-to-use way) that great things were achieved.
> A suggestion from me is to provide sources, and also maybe an epub of this
What do you mean?
I knew IBM was involved in the design of the Cell BE, but I had no idea a successor of IBM's token ring tech (at least the concept of it) lived on in it. I'm sure there was other hardware (probably mainframe hardware) in and before 2006 with similar interconnects.
The big issue with the Cell architecture is that it was designed to act as the GPU as well. Later in development they realized it wouldn't be powerful enough for those graphics and they'd still need a dedicated GPU in addition. That's why the Cell is such a franken-CPU compared to the vanilla IBM PowerPC it's based on.
The Cell architecture was also a product of its time. In the early 00s, when Cell development started, nobody would have expected x86 to make such leaps by the time the PS3 hit the market.
Was the PS3 the one that was banned from some countries? And wasn't the PS2 rumored to be used as a ballistic missile guidance chip for some country?
I do have the odd anecdote: way back in the day, I was in a CompUSA in Dearborn, MI and overheard a Middle Eastern guy at the counter asking if they had any PS2s. When they said no (this was a point where availability was low), he instead bought at least 5 (might have been 10?) PS1s.
Because emulators still work insanely hard to make those games work, even today.
Development kits for the Xbox 360 used Power Mac G5s, because they were the same architecture as the Xbox 360, and modern Xbox and PlayStation development kits use x86 processors, again because there's no change in architecture.
Granted, you can't easily get a computer with a Cell processor, but it isn't for lack of trying. Sony worked with IBM and Toshiba on designing and manufacturing the Cell processor, and all three developed products using it, but the only successful one was Sony's PlayStation 3, and its success was likely despite the Cell processor, not because of it.
Why go through the pain of designing such a thing, which makes life difficult for developers, when I don't think it would really have resulted in better performance?
[1] And home computers, but that ended a couple decades earlier than consoles.
i always found this very appealing, having blazing fast memory under programmer control. so i wonder: why don't we have that on other cpus?
"The local store does not operate like a conventional CPU cache since it is neither transparent to software nor does it contain hardware structures that predict which data to load."
I think the general term for this is scratchpad memory. https://en.wikipedia.org/wiki/Scratchpad_memory
This kind of indicates the problem with it. When switching tasks, each local store would have to be put into main RAM and the new task's local stores pulled back out. This would make switching tasks increasingly expensive. I believe the PS3 (and maybe all cell processors) dealt with this by not having tasks switch on the SPUs.
Pure speculation from my side, but I'd think that the advantages over traditional big register banks and on-chip caches are not that great, especially when you're writing 'cache-aware code'. You also need to consider that the PS3 was full of design compromises to keep cost down, e.g. there simply might not have been enough die space for a cache controller for each SPU, or the die space was more valuable spent on a few more kilobytes of static scratch memory than on the cache logic.
Also, AFAIK some GPU architectures have something similar, like per-core static scratch space; that's where restrictions come from such as uniform data per shader invocation being limited to 64 KB on some architectures, etc...
This is where a lot of their performance comes from.
The main disadvantage of such dedicated memory is inefficient usage compared to using that same amount of fast local memory to cache _all_ of main memory.
https://en.wikipedia.org/wiki/Xbox_360_technical_specificati...
In general most developers struggled to do much with it; it was just too small (combined with the fiddliness of using it).
PS2 programmers were very used to thinking this way, as it's how rendering had to be done. There are a couple of vector units, and one of them is connected to the GPU, so the general structure most developers followed was to have 4 buffers in the VU memory (I think it only had 16kb of memory or something pretty small). Essentially, in parallel you'd have:
1. New data being DMA'd in from main memory to VU memory (into, say, buffer 1/4).
2. Previous data in buffer 3/4 being transformed, lit, coloured, etc. and output into buffer 4/4.
3. Data from buffer 2/4 being sent/rendered by the GPU.
Then once the above had finished it would flip, so you'd alternate like:
Data in: B1 (main memory to VU)
Data out: B2 (VU to GPU)
Data process from: B3 (VU processing)
Data process to: B4 (VU processing)

then on the next pass:

Data in: B3
Data out: B4
Data process from: B1
Data process to: B2
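If it helps, the rotation looks roughly like this in C-ish pseudocode; dma_to_vu, run_vu_program, send_to_gs and wait_all are made-up stand-ins for the real DMA-chain/VIF/GIF mechanics, not actual SDK calls, stubbed out so the sketch compiles.

    /* Rough sketch of the quad-buffer rotation described above. */
    struct batch { int dummy; /* vertex data, counts, etc. */ };

    static void dma_to_vu(int buf, const struct batch *b) { (void)buf; (void)b; }
    static void run_vu_program(int src, int dst)          { (void)src; (void)dst; }
    static void send_to_gs(int buf)                       { (void)buf; }
    static void wait_all(void)                            { /* real code syncs via DMA tags */ }

    enum { B1, B2, B3, B4 };

    void render_batches(const struct batch *batches, int count)
    {
        int in   = B1;  /* buffer being filled by DMA from main memory   */
        int send = B2;  /* buffer being drained to the GPU               */
        int work = B3;  /* buffer whose verts the VU is transforming     */
        int out  = B4;  /* buffer the VU writes transformed verts into   */

        for (int i = 0; i < count; ++i) {
            dma_to_vu(in, &batches[i]);  /* upload the next batch            */
            run_vu_program(work, out);   /* transform/light the previous one */
            send_to_gs(send);            /* GPU draws an earlier one         */
            wait_all();

            /* flip: (in, send) and (work, out) swap roles */
            int t;
            t = in;   in   = work;  work = t;
            t = send; send = out;   out  = t;
        }
    }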
The VU has two pipelines running in parallel (float and integer), and every instruction takes an exact number of cycles; if you read a result before it is ready you stall the pipeline, so you had to painstakingly interleave and order your instructions to process three verts at a time and be very clever about register pressure etc.
There is obviously some clever syncing logic to allow all of this to work, allowing the DMA to wait until the VU kicks off the next GPU batch etc.
It was complex to get your head around, set up all the moving parts, and debug when it goes wrong. When it goes wrong it pretty much just hangs, so you had to write a lot of validators.

On PS2 you basically spend the frame building up a huge DMA list, and then at the end of the frame kick it off and it renders everything: the DMA will transfer VU programs to the VU, upload data to the VU, wait for it to process and upload the next batch, at the end upload the next program, upload settings to GPU registers, basically everything. Once that DMA is kicked off no more CPU code is involved in rendering the frame, so you have a MB or so of pure memory transfer instructions firing off, and if any of them are wrong you are in a world of pain.
Then, just to keep things interesting, throw in the fact that anything you write to memory is likely stuck in caches, and DMA doesn't see caches, so extra care has to be taken to make sure caches are flushed before using DMA.
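Roughly the shape of it, with made-up helper names; flush_data_cache() here stands in for whatever writes the data cache back (FlushCache on the PS2 EE, IIRC), and dma_append/dma_kick are illustrative, not real library calls.

    /* Rough shape of a PS2-style frame: build one big DMA chain in memory,
       flush the data cache so the DMA controller sees what you wrote, then
       kick it off.  All helper names are made up for illustration.         */
    #include <stdint.h>
    #include <string.h>

    static uint32_t dma_buf[256 * 1024] __attribute__((aligned(64)));  /* ~1 MB chain */
    static uint32_t *dma_ptr;

    static void dma_begin(void) { dma_ptr = dma_buf; }

    static void dma_append(const void *packet, size_t words)
    {
        memcpy(dma_ptr, packet, words * 4);  /* append a transfer packet to the chain */
        dma_ptr += words;
    }

    static void flush_data_cache(void) { /* write back the D-cache, e.g. FlushCache(0) on the EE */ }
    static void dma_kick(const void *chain) { (void)chain; /* start the DMA channel */ }

    void build_and_kick_frame(void)
    {
        dma_begin();

        /* during the frame, everything becomes packets in the chain:
           VU microprograms, vertex batches, GPU register settings, textures... */
        static const uint32_t example_packet[4] = { 0, 0, 0, 0 };  /* placeholder */
        dma_append(example_packet, 4);

        flush_data_cache();  /* anything still sitting in the cache is invisible to DMA */
        dma_kick(dma_buf);   /* from here the frame renders with no CPU involvement     */
    }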
It was a magical, horrible, wonderful, painful, joyous, impossible, satisfying, sickening, amazing time.
We do, it's called "cache" or "registers".
In some ways it's like cache: it has the latency of L1 cache (6 cycles), but it's fully deterministic in terms of access.
registers ok, but i want at least one megabyte of them :)
The PS3 only had 256MB of main memory, so you'd be pretty limited there. Memory bandwidth, great at the time, is pretty poor by today's standards (25 GB/s).