Also, the benchmark is clock-for-clock, so while the older Phenom II looks like it's ahead, the Bulldozer should be able to go faster still.
All that said, I really enjoyed this retrospective look.
And perhaps most importantly: 4x decoders/4x L1 iCache. IIRC, the entire damn chip was decoder-bound.
--------
Note: AMD Zen has 4x Integer pipelines and 4x FPU pipelines __PER CORE__. Modern high-performance systems CANNOT have a single 2x-pipeline FPU shared between two cores (averaging one pipeline per core). Modern Zen is closer to 4x pipelines per core, maybe more depending on how you count load/store units.
Shrinking the decoder on Bulldozer was clearly the wrong move for FX-series / AMD. Today's chips are going wide decoder (ex: Apple can do 8x decode per clock tick), deep opcode cache (AMD Zen has a large opcode cache allowing for 6x way lookup per clock tick), or Intel's new and interesting multiple-decoder thing.
> Leapfrogging fetch and decode clusters have been a distinguishing feature of Intel’s E-Core line ever since Tremont. Skymont doubles down by adding another decode cluster, for a total of three clusters capable of decoding a total of nine instructions per cycle.
They want you to write code that takes advantage of their speedups. Agner Fog is a better writer (a sibling comment already linked to Agner Fog's stuff). But I also like referencing the official manuals and whitepapers as a primary source document.
Hard to beat Intel's documents on Intel chips after all.
FX cores had their issues. But one of them was that AMD bet too early, and too hard, that the future was to have a high number of cores.
You can easily see the multithreaded workloads there because you have the six-core 3960X as a comparison too.
It's almost 10 years old, so I can't complain. And I think I got a check for $2 or something like that from the class-action suit.