But the changes also highlight a shift in focus, from just implementing this naively (RDNA3 is technically not too far removed from the naive raytracer I wrote) to something carefully engineered and optimized for memory bandwidth (with bandwidth-saving circuits even built into silicon?).
What I do wonder: as you mention, older chips could probably use the more optimized structures via software (after all, my naive-ish raytracer is written entirely in OpenGL and could be modified to use these structures instead). With memory being the big pain point, which hardware optimizations/specializations are most relevant for getting big gains compared to what can be done in "microcode"? Circuitry for triangle intersections and bit-unpacking, presumably, but considering stack management there are probably other parts left to microcode.
You can compute a ton per bit transferred from DRAM, on both CPUs and GPUs.
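To make that compute-for-bandwidth trade concrete, here is a minimal sketch of the kind of bit-unpacking being discussed, done in software. It stores a child bounding box as six 8-bit values on a grid shared with the parent node, instead of six 32-bit floats. The layout (8 bits per coordinate, one shared power-of-two exponent, the function names) is a hypothetical illustration of the general idea, not the actual RDNA node format.

```python
import math

def quantize_box(lo, hi, origin, exp):
    """Pack one AABB into 6 bytes on a grid with cell size 2**exp.

    Assumes the box lies within 255 cells of `origin` on every axis
    (hypothetical layout, chosen only for illustration).
    """
    scale = 2.0 ** exp
    # Round mins down and maxes up so the decoded box always encloses the original.
    qlo = [max(0, min(255, math.floor((l - o) / scale))) for l, o in zip(lo, origin)]
    qhi = [max(0, min(255, math.ceil((h - o) / scale))) for h, o in zip(hi, origin)]
    return bytes(qlo + qhi)

def dequantize_box(packed, origin, exp):
    """Unpack 6 bytes back into float bounds: one multiply-add per coordinate."""
    scale = 2.0 ** exp
    lo = [o + q * scale for o, q in zip(origin, packed[0:3])]
    hi = [o + q * scale for o, q in zip(origin, packed[3:6])]
    return lo, hi

origin = (0.0, 0.0, 0.0)
exp = -4  # grid cell size of 1/16
lo, hi = (0.1, 0.2, 0.3), (1.0, 2.5, 3.33)
packed = quantize_box(lo, hi, origin, exp)
dlo, dhi = dequantize_box(packed, origin, exp)
```

The decoded box is slightly larger than the original, so traversal stays correct at the cost of a few extra box hits. The unpack is a handful of multiply-adds per child, which is cheap next to fetching four times as many bytes from DRAM.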