Everything was slow because sorting was taking a lot of time. Sorting was slow because its comparator was taking ~6 read locks on every comparison, and was cloning large structures to avoid holding the lock for a long time. The first fix was to access just the information needed to avoid the clones, the second fix was to cache exactly the data needed for sorting after the underlying data was updated, and use that for the comparators without needing to take the underlying lock.
I'm looking forward to the next post about how cache consistency is tough.
This whole thing is summed up by some pretty basic physics. What you actually want to minimize is the communication of information between physical cores. Nothing else really matters. Certainly not terminology or clever tricks that effectively try to cheat thermodynamics. The cost of communicating information is almost always much more substantial than the cost of computing over that same information. The ALU is not nearly as power hungry as infinity fabric.
It seems all of these Rust performance warriors have no understanding of computer architecture fundamentals.