Pleasantly surprised by the technical article :). Reminds me of a lot of the precise-number-of-idle-cycles functions from the embedded world in a prior life.
I think volatile is usually fairly well contained. I can't immediately think of an obvious case where it would indirectly affect the performance of the calling context, but it's pushing the problem onto the cache coherence part of the CPU, so it's not quite the same as an algorithm that consumes all the memory bandwidth or execution ports. There are lots of interesting little trade-offs in the weeds of this and other potential solutions, depending on whether any of this matters.
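For reference, a rough sketch in C of the kind of idle-loop function I mean. `spin_delay` is a made-up name, not something from the article, and the cycles per pass are not fixed: they depend on the compiler, flags, and CPU, so treat it as an illustration only.

```c
/* Hypothetical helper (name is an assumption, not from the article):
   spin for `iterations` loop passes and return the final count.
   The volatile counter forces a real load and store on every pass,
   so the compiler cannot delete the loop or collapse it into one add. */
static unsigned long spin_delay(unsigned long iterations)
{
    volatile unsigned long counter = 0;

    while (counter < iterations) {
        counter = counter + 1;  /* one volatile read, one volatile write */
    }

    return counter;  /* equals `iterations` once the loop exits */
}
```

The cost stays inside the function, since only `counter` is volatile and the caller's own variables can still be optimized as usual.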
I was looking for some script, like count primes, that I could run.