https://stackoverflow.com/questions/33902068/what-setup-does...
Non-temporal stores are tricky performance-wise. They can be dramatically faster than normal stores (~3x), but the benefit varies between CPU generations, they can be slower if subsequent code needs the destination in the CPU cache, and even for GPU-bound data they may not be ideal if an iGPU shares part of the cache hierarchy with the CPU. But the worst issue is that occasionally a specific CPU will have some random pathological behavior with them. IIRC, masked non-temporal stores were horrifically slow on some AMD APUs, on the order of hundreds to thousands of cycles per instruction. I find it hard to recommend them much anymore.
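For anyone who hasn't used them, here is a minimal sketch of what a non-temporal store loop looks like in C. `copy_streaming` is a hypothetical helper name, not something from the linked question; it assumes a 16-byte-aligned destination and a length that is a multiple of 16, and it falls back to plain `memcpy` when SSE2 isn't available. Whether it beats a normal copy depends entirely on the CPU and on whether the destination is read again soon, so benchmark before trusting it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#endif

/* Hypothetical helper: copy n bytes using non-temporal (streaming) stores.
 * Assumptions: dst is 16-byte aligned and n is a multiple of 16. */
void copy_streaming(void *dst, const void *src, size_t n)
{
    assert(((uintptr_t)dst % 16) == 0 && (n % 16) == 0);
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < n / 16; ++i) {
        __m128i v = _mm_loadu_si128(s + i);  /* ordinary (cached) unaligned load */
        _mm_stream_si128(d + i, v);          /* store with a non-temporal hint */
    }
    _mm_sfence();  /* order the streaming stores before any later stores */
#else
    memcpy(dst, src, n);  /* portable fallback: ordinary stores */
#endif
}
```

If the very next thing the code does is read `dst` back, this is likely the wrong tool, since the whole point of the hint is to keep the data out of the cache.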
it is never used with a prefix (the value would be overwritten for each repetition)
...which is still useful for extreme size-optimisation; I remember seeing "rep lodsb" in a demo, as a slower-but-tiny (2 bytes) way of [1] adding cx to si, [2] zeroing cx, [3] putting the byte at [cx + si - 1] into al, and [4] conditionally leaving al and si unchanged if cx is 0, all effectively as a single instruction. Not something any optimising compiler I know of would be able to do, but perhaps within the possibility of an LLM these days.
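To spell out those four effects, here is a rough C model of 16-bit `rep lodsb`. It is only a sketch: the `regs16` struct and the `rep_lodsb` function are made-up names, the direction flag is assumed clear, and segmentation is ignored (`mem` stands in for the data segment).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register file: only the registers rep lodsb touches. */
typedef struct {
    uint16_t si;
    uint16_t cx;
    uint8_t  al;
} regs16;

/* Model of 16-bit `rep lodsb` with DF = 0.
 * Each iteration loads al from [si], increments si, and decrements cx.
 * With cx == 0 on entry the loop body never runs, so al and si are untouched. */
void rep_lodsb(regs16 *r, const uint8_t *mem)
{
    assert(r != NULL && mem != NULL);
    while (r->cx != 0) {
        r->al = mem[r->si];             /* every load but the last is overwritten */
        r->si = (uint16_t)(r->si + 1);  /* 16-bit wraparound, as on real hardware */
        r->cx = (uint16_t)(r->cx - 1);
    }
}
```

Starting from si = 4 and cx = 10, the model ends with si = 14, cx = 0, and al holding the byte at offset 13, i.e. [cx + si - 1] in terms of the starting values.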
vpcmpestri xmm2, xmm3, BYTEWISE_CMP
test cx, 0x10 ; if(rcx != 16)
I see this test/cmp all the time after the instruction and I don't understand it. pcmpestri will set ZF if edx < 16, and it will set SF if eax < 16. It is already giving you the necessary status. Also, testing sub-words of the larger register is very slow and is a pipeline hazard. You've got this monster of an instruction and then people place all this paranoid slowness around it. Am I reading the x86 manual wrong?
But on any modern CPU there should be essentially no penalty for doing that now. Testing the full register is basically free as long as you aren't doing a partial write followed by a full read (write AH then read AX), and I don't think there's any case where this could stall on anything newer than a Core 2 era processor. But just replacing that with a "jnc" or whatever exactly you're trying to test for would be fewer instructions at least. I'd love to see benchmarks though if someone has dug deeper into this than I have.
But yeah, it may not make a real impact yet anyway.
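As a sanity check on the "the flags already tell you" argument, here is a small C model of the index/carry part of pcmpestri. It is a sketch based on my reading of the manual, assuming byte elements and least-significant-index output: ECX is the index of the lowest set bit of IntRes2, or 16 if no bit is set, and CF is set when IntRes2 is non-zero. The struct and function names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result type: the index register and carry flag after pcmpestri. */
typedef struct {
    uint32_t ecx;
    bool     cf;
} pcmpestri_result;

/* Model: ECX = index of the lowest set bit in IntRes2 (16 if none),
 * CF = (IntRes2 != 0). Byte elements, least-significant index assumed. */
pcmpestri_result pcmpestri_index_model(uint16_t intres2)
{
    pcmpestri_result r = { 16u, intres2 != 0 };
    for (uint32_t i = 0; i < 16; ++i) {
        if (intres2 & (1u << i)) {
            r.ecx = i;
            break;
        }
    }
    return r;
}

/* ZF as it would be left by `test cx, 0x10`: set when bit 4 of cx is clear. */
bool zf_after_test_cx_0x10(uint32_t ecx)
{
    return ((ecx & 0xFFFFu) & 0x10u) == 0;
}
```

Under this model, ZF after the `test` is set exactly when CF was already set by pcmpestri, for every possible IntRes2, so the `test` adds no information and branching on carry directly covers the same case.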