Lessons learned from profiling an algorithm in Rust
153 points
5 days ago
| 9 comments
| blog.mapotofu.org
| HN
pcwalton
5 days ago
[-]
I'm guessing that f32::clone showing up in the profile isn't actually a call to f32::clone, because you have optimizations on (if it actually is a call to a "movd xmm0,dword ptr [rdi]; ret" instruction pair, that's a bug in the compiler). Rather it's the result of the compiler choosing to attribute seemingly-random lines to f32::clone, because when lines from multiple functions are fused into one instruction the compiler will just pick one, and it happened to pick f32::clone to write into the debug info. You really want to look at instruction-level profiling when you're profiling at that level instead of the individual functions, since debug info is going to be very unreliable.
reply
eftychis
5 days ago
[-]
Seconded. This could have been essentially anything and everything else bunched together. Or we have a compiler or debug symbol bug in our hands.
reply
urcyanide
4 days ago
[-]
I also see the profiling shows that the "iter::next" takes a large percentage in the flamegraph. Are they the same reason?
reply
wrs
5 days ago
[-]
I’m a Rust newbie, wondering how f32::clone could show up in a profile. Wouldn’t that be an inline no-op under any kind of optimization? I mean, cloning a float is, at worst, a MOV instruction, no?
reply
skavi
4 days ago
[-]
Profiles aren’t always entirely accurate. That’s kinda fine, since (IMO) they’re mainly useful for pointing you in the correct direction.

Once you’ve found a small enough chunk of logic that’s worth optimizing, I’d recommend relying more on benchmarks and disassembly.

reply
MiguelX413
5 days ago
[-]
Floats aren't stored in the same kinds of registers.
reply
epcoa
4 days ago
[-]
How would that make any difference to what is being discussed?
reply
achierius
4 days ago
[-]
fmov is an instruction -> not a no-op
reply
epcoa
4 days ago
[-]
The article is doing profiling on x86-64. There is no FMOV. Floating point on x86-64 is done in the SSE or AVX (xmm, ymm) registers loaded with MOVD/Q which has basically the same addressing modes as any of the usual GPR loads/moves. In any case in optimized code, something like a scalar clone is inlined and would not be a consistent accountable and localizable operation (and may even be optimized away).
reply
wrs
3 days ago
[-]
My point is, whatever it is, it can't possibly be a function call, so how would it ever legitimately show in a profile? I like pcwalton's theory.

The practical implication is that trying to optimize "calls to f32::clone" because they're in a profile is a wild goose chase -- unless you have debug anti-optimization turned on, there shouldn't be any calls to f32::clone.

reply
andrewaylett
5 days ago
[-]
That's really interesting -- I do enjoy a good optimisation.

I was looking at one of the diffs, and thinking at a sufficiently advanced compiler should be able to generate the same efficient code for both -- and indeed it does, if you turn the optimiser on: https://godbolt.org/z/hjP5qjabz

  - let shift = if (i / 32) % 2 == 0 { 32 } else { 0 };
  + let shift = ((i >> 5) & 1) << 5;
reply
NovaX
5 days ago
[-]
I'm confused because isn't the bitwise version the inverted logic? If the LSB is 1 then it is an odd value, which should be zero, yet that is shifted to become 32. The original modulus is for an even value becoming 32. Shouldn't the original code or compiler invert it first? I'd expect

    let shift = ((~(i >> 5) & 1) << 5);
EDIT: The compiler uses "vpandn" with the conditional version and "vpand" with the bitwise version. The difference is it includes a bitwise logical NOT operation on the first source operand. It looks like the compiler and I are correct, the author's bitwise version is inverted, and the incorrect code was merged in the author's commit. Also, I think this could be reduced to just (~i & 32).
reply
andrewaylett
4 days ago
[-]
Even more interesting :).

From my days as a junior member of a team developing a compiler and run-time libraries, I really like the approach we took there: if the compiler generated sub-optimal code for a straightforward implementation, we'd aim to fix the compiler instead of tweaking the code. That's more difficult if you're not already maintaining your own compiler, of course. And algorithmic improvements are still valuable.

In this case, the optimiser already generated efficient code. Makes me wonder if any observed speed-up might have been because the incorrect code needed to do less work?

reply
urcyanide
4 days ago
[-]
Thanks for pointing out. I have updated the post. I use the opposite shift since I store the binary in u64 with a different endian from the C++ version. Sorry for the confusion, my bad.
reply
NovaX
4 days ago
[-]
Endian indicates the byte order (MSB->LSB, LSB->MSB), but does not change the representation of a bit. The conditional logic says that an even result is 32 and an odd is 0 after a modulus two. In binary that (x & 1) is 0 for even and 1 for odd. When you shift by 5 that is 0 for even and 32 for odd, which is the opposite of the conditional logic. This is why I suggested using a binary NOT to flip the bits so that you get the same result as the original.
reply
urcyanide
5 days ago
[-]
Thanks for pointing out, this can be optimized by the compiler when enabling opt-level=3
reply
carlmr
5 days ago
[-]
Great writeup with easy to understand steps. One thing it's lacking though is in the conclusion. I'd like to see a comparison to the C++ implementation.
reply
skavi
4 days ago
[-]
https://parallel-rust-cpp.github.io/

This goes through seven iterations of optimization an algorithm in rust, comparing it to the equivalent c++ at each stage.

reply
efnx
5 days ago
[-]
Yes, exactly. How close does it come after all those optimisations?
reply
urcyanide
5 days ago
[-]
It has the same QPS as the C++ version for GIST dataset. While Rust has more SIMD, C++ has const generic. I guess there is still some space for future improvement.
reply
carlmr
4 days ago
[-]
Oh nice, the article already changed.

I can't quite imagine it's the exact same though in all metrics. Some graphs of measurement statistics would be even better.

reply
mwkaufma
5 days ago
[-]
I don't understand why half of these aren't optimized by the compiler automatically. (x - y).norm_squared()? Why is f32::clone() not just an inline mov? Begging a lot of questions.
reply
JackYoustra
5 days ago
[-]
I've previously had problems with the compiler not inlining / eliding instructions solely due to profiling code (see a blog post: https://www.jackyoustra.com/blog/llama-ios#-bug-bug-slowdown...). I wonder if it's that?

(I've also always had a sneaking suspicion I did something wrong in my example, so if anyone knows let me know)

reply
vlovich123
5 days ago
[-]
Pcwalton’s explanation is much more likely to be correct https://news.ycombinator.com/context?id=41830704

Profiling native code with optimizations on is very very tricky.

reply
mwkaufma
4 days ago
[-]
What's the point of profiling for performance without optimizations if it leads you where the OP winds up hand rolling the same transformations?
reply
JackYoustra
4 days ago
[-]
Usually a terrible idea (imo) to profile without optimizations, I think the point is profiling with optimizations is tricky (without much better alternative).
reply
JackYoustra
4 days ago
[-]
Agree, seems like a more likely culprit in this case.
reply
agentultra
4 days ago
[-]
> embedded Rust (lots of FFI & unsafe)

How much? And did the parts in safe Rust make up/protect the unsafe parts?

I’d be concerned that the reason there are fewer errors is because of the experience the team already had with the existing system. Porting or rewriting it would allow them to avoid many of the errors that were already fixed in the C implementation and errors they knew about ahead of time… assuming there’s more unsafe/FFI than safe.

reply
larsrc
4 days ago
[-]
Interesting read. How much speedup did you get in total? Steps of improvements don't always add up directly.
reply
urcyanide
4 days ago
[-]
The end2end QPS is about 3x of the 1st Rust version on both SIFT & GIST datasets.
reply
fithisux
1 day ago
[-]
The fact that C++ does not force one to use/learn SIMD is a big advantage for C++ (not a fan of C++, but a fan of D).

What I would like to know personally is how complex is each code base.

Complex is nebulous so some metrics like

line counts or other code quality metrics would help.

reply
chrismorgan
4 days ago
[-]

  - let shift = if (i / 32) % 2 == 0 { 32 } else { 0 };
  + let shift = ((i >> 5) & 1) << 5;
Uh, that’s just `let shift = i & 32;`, right? Much easier to read, too, in my opinion.

(Edit: fixed from i % 32 to i & 32.)

reply
NovaX
4 days ago
[-]
I believe it is (!i & 32) because the bitwise version is an incorrect rewrite.

[1] https://news.ycombinator.com/item?id=41830016

reply
urcyanide
4 days ago
[-]
Sorry for the confusing example. The bitwise one is correct since I store the binary in u64 with a different endian from the C++ version. (this happens because the C++ version is using a numpy script to do the preprocessing) My bad, I should explain it in a better way. Will update the post.
reply
durumu
4 days ago
[-]
I think it's i & 32. Agreed on that being more readable.
reply
urcyanide
4 days ago
[-]
Yes, i&32 is more readable.
reply
urcyanide
4 days ago
[-]
Not really, i \in (0..dim).step_by(32), shift \in {0, 32}, but for `i % 32` it's 0..32
reply
chrismorgan
4 days ago
[-]
Sorry, I wrote `i % 32` by accident, I meant `i & 32`.
reply