For example:
$ julia
julia> function f(n)
           total = 0
           for x in 1:n
               total += x
           end
           return total
       end
julia> @code_native f(10)
...
sub x9, x0, #2
mul x10, x8, x9
umulh x8, x8, x9
extr x8, x8, x10, #1
add x8, x8, x0, lsl #1
sub x0, x8, #1
ret
...
It shows this with nice colors right in the REPL.

In the example above, you see that LLVM figured out the arithmetic series and replaced the loop with a simple multiplication.
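For comparison, a minimal C version of the same loop (the function name is my own); clang and GCC at -O2 typically do the same strength reduction and emit the closed form n*(n+1)/2 with no loop:

    #include <stdint.h>

    /* The optimizer recognizes the arithmetic series and replaces
       the loop with a multiply and a shift. */
    uint64_t sum_to(uint64_t n) {
        uint64_t total = 0;
        for (uint64_t x = 1; x <= n; x++)
            total += x;
        return total;
    }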
Though when your data is behind pointers, it's very easy to write code that the compiler can no longer figure out how to optimize.
So if you use a messy solution, where something that should be a struct operated on by functions is instead just a pile of local variables inside a single function, and you use macros on those locals instead of inlineable functions on structs, you can get massively better performance.
e.g.
/* slower */
struct foo { uint32_t a,b,c,d,e,f,g,h; };

uint32_t do_thing(struct foo *foo) {
    return foo->a ^ foo->b ^ foo->c ^ foo->d;
}

void blah() {
    struct foo x;
    for (...) {
        x.e = do_thing(&x) ^ x.f;
        ...
    }
}

/* faster */
#define DO_THING (a^b^c^d)

void blah() {
    uint32_t a,b,c,d,e,f,g,h;
    for (...) {
        e = DO_THING ^ f;
        ...
    }
}

https://aoco.compiler-explorer.com/#g:!((g:!((g:!((h:codeEdi...
The ability to turn stack-allocated variables into locals (which can then be put in registers) is one of the most important passes of modern compilers.

Since compilers use SSA, where locals are immutable, while lots of languages like C have mutable variables, some compiler frontends put every local onto the stack and let the compiler figure out what can be promoted into registers and how.
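A minimal sketch of what that looks like from the C side (the function is my own; in LLVM the relevant passes are called mem2reg/SROA):

    /* At -O0, clang gives every mutable local its own stack slot
       (an LLVM `alloca`) with explicit loads and stores around
       each use. The mem2reg/SROA passes then promote those slots
       to SSA values, so at -O1 and above `tmp` never touches
       memory at all. */
    int swap_sum(int a, int b) {
        int tmp = a;   /* -O0: store into a stack slot */
        a = b;         /* -O0: loads and stores through slots */
        b = tmp;       /* -O1+: everything lives in registers */
        return a + b;
    }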
I work with CUDA kernels a lot for computer vision. I am able to consistently and significantly improve on the performance of research code without any fancy tricks, just with good software engineering practices.
By organising variables into structs, improving naming, using helper functions, etc... the previously impenetrable code becomes so much clearer and the obvious optimisations reveal themselves.
Not to say there aren't certain tricks / patterns / gotchas / low level hardware realities to keep in mind, of course.
Like with people in general, it depends on what compiler/interpreter we're talking about. I'll freely grant that clang is smarter than me, but CPython for sure isn't. :)
More generally, canonicalization goes very far, but no farther than language semantics allows. Not even the notorious "sufficiently smart compiler" with infinite time can figure out what you don't tell it.
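Aliasing is a classic case of that (a minimal sketch; the function name is mine): without `restrict`, C's semantics force the compiler to assume the two pointers might overlap, which blocks vectorization it could otherwise do trivially.

    #include <stddef.h>

    /* `restrict` tells the compiler what it cannot infer on its
       own: dst and src never overlap, so the loop is safe to
       vectorize. Drop the qualifier and it must stay conservative. */
    void scale(float *restrict dst, const float *restrict src, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = 2.0f * src[i];
    }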
...I don't know... for instance the MSVC compiler creates this output for the last two 'non-trivial' functions with '/Ox':
add w8,w1,w0
cmp w0,#0
cseleq w0,w1,w8
Even beginner assembly coders on their first day wouldn't write such bullshit :)

A better mindset is "don't trust the compiler for code that's actually performance sensitive".
You shouldn't validate each line of compiler output, but at least for the 'hot areas' in the code base it definitely pays off, because sometimes compilers do really weird shit for no good reason (often because of 'interference' between unrelated optimizer passes), and often you don't need to dig deep to stumble over weird output like in the example above.
You can mostly not think about super low level integer manipulation stuff though.
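For example (the function name is mine), integer division by a constant: compilers replace it with a multiply by a fixed-point "magic" reciprocal plus a shift, so hand-rolling that buys you nothing.

    #include <stdint.h>

    /* At -O2 no division instruction is emitted; the compiler
       multiplies by a precomputed reciprocal constant instead. */
    uint32_t div3(uint32_t x) {
        return x / 3;
    }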
I agree that most people are not writing hand-tuned AVR8 assembly. Most people aren't attempting to do DSP on 8-bit AVRs either.
Pareto principle, like always: you don't need the best, just good enough.

Not every company is Google-level anyway.
unsigned add(unsigned x, unsigned y) {
    unsigned a, b;
    do {
        a = x & y;
        b = x ^ y;
        x = a << 1;
        y = b;
    } while (a);
    return b;
}
becomes (with armv8-a clang 21.1.0 -O3):

add(unsigned int, unsigned int):
.LBB0_1:
ands w8, w0, w1
eor w1, w0, w1
lsl w0, w8, #1
b.ne .LBB0_1
mov w0, w1
ret

Anything HPC will benefit from thinking about how things map onto hardware (or, in the case of SQL, onto data structures).
I think way too few people use profilers. If your code is slow, profiling is the first tool you should reach for. Unfortunately, the state of profiling tools outside of Nsight and Visual Studio (non-Code) is pretty disappointing.
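e.g. on Linux, perf gets you a long way with two commands (the binary name is illustrative):

    $ perf record -g ./my_app
    $ perf report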
See "Example 2: Tricking the compiler" in my blog post about O3 sometimes being slower than O2: https://barish.me/blog/cpp-o3-slower/
It's super cool to see this in practice, and for me it helps to put more trust in the compiler doing the right thing, rather than trying to micro-optimize my code and peppering inline qualifiers everywhere.
E.g. if in `main` you called two different add functions, couldn't it optimize one of them away completely?
It probably shouldn't do that if you create a dynamic library that needs a symbol table but for an ELF binary it could, no? Why doesn't it do that?
> Perform Identical Code Folding for functions (-fipa-icf-functions), read-only variables (-fipa-icf-variables), or both (-fipa-icf). The optimization reduces code size and may disturb unwind stacks by replacing a function by an equivalent one with a different name. The optimization works more effectively with link-time optimization enabled.
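For example, requesting it explicitly together with LTO (file names are illustrative):

    $ gcc -O2 -flto -fipa-icf main.c -o main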
In addition, the Gold linker supports a similar feature via `--icf={safe,all}`:
> Identical Code Folding. '--icf=safe' Folds ctors, dtors and functions whose pointers are definitely not taken
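An example invocation (file name illustrative; gold folds at section granularity, so per-function sections help):

    $ gcc -O2 -ffunction-sections -fuse-ld=gold -Wl,--icf=safe main.c -o main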
If you declare them as static, it eliminates the functions and the calls completely: https://aoco.compiler-explorer.com/z/soPqe7eYx
I'm sure it could also perform definition merging like you suggest but I can't think of a way of triggering it at the moment without also triggering their complete elision.
It can't do that because the program might load a dynamic library that depends on the function (it's perfectly OK for a `.so` to depend on a function from the main executable, for example).
That's one of the reasons why a very cheap optimization is to always use `static` for functions when you can. You're telling the compiler that the function doesn't need to be visible outside the current compilation unit, so the compiler is free to even inline it completely and never produce an actual callable function, if appropriate.
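A minimal sketch of the difference (the names are mine):

    /* Internal linkage: the compiler may inline this at every call
       site and emit no standalone definition at all. */
    static unsigned add_static(unsigned x, unsigned y) { return x + y; }

    /* External linkage: a dynamically loaded library may resolve
       this symbol from the executable, so a callable definition is
       normally kept. */
    unsigned add_extern(unsigned x, unsigned y) { return x + y; }

    int main(void) {
        return (int)add_static(add_extern(1, 2), 3);
    }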
That makes perfect sense, thank you!
And I just realized why I was mistaken. I am using fasm with `format ELF64 executable` to create an ELF file. Looking at it in a hex editor, I see it has no sections or symbol table, because fasm creates a completely stripped binary.
Learned something :)
I get it though, because carefully structuring your #includes to get a single translation unit is messy, and compile times get too long.
A quick google suggests it's called "identical comdat folding" https://devblogs.microsoft.com/oldnewthing/20161024-00/?p=94...
void go_forward(Closure *clo, Closure *cont, Closure *forward) {
    GC_CHECK(clo, cont, forward);
    ((Fun0)(forward->fun))(forward, cont);
}

void go_left(Closure *clo, Closure *cont, Closure *left, Closure *right) {
    GC_CHECK(clo, cont, left, right);
    ((Fun0)(left->fun))(left, cont);
}

void go_right(Closure *clo, Closure *cont, Closure *left, Closure *right) {
    GC_CHECK(clo, cont, left, right);
    ((Fun0)(right->fun))(right, cont);
}

GcInfo gc_info[] = {
    { .fun = (GenericFun)&go_forward, .envc = 0, .argc = 1 },
    { .fun = (GenericFun)&go_left,    .envc = 0, .argc = 2 },
    { .fun = (GenericFun)&go_right,   .envc = 0, .argc = 2 },
};
Since the pointers to go_forward and go_left will be the same, the gc_info table is less useful than it would be otherwise.

You absolutely can fool a lot of compilers out there! And I am not only looking at you, NVCC.
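One hedged workaround sketch (entirely my own, reusing the declarations above; behavior varies by toolchain): give each function a unique, observable side effect so the folder can no longer prove the bodies identical. Note also that gold's `--icf=safe` is documented to skip functions whose addresses are taken, as gc_info takes them here.

    /* Unique volatile accesses make the generated code of otherwise
       identical functions differ, which defeats ICF. */
    static volatile int keep_left, keep_right;

    void go_left(Closure *clo, Closure *cont, Closure *left, Closure *right) {
        keep_left += 1;   /* unique volatile access */
        GC_CHECK(clo, cont, left, right);
        ((Fun0)(left->fun))(left, cont);
    }

    void go_right(Closure *clo, Closure *cont, Closure *left, Closure *right) {
        keep_right += 1;  /* unique volatile access */
        GC_CHECK(clo, cont, left, right);
        ((Fun0)(right->fun))(right, cont);
    }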