Imagine all projects were similarly committed.
But to anyone complaining, I want to know, when was the last time you pulled out a profiler? When was the last time you saw anyone use a profiler?
People asking for performance aren't pissed you didn't write Microsoft Word in assembly; we're pissed it takes 10 seconds to open a fucking text editor.
I literally timed it on my M2 Air. 8s to open and another 1s to get a blank document. Meanwhile it took (neo)vim 0.1s and it's so fast I can't click my stopwatch fast enough to properly time it. And I'm not going to bother checking because the race isn't even close.
I'm (we're) not pissed that the code isn't optimal, I'm pissed because it's slower than dialup. So take that Knuth quote you love about optimization and do what he actually suggested. Grab a fucking profiler, it is more important than your Big O.
The enterprising hacker then wrote a simple binary patch that reduced the startup time from 5-10 minutes to like 15 seconds or something.
To me that's profound. It implies that not only was management not concerned about the start up time, but none of the developers of the project ever used a profiler. You could just glance at a flamegraph of it, see that it was a single enormous plateau of a function that should honestly be pretty fast, and anyone with an ounce of curiosity would be like, ".........wait a minute, that's weird." And then the bug would be fixed in less time than it would take to convince management that it was worth prioritizing.
It disturbs me to think that this is the kind of world we live in. Where people lack such basic curiosity. The problem wasn't that optimization was hard, (optimization can be extremely hard) it was just because nobody gave a shit and nobody was even remotely curious about bad performance. They just accepted bad performance as if that's just the way the world is.
[0] Oh god it was 4 years ago: https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...
How is it that these companies spend millions of dollars to develop games and yet modders are making patches in a few hours, fixing bugs, and those fixes never get merged? Not some indie game, but AAA rated games!
I think you're right, it's on both management and the programmers. Management only knows how to rush but not what to rush. The programmers fall for the trap (afraid to push back) and never pull up a profiler. Maybe overworked and overstressed, but those problems never get solved if no one speaks up and everyone is quiet and buys into the rush-for-rushing's-sake mentality.
It's amazing how many problems could be avoided by pulling up a profiler or analysis tool (like Valgrind).
It's amazing how many millions of dollars are lost because no one ever used a profiler or analysis tool.
I'll never understand how their love for money makes them waste so much of it.
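For anyone who hasn't pulled one up before, here is a minimal sketch of what that looks like, using Python's built-in cProfile. The functions `load`, `slow_dedupe`, and `fast_dedupe` are made up for illustration and are not taken from any project mentioned in this thread; the only point is that the accidentally quadratic function lands at the top of the report.

```python
import cProfile
import io
import pstats


def slow_dedupe(items):
    """Remove duplicates with a linear scan per item (quadratic overall)."""
    seen = []
    for item in items:
        if item not in seen:  # scans the whole list every time
            seen.append(item)
    return seen


def fast_dedupe(items):
    """Same result, but dict keys give constant-time membership checks."""
    return list(dict.fromkeys(items))


def load(items):
    """Hypothetical startup step that happens to call the slow version."""
    return slow_dedupe(items)


# Profile the hypothetical startup step.
profiler = cProfile.Profile()
profiler.enable()
result = load(list(range(3000)))
profiler.disable()

# Render the five most expensive entries (by cumulative time) into a string.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Nearly all of the time in the report sits inside `slow_dedupe`, which is the kind of plateau that should make anyone curious. Swapping in `fast_dedupe` gives the same output without the repeated scans.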
> by the desire to
An appropriate choice of words. I'm just wondering if/when anyone will realize that often desire gets in the way of achieving. They may be penny wise but they're pound foolish.
To be fair, this is because they mostly care about serving ads. Without the ads, the pages are often fine.
People argue "sure, it's not optimal, but it's good enough". But that compounds. A little slower each time. A little slower each application. You test on your VM only running your program.
But all of this forgets what makes software so powerful AND profitable: scale. Since we always need to talk monetary value, let's do that. Shaving off a second isn't much if it's one person or one time but even with a thousand users that's over 15 minutes, per usage. I mean we're talking about a world where American Airlines talks about saving $40k/yr by removing an olive and we don't want to provide that same, or more(!), value to our customers? Let's say your employee costs $100k/yr and they use that program once a day. That's 260 seconds or just under 5 minutes. Nothing, right? A measly $4. But say you have a million users. Now that's $4 million!
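The arithmetic above is easy to check with a quick sketch. The 8-hour workday is an assumption (the comment only gives the salary and, implicitly, 260 working days a year). With it, the per-employee figure comes out nearer $3.50 than $4, which is the same order of magnitude.

```python
SALARY_PER_YEAR = 100_000   # dollars, from the comment
WORKDAYS_PER_YEAR = 260     # one use of the program per working day
HOURS_PER_WORKDAY = 8       # assumption, not stated in the comment

# What one second of an employee's paid time costs.
paid_seconds_per_year = WORKDAYS_PER_YEAR * HOURS_PER_WORKDAY * 3600
cost_per_second = SALARY_PER_YEAR / paid_seconds_per_year

# One wasted second per working day, for one user and for a million users.
seconds_lost_per_user = 1 * WORKDAYS_PER_YEAR
cost_per_user = seconds_lost_per_user * cost_per_second
cost_for_million_users = cost_per_user * 1_000_000

# One second across a thousand users, expressed in minutes.
thousand_users_minutes = 1000 * 1 / 60

print(f"seconds lost per user per year: {seconds_lost_per_user}")
print(f"cost per user per year: ${cost_per_user:.2f}")
print(f"cost for a million users: ${cost_for_million_users:,.0f}")
print(f"one second x 1000 users: {thousand_users_minutes:.1f} minutes")
```

Counting fully loaded employee cost instead of bare salary would push these numbers higher.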
Now, play a fun game with me. Just go about your day as normal but pay attention to all those little speedbumps. Count them as $1m/s and let me know what you got. We're being pretty conservative here as your employee costs a lot more than their salary (2-3x) and we're ignoring slowdown being disruptive and breaking flow. But I'm willing to bet in a typical day you'll get on the order of hundreds of millions ($100m is <2 minutes).
We solve big problems by breaking them into a bunch of smaller problems, so don't forget that those small problems add up. It's true even if you don't know what big problem you're solving.
Users want to load and edit PDFs. Finnish has been rendering right to left for months, but the easy fix will break Hebrew. The engineers say a new rendering engine is critical or these things will just get worse. Sales team says they’re blocked on a significant contract because the version tracking system allows unaudited “clear history” operations. Reddit is going berserk because the icon you used (and paid for!) for the new “illuminated text mode” turns out to be stolen from a Lithuanian sports team.
Knowing that most of your users only start the app when their OS forces a reboot… just how much priority does startup time get?
Fuck this "we don't need to optimize" bullshit. Fuck this "minimum viable product" bullshit. It's just a race to the bottom. No one paper cut is the cause of death, but all of them are when you have a thousand.
Then you agree with the poster. Performance critical software should focus on performance.
Any concrete examples where we can see the code?
Programming is a small piece of a larger context. What makes a program "good" is not a property of the program itself, but measured by external ends and constraints. This is true of all technology. Some of these constraints are resources, and one of these resources is time. In fact, the very same limitation on time that motivates the prioritization of development effort toward some features other than performance is the very same limitation that motivates the desire for performance in the first place.
Performance must be understood globally. Let's say we need a result in three days, and it takes two days to write a program that takes one day to get the result, but a week to write a program that takes a second to produce a result, then obviously, it is better to write the program the first way. In a week's time, your fast program will no longer be needed! The value of the result will have expired.
This is effectively a matter of opportunity cost.
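Spelled out as a tiny sketch, with the numbers from the example above (the function name `days_until_result` is just for illustration):

```python
SECONDS_PER_DAY = 86_400
DEADLINE_DAYS = 3  # the result is needed in three days


def days_until_result(dev_days, run_days):
    """Total wall-clock time from starting work to holding the result."""
    return dev_days + run_days


# Two days to write, one day to run.
slow_program = days_until_result(dev_days=2, run_days=1)
# A week to write, one second to run.
fast_program = days_until_result(dev_days=7, run_days=1 / SECONDS_PER_DAY)

slow_meets_deadline = slow_program <= DEADLINE_DAYS
fast_meets_deadline = fast_program <= DEADLINE_DAYS

print(f"slow program: {slow_program} days, in time: {slow_meets_deadline}")
print(f"fast program: {fast_program:.5f} days, in time: {fast_meets_deadline}")
```

The picture flips as soon as the program has to run many times, because the development time is paid once and the run time is paid on every use.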
Sadly lots of software is blatantly wasteful. But it doesn't take fancy assembly micro optimization to fix it, the problem is typically much higher level than that. It's more like serialized network requests, unnecessarily high time complexities, just lots of unnecessary work and unnecessary waiting.
Once you have that stuff solved you can start looking at lower level optimization, but by that point most apps are already nice and snappy so there's no reason to optimize further.
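As a rough illustration of the serialized-requests case, here is a sketch that fakes network latency with `asyncio.sleep`. The `fake_request` function and the 50 ms latency are stand-ins, not a real API; the shape of the fix (issue independent requests together instead of one after another) is what matters.

```python
import asyncio
import time

LATENCY_S = 0.05  # pretend round-trip time per request (assumption)


async def fake_request(i):
    """Stand-in for a network call: wait, then return a value."""
    await asyncio.sleep(LATENCY_S)
    return i * 2


async def fetch_serially(n):
    """Each request waits for the previous one: total is about n * latency."""
    return [await fake_request(i) for i in range(n)]


async def fetch_concurrently(n):
    """All requests are in flight at once: total is about one latency."""
    return list(await asyncio.gather(*(fake_request(i) for i in range(n))))


def timed(coro_fn, n):
    """Run a coroutine function and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(n))
    return result, time.perf_counter() - start


serial_result, serial_s = timed(fetch_serially, 5)
concurrent_result, concurrent_s = timed(fetch_concurrently, 5)

print(f"serial:     {serial_s:.3f}s")
print(f"concurrent: {concurrent_s:.3f}s")
```

No assembly involved, and the results are identical; only the waiting changes.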
ffmpeg was, however, always the best open-source project, basically because it had all the smart developers who were capable of collaborating on anything. Its competition either wasn't smart enough and got lost in useless architecture-astronauting[2], or were too contrarian and refused to believe their encoder quality could get better because they designed it based on artificial PSNR benchmarks instead of actually watching the output.
[0] For complicated reasons I don't fully understand myself, audio encoders don't get quality improvements by sharing code or developers the way decoders do. Basically because they use something called "psychoacoustic models" which are always designed for the specific codec instead of generalized. It might just be that no one's invented a way to do it yet.
[1] I eventually fixed this by writing a new multithreading system, but it took me ~2 years of working off summer of code grants, because this was before there was much commercial interest in it.
[2] This seems to happen whenever I see anyone try to write anything in C++. They just spend all day figuring out how to connect things to other things and never write the part that does anything?
> They just spend all day figuring out how to connect things to other things and never write the part that does anything?
I see a lot of people write software like this regardless of language. Like their job is to glue pieces of code together from stack overflow. Spending more time looking for the right code that kinda sorta works than it would take to write the code which will just work.
https://x.com/FFmpeg/status/1775178803129602500
https://x.com/FFmpeg/status/1856078171017281691
https://x.com/FFmpeg/status/1950227075576823817
Oh, and here's one making fun of HN comments. Hi ffmpeg :) https://x.com/FFmpeg/status/1947076489880486131
Once the competition fails, the value extraction process can begin. This is where the toxicity of our city begins to manifest. Once there is no competition remaining we can begin eating seeds as a pastime activity.
The toxicity of our city; our city. How do you own the world? Disorder.
Disorder…
They publish doxygen generated documentation for the APIs, available here: https://ffmpeg.org/doxygen/trunk/
* To be more precise, these are bindings for the libav* libraries that underlie ffmpeg
The few chapters I saw seemed to be pretty generic intro to assembly language type stuff.
Would it ever make sense to write handwritten compiler intermediate representation like LLVM IR instead of architecture-specific assembly?
> Would it ever make sense to write handwritten compiler intermediate representation like LLVM IR instead of architecture-specific assembly?
IME, not really. I've done a fair bit of hand-written assembly and it exclusively comes up when dealing with architecture-specific problems - for everything else you can just write C (unless you hit one of the edge cases where C semantics don't allow you to express something in C, but those are rare).
For example: C and C++ compilers are really, really good at writing optimized code in general. Where they tend to be worse are things like vectorized code which requires you to redesign algorithms such that they can use fast vector instructions, and even then, you'll have to resort to compiler intrinsics to use the instructions at all, and even then, compiler intrinsics can lead to some bad codegen. So your code winds up being non-portable, looks like assembly, and has some overhead just because of what the compiler emits (and can't optimize). So you wind up just writing it in asm anyway, and get smarter about things the compiler worries about like register allocation and out-of-order instructions.
But the real problem once you get into this domain is that you simply cannot tell at a glance whether hand written assembly is "better" (insert your metric for "better" here) than what the compiler emits. You must measure and benchmark, and those benchmarks have to be meaningful.
perf is included with the Linux kernel, and works with a fair number of architectures (including Arm).
The factors are something like:
- specialization: there's already a decent plain-C implementation of the loop, asm/SIMD versions are added on for specific hardware platforms. And different platforms have different SIMD features, so it's hard to generalize them.
- predictability: users have different compiler versions, so even if there is a good one out there not everyone is going to use it.
- optimization difficulties: C's memory model specifically makes optimization difficult here because video is `char *` and `char *` aliases everything. Also, the two kinds of features compilers add for this (intrinsics and autovectorization) can fight each other and make things worse than nothing.
- taste: you could imagine a better portable language for writing SIMD in, but C isn't it. And on Intel C with intrinsics definitely isn't it, because their stuff was invented by Microsoft, who were famous for having absolutely no aesthetic taste in anything. The assembly is /more/ readable than C would be because it'd all be function calls with names like `_mm_movemask_epi8`.
Not really. There are a couple of reasons to reach for handwritten assembly, and in every case, IR is just not the right choice:
If your goal is to ensure vector code, your first choice is to try slapping explicit vectorize-me pragmas onto the loop. If that fails, your next effort is either to use generic or arch-specific vector intrinsics (or jump to something like ISPC, a language for writing SIMT-like vector code). You don't really gain anything in this use case from jumping to IR, since the intrinsics will satisfy your code.
If your goal is to work around compiler suboptimality in register allocation or instruction selection... well, trying to write it in IR gives the compiler a very high likelihood of simply recanonicalizing the exact sequence you wrote to the same sequence the original code would have produced for no actual difference in code. Compiler IR doesn't add anything to the code; it just creates an extra layer that uses an unstable and harder-to-use interface for writing code. To produce the best handwritten version of assembly in these cases, you have to go straight to writing the assembly you wanted anyways.
You could invent a DSL for writing the kernels in… but they did, it's x86inc.asm. I agree ispc is close to something that could work.
On startup, it runs cpuid and assigns each operation the best available function pointer for that architecture.
In addition to things like ‘supports avx’ or ‘supports sse4’ some operations even have more explicit checks like ‘is a fifth generation celeron’. The level of optimization in that case was optimizing around the cache architecture on the cpu iirc.
Source: I did some dirty things with chromes native client and ffmpeg 10 years ago.
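The general pattern (pick an implementation once at startup, then call through a pointer) can be sketched in a few lines. This is Python for brevity and is not ffmpeg's code: the feature names are passed in by hand instead of coming from cpuid, and `add_generic`, `add_wide`, `CANDIDATES`, and `select_add` are all hypothetical.

```python
def add_generic(a, b):
    """Portable fallback: one element at a time."""
    return [x + y for x, y in zip(a, b)]


def add_wide(a, b):
    """Stand-in for a SIMD version: walks the input in chunks of four."""
    out = []
    for i in range(0, len(a), 4):
        out.extend(x + y for x, y in zip(a[i:i + 4], b[i:i + 4]))
    return out


# Specialized candidates, best first, keyed by the feature each one needs.
CANDIDATES = [("avx2", add_wide)]


def select_add(features):
    """Return the first implementation whose required feature is present."""
    for required, impl in CANDIDATES:
        if required in features:
            return impl
    return add_generic


# Done once at startup; real code would fill this set from cpuid.
detected_features = frozenset({"sse4", "avx2"})
add = select_add(detected_features)

print(add.__name__, add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
```

Every later call goes through `add` with no further feature checks, which is what makes the per-call cost of the dispatch negligible.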
https://github.com/FFmpeg/FFmpeg/blob/master/libavutil/x86/x...
It's glorious.