In this situation the C programmers can either a) accept that they're programming in a language that exists as it exists, not as they'd like it to exist; b) angrily deny a); or c) switch to some other system-level language with defined semantics.
I suspect it also depends on who exactly the compiler writers are; the GCC and LLVM folks seem to include more theoreticians/academics and thus think of the language more abstractly, leading to interpretations of UB that are truly inexplicable and divorced from any actual machine, while MSVC and ICC are more on the practical side and their interpretation of it is, as the standard says, "in a documented manner characteristic of the environment". IMHO the "spirit of C" and the more commonsense approach is definitely the latter, and K&R themselves have always leaned in that direction. This is very much a "letter of the law vs. spirit of the law" argument. The fact that these two sides have produced compilers with nearly the same performance characteristics shows, IMHO, that the claim that exploiting UB is mandatory for performance is a debunked myth.
If not, then, like ... sure, C compiler maintainers are people who program in C, but they're not "C programmers" as the phrase was intended (people who develop non-compiler software in C).
My hunch is that that statement is overwhelmingly true if measured by influence of a given C compiler/implementation stack (because GCC/LLVM/MSVC take up a huge slice of the market, and their maintainers are in many cases paid specialists who don't do significant work on other projects), but untrue if measured by count of people who have worked on C compilers (because there are a huge number of small-market-share/niche compilers out there, often maintained by groups who develop those compilers for a specific, often closed-source, platform/SoC/whatever).
[0] https://blog.regehr.org/archives/1287
> In contrast, we want old code to just keep working, with latent bugs remaining latent.
Well, just keep compiling it with the old compilers. "But we'd like to use new compilers for some 'free' gains!" Well, sucks, you can't. "But we have to use new compilers because the old ones just plain don't work on the newer systems!" Well, that sucks too, and this here is why "technical debt" is called "debt": you've managed to put off paying it until now, and the repo man is here knocking at your door.
I mostly work in compiled languages now, but started in interpreted/runtime languages.
When I made that switch, it was baffling to me that the compiled-language folks don't do compatibility-breaking changes more often during big language/compiler revision updates.
Compiled code isn't like runtime code--you can build it (in many cases bit-deterministically!) on any compiler version and it stays built! There's no risk of a toolchain upgrade preventing your software from running, just compiling.
After having gone through the browser compatibility trenches and the Python 2->3 wars, I have no idea why your proposal isn't implemented more often: old compiler/language versions get critical/bugfix updates where practical, new versions get new features and aggressively deprecate old ones. For example: "you want some combination of {the latest optimizations, loongarch support, C++-style attributes, #embed directives, auto vector zero-init}? Great! Those are only available on the new revision of the compiler where -Werror is the default and only behavior. Don't want those? The old version will still get bugfixes."
Don't get me wrong, backwards compatibility is golden...when it comes to making software run. But I think it's a mistake that back compat is taken even further when it comes to compilers, rather than the reverse. I get that there are immense volumes of C/C++ out there, but I don't get why new features/semantics/optimizations aren't rolled out more aggressively (well, I do--maintainers of some of those immense volumes are on language steering committees and don't want to spin up projects to modernize their codebases--but I'm mad about it).
"Just use an old compiler" seems like such a gimme--especially in the modern era of containers etc. where making old toolchains available is easier than ever. I get that it feels bad and accumulates paper cuts, but it is so much easier to deploy compiled code written on an old revision on a new system than it is to deploy interpreted/managed code.
(There are a few cases where compilers need to be careful there--thinking about e.g. ELF format extensions and how to compile code with consideration for more aggressive linker optimizations that might be developed in the future--but they're the minority.)
I know it’s not pleasant per se, but the level of support needed (easier now with docker and better toolchain version management utils than were the norm previously) surely doesn’t merit compilers carrying around the volume of legacy cruft and breaking-change aversion they do, no?
Contrast this with Linus' famous "we do not break userspace" rant which is the polar opposite of the gcc devs "we love to break your code to show how much cleverererer than you we are". Just for reference the exact quote, https://lkml.org/lkml/2012/12/23/75, is:
And you *still* haven't learnt the first rule of kernel maintenance? If a change results in user programs breaking, it's a bug in the kernel. We never EVER blame the user programs. How hard can this be to understand? ... WE DO NOT BREAK USERSPACE!
Ah, Happy Fun Linus. Can you imagine the gcc devs ever saying "if we break your code it's a problem with gcc" or "we never blame the user"? This really seems to be a gcc-specific problem. It doesn't affect other compilers like MSVC, Diab, IAR, Green Hills; it's only gcc and, to a lesser extent, clang. Admittedly this is from a rather small sample, but the big difference between those two sets that jumps out is that the first one is commercial, with responsibilities to customers, and the second one isn't.
I think that GCC has changed a bit in recent years, but I am also not sure that an optimizing compiler can have the same policy as the kernel. For the kernel, it is about keeping APIs stable, which is realistic, but an optimizing compiler inherently relies on some semantic interpretation of the program code, and if there is a mismatch that causes something to break, it is often difficult to fix. Also, many issues were not caused by someone suddenly deciding "let's now exploit this UB we haven't exploited before"; the compiler always relied on it, but an improved optimization now affects more or different programs. This creates a difficult situation, because it is not clear how to fix it without rolling back an improvement you spent a lot of time on and others paid for. Don't get me wrong, I agree they went too far in the past in exploiting UB, but I do think this is less of a problem looking forward, and there is also generally more concern about the impact on safety and security now.
I think a lot of the UB, though, isn't "let's exploit UB"; it's "we didn't even know we had UB in the code". An example is two's-complement arithmetic, which the C language has finally acknowledged more than half a century after the last non-two's-complement machine was built (was the CDC 6600 the last ones'-complement machine? Were most of the gcc devs even born when that was released?). So everyone on earth has been under the perfectly sensible notion that their computer uses two's-complement maths, while the gcc (and clang) devs know that signed overflow is actually UB, and that it allows them to do whatever they want with your code when they encounter it.
If you don't please your users, you won't have any users.
By any metric, C++ is one of the most successful programming languages devised by mankind, if not the most successful.
What point were you trying to make?
I think claiming that C++ is successful because of the unintuitive-behavior-causing compiler behaviors/parts of the spec is an extraordinary claim--if that's what you mean, then I disagree. TFA discusses that many of the most pernicious UB-causing optimizations yield paltry performance gains.
Back in the 80s, I was looking for a way to enhance my C compiler. I looked at Objective-C and C++. There was a newsgroup for each, and each had about the same amount of traffic. I had to pick one.
Objective-C required a license to implement it. I asked AT&T if I needed a license to implement C++, and could I call it C++. AT&T's lawyer laughed and said feel free to do whatever you want.
So that decided it for me. At the time, C++ did not exist on the PC other than the awkward, nearly unusable cfront (which translated C++ to C). At the time, 90% of programming was done on the PC.
I implemented it. It was the first native C++ compiler for the PC. (It is arguable that it was the first native C++ compiler, depending on whether a gcc beta is considered a release.)
The usage of it exploded. The newsgroup traffic for C++ zoomed upwards, and Objective-C interest fell away. C++ built critical mass because of Zortech C++.
Borland dropped their plans for an OOP language and went for Turbo C++. Microsoft also had a secret OOP C language called C*, which was also abandoned in favor of implementing C++.
And the rest is history!
P.S. cfront on the PC was unusable because it was 1) incredibly slow and 2) did not support near/far pointers which was required for the mixed PC memory models.
P.P.S. Bjarne Stroustrup never mentioned any of this in his book "The Design and Evolution of C++".
Nowadays, UB means something completely different - if at any point in time, the compiler reasons out that a piece of code is only reachable via UB, it will assume that this can never happen, and will quietly delete everything downstream:
As in, everything down from UB is only working by an accident of implementation that does not need to hold, and you should explicitly not rely on that. Whether the compiler happens to explicitly make it not ever work or just leaves it to fate should not be relevant.
UB just meant "the spec doesn't define what happens". It didn't use to mean "the compiler can just decide to do any wild thing if your program touches UB anywhere at any time". Hell, with the modern definition UB can apparently time travel: you don't even need to execute UB code for it to start doing weird shit in some cases.
UB went from "whatever happens when your compiler/hardware runs this is what happens" to "Once a program contains UB the compiler doesn't need to conform to the rest of the spec anymore."
>UB just meant "the spec doesn't define what happens"
What comes to mind is that then the written code is operating on a subspec, one that is probably undocumented and maybe even unintended by the specifics of that version and platform.
It sounds like it could create a ton of issues, from code that can’t be ported to difficulty in other person grokking the undocumented behavior that is being used.
In this regard, as someone that could potentially inherit this code I’d actually want the compiler to stop this potential behavior. Am I missing something? Is the spec not functional enough on its own to rely just on that?
    int handle_untrusted_numbers(int a, int b) {
        if (a < 0) return ERROR_EXPECTED_NON_NEGATIVE;
        if (b < 0) return ERROR_EXPECTED_NON_NEGATIVE;
        int sum = a + b;
        if (sum < 0) {
            return ERROR_INTEGER_OVERFLOW;
        }
        return do_something_important_with(sum);
    }
Every computer you will ever use has two's complement for signed integers, and the standard recently recognized and codified this fact. However, the UB fanatics (heretics) insisted that not allowing signed overflow is an important opportunity for optimizations, so that last if-statement can be deleted by the compiler and your code quietly doesn't check for overflow any more. There are plenty more examples, but I think this is one of the simplest.
It's some of the most user-hostile behavior I've ever encountered in an application.
The three behaviours relevant in this discussion, from section 3.4:
3.4.1 implementation-defined behavior
unspecified behavior where each implementation documents how the choice is made
EXAMPLE An example of implementation-defined behavior is the propagation of the high-order bit when a signed integer is shifted right.
3.4.3 undefined behavior
behavior, upon use of a nonportable or erroneous program construct or of erroneous data, for which this International Standard imposes no requirements
Possible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message), to terminating a translation or execution (with the issuance of a diagnostic message).
An example of undefined behavior is the behavior on integer overflow.
3.4.4 unspecified behavior
behavior where this International Standard provides two or more possibilities and imposes no further requirements on which is chosen in any instance
An example of unspecified behavior is the order in which the arguments to a function are evaluated.
K&R also seems to mention "undefined" and "implementation-defined" behaviour on several occasions. It doesn't specify what is meant by undefined behaviour, but it does indeed seem to be "whatever happens, happens" instead of "you can do whatever you want." But ISO C99 seems to be a lot looser with its definition. Using integer overflow, as in your example, for optimization has been shown to be beneficial by Chandler Carruth in a talk he did at CppCon in 2016.[1] I think it would be best to have something similar to Zig's wrapping and saturating addition operators instead, but for that I think it is better to just use Zig (which I personally am very willing to do once it reaches 1.0 and other compiler implementations are available).[2]
[1] https://youtu.be/yG1OZ69H_-o?si=x-9ALB8JGn5Qdjx_&t=2357 [2] https://ziglang.org/documentation/0.15.2/#Operators
It's also worth noting that even with the current very liberal handling of UB, the actual code sample in [1] was still missing this optimization; so it's not like liberal UB handling automatically led to faster code; understanding of the compiler was still needed.
The question is one of risk: if the compiler is conservative, you're risking slightly less optimized code. If the compiler is very liberal and assumes UB never happens, you're risking that it will wipe out your overflow check like in my godbolt (I've seen actual CVEs due to that, although I don't remember the project).
What every compiler writer should know about programmers (2015) [pdf] - https://news.ycombinator.com/item?id=19659555 - April 2019 (62 comments)
What every compiler writer should know about programmers [pdf] - https://news.ycombinator.com/item?id=11219874 - March 2016 (106 comments)
https://www.yodaiken.com/2021/05/19/undefined-behavior-in-c-...
And here's a cautionary tale of how a compiler writer doing whatever they wish once they encounter undefined behavior makes debugging intractable:
https://www.quora.com/What-is-the-most-subtle-bug-you-have-h...
By their own admission, the compiler warns about the UB. "-Wanal"¹, as some call it, makes it an error. Under UBSan the program aborts with:

    code.cpp:4:6: runtime error: execution reached the end of a value-returning function without returning a value

… "intractable"?

¹ a humorous name for -Wextra -Wall -Werror
The -Werror flag is not used religiously even for building, e.g., the Linux kernel, and -Wextra can introduce a lot of extraneous garbage.
This will often make it easier (though still difficult) to winnow the program down to a smaller example, as that person did, rather than to enable everything and spend weeks debugging stuff that isn't the actual problem.
Yeah, I know it breaks the common illusion among the C programmers that they're "close to the bare metal", but illusions should be dispelled, not indulged. The C programmers program for the abstract C machine, which is then mediated by the C compilers into machine code in the way the implementers of C compilers have publicly documented.
Moreover, compiler authors don't just go out maliciously trying to ruin programs through finding more and more torturous undefined behavior for fun: the vast majority of undefined behavior in C are things that if a compiler wasn't able to assume were upheld by the programmer would inhibit trivial optimizations that the programmer also expects the compiler to be able to do.
That is to say, I find "could not happen" the most bizarre reading to make when optimizing around undefined behavior. "Whatever the machine does" makes sense, as does "we don't know". But "could not happen"? If it could not happen, the spec would have said "could not happen"; instead, the spec does not know what will happen and so punts on the outcome, knowing full well that it will happen all the time.
The problem is that there is no optimization to make around "whatever the hardware does" or "we have no clue" so the incentive is to choose the worst possible reading "undefined behavior is incorrect code and therefore a correct program will never have it".
I would imagine that the standard writers choose one or the other depending on whether the behavior is useful for optimizations. There's also the matter that if a behavior is currently undefined, it's easy to later on make it unspecified or specified, while if a behavior is unspecified it's more difficult to make it undefined, because you don't know how much code is depending on that behavior.
It's practically impossible to find a program without UB.
I mean, if you're going to argue that a compiler can do anything with any UB, then by all means make that argument.
Otherwise, then no, I don't think it's reasonable for a compiler to cause an infinite loop inside a function simply because that function itself doesn't return a value.
https://www.quora.com/What-is-the-most-subtle-bug-you-have-h...
The problem was that the loop itself was altered, rather than that the function returned and then that somehow caused an infinite loop.
> I'm not aware of any compiler that does that, but it's something I could see happening, and the developers would have no reason to "fix" it, because it's perfectly up to spec.
This is where we disagree.
https://people.csail.mit.edu/nickolai/papers/wang-stack.pdf
I submit that that's a small fraction of UB, that much of it would exist at any optimization level.
It's actually a much more torturous reading to say "if any line in the program contains undefined behavior (such as the example given in the standard, integer overflow), then it's OK for the compiler to treat the entire program as garbage and create any behavior whatsoever in the executable."
Which is exactly what had been claimed, that he was addressing.
Sure, but it's unlikely it's an intentional choice to cause an infinite loop simply because your boolean function didn't return a boolean.
But also note that there is an ongoing effort to remove UB from the standard. We have already eliminated about 30% of the UB in the core language for the upcoming version, C2Y.
Honestly, I do not think the problem in C is so big that one needs to jump ship. There are real issues, yes, but there are also plenty of good tools and strategies to deal with UB; it is not really an issue for me.
The only dead code is code generated by macros.
(1) "dead" meaning unused types, unreachable branches
    if condition that is "always" false:
        abort with message detailing the circumstances

That `if` is "dead", in the sense that the condition is always false. But "dead" sometimes is just a proof — or, if I'm not rigorous enough, an assumption — in my head. If the compiler can prove the same proof I have in my head, then the dead code is eliminated. If it can't, well, presumably it is left in the binary, either to never be executed, or to be executed in the case that the proof in my head is wrong.

1. you drop down to assembly.
2. you use functions that are purpose built to be sequence points the optimizer won't optimize through. E.g., in Rust, for the case you mention, `read_volatile`.
In either case, this gives the human the same benefit the code is giving the optimizer: an explicit indication that this code that might appear to be doing nothing isn't.
There are even languages with mandatory else branch.
My point is that it is easy to say "don't remove my code" while looking at a simple single-function example, but in actual compilation huge portions of a function are "dead" after inlining, constant propagation and other optimizations, before even getting to C-specific UB or other shenanigans. You don't want to throw that out.
On the one hand, having the optimizer save you from your own bad code is a huge draw, this is my desperate hope with SQL, I can write garbage queries and the optimizer will save me from myself.
But... Someone put that code there, spent time and effort to get that machinery into place with the expectation that it is doing something. and when the optimizer takes that away with no hint. That does not feel right either. Especially when the program now behaves differently when "optimized" vs unoptimized.
    int factorial(int x) {
        if (x < 0) throw invalid_input();
        // compute factorial ...
    }
This doesn't have any dead code in a static examination; at compilation time, however, this function may be compiled multiple times, e.g., as factorial(5) or as factorial(x) where x is known to be non-negative by range analysis. In this case, the `if (x < 0)` is simply pruned away as "dead code", and you definitely want this! It's not a minor thing, it's a core component of an optimizing compiler. This same pruning is also responsible for the objectionable pruning away of dead code in the examples of compilers working at cross-purposes to programmers, but it's not easy to have the former behavior without the latter, and that's also why something like -Wdead-code is hard to implement in a way which wouldn't give constant false positives.
I'm talking about the optimizer, not the linker, which thankfully does a lot of pruning.
It's very common for inline functions in headers to be written for inlining and constant propagation from arguments result in dead code and better generated code. There is even __builtin_constant_p() to help with such things (e.g., you can use it to have a fast folded inline variant if an argument is constant, or call big out of line library code if variable).
There are also configuration systems that end up with config options in headers that code tests with if (CONFIG_BLAH) {...} that can evaluate to zero in valid builds.
However, they're not in widespread use. I would be curious to learn if there's any data/non-anecdotal information as to why. Is it momentum/inertia of GCC/LLVM/MSVC? Are the alternative compilers incomplete and unable to compile a lot of practical programs (belying the "relatively simple program" claim)? Or is the performance differential due to optimizations really so significant that ordinary programs like e.g. vim or libjpeg or VLC have significant degradations when built with an alternative compiler?
I stopped reading at the abstract; garbage rant full of contradictions.