Everything in C is undefined behavior
279 points | 6 hours ago | 51 comments | blog.habets.se | HN
muvlon
3 hours ago
Yes there is tons of surprising and weird UB in C, but this article doesn't do a great job of showcasing it. It barely scratches the surface.

Here's a way weirder example:

  volatile int x = 5;
  printf("%d in hex is 0x%x.\n", x, x);
This is totally fine if x is just an int, but the volatile makes it UB. Why? 5.1.2.4.1 says any volatile access - including just reading it - is a side effect. 6.5.1.2 says that unsequenced side effects on the same scalar object (in this case, x) are UB. 6.5.3.3.8 tells us that the evaluations of function arguments are unsequenced w.r.t. each other.

So in common parlance, a "data race" is any pair of concurrent accesses to the same object from different threads, at least one of which is a write. In C, we can have a data race on a single thread and without any writes!

reply
thomashabets2
1 hour ago
Author here.

> It barely scratches the surface.

I agree. The point of the post is not to enumerate and explain the implications of all 283 uses of the word "undefined" in the standard. Nor enumerate all the things that are undefined by omission.

The point of the post is to say it's not possible to avoid them. Or at least, no human since the invention of C in 1972 has.

And if it's not succeeded for 54 years, "try harder", or "just never make a mistake", is at least not the solution.

The (one!) exploitable flaw found by Mythos in OpenBSD was an impressive endorsement of the OpenBSD developers, and yet as the post says, I pointed it at the simplest of their code and found a heap of UB.

Now, is it exploitable that `find` also reads the uninitialized auto variable `status` (UB) from a `waitpid(&status)` before checking if `waitpid()` returned an error? (not reported) I can't imagine an architecture or compiler where it would be, no.
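
A minimal sketch of the ordering that avoids the uninitialized read, assuming a POSIX system. `child_exit_code` is a hypothetical helper, not code from OpenBSD's `find`; the only point is that `status` is inspected after `waitpid()` has reported success:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical helper: returns true and sets *exit_code only if the
 * child was reaped and exited normally. */
bool child_exit_code(pid_t pid, int *exit_code) {
    assert(exit_code != NULL);
    int status;
    if (waitpid(pid, &status, 0) == -1) {
        return false;   /* status was never written, so do not read it */
    }
    if (!WIFEXITED(status)) {
        return false;   /* killed by a signal, stopped, etc. */
    }
    *exit_code = WEXITSTATUS(status);
    return true;
}
```

This sketch does not retry on `EINTR`; a real tool would probably want to.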

FTA:

> The following is not an attempt at enumerating all the UB in the world. It’s merely making the case that UB is everywhere, and if nobody can do it right, how is it even fair to blame the programmer? My point is that ALL nontrivial C and C++ code has UB.

reply
muvlon
1 hour ago
Fair enough!

> And if it's not succeeded for 54 years, "try harder", or "just never make a mistake", is at least not the solution.

And I 100% agree. UB is way overused by these standards for how dangerous it is, and as a consequence using C (and C++) for anything nontrivial amounts to navigating a minefield.

reply
saagarjha
1 hour ago
What should the behavior above be defined to do?
reply
Filligree
23 minutes ago
Print x twice. Not all “side effects” care about order.

Better yet, define an order for parameter evaluation.

reply
echoangle
26 minutes ago
Couldn’t you just define that function arguments are evaluated left to right?

Or just throw an error.

reply
saagarjha
4 minutes ago
I meant reading the uninitialized variable
reply
jeffffff
42 minutes ago
Compilation error
reply
saagarjha
30 minutes ago
It’s hard to detect all UB at compile time
reply
Demiurge
6 minutes ago
It’s harder depending on the language, which is clearly the point.
reply
lll-o-lll
46 minutes ago
HCF
reply
saagarjha
31 minutes ago
I have good news about what UB allows
reply
HarHarVeryFunny
9 minutes ago
> In C, we can have a data race on a single thread and without any writes!

Well, sure, that's what volatile means - that the value may be changed by something else. If it's a global variable then the something else might be an interrupt or signal handler, not just another thread. If it's a pointer to something (i.e. read from a specific address) then that could be a hardware device register whose value is changing.

The concept of a volatile variable isn't the problem - any language that is going to support writing interrupt routines and memory mapped I/O needs to have some way of telling the compiler "don't optimize this out" since reading from the same hardware device register twice isn't like reading from the same memory location twice.

I think the problem here is more that not all of the interactions between language features and restrictions have been fully thought out. It's pretty stupid to be able to explicitly tell the language "this value can change at any time", and for it to still consider certain uses of that value as UB since it can change at any time! There should have been a carve-out in the "unsequenced side effect" definitions for volatile variables.

reply
tialaramex
1 hour ago
Volatile is a type system hack. They should have done a more principled fix, and certainly modern languages should not act as though "C did it" makes it a good idea.

The reason for the hack is that very early C compilers just always spill, so you can write MMIO driver code by setting a pointer to point at the MMIO hardware and it actually works because every time you change x the CPU instruction performs a memory write.

Once C compilers got some basic optimisations, that obvious "clever" trick stopped working, because the compiler can see that we're just modifying x over and over and over, and so it doesn't spill x from a register and the driver doesn't work properly. C's "volatile" keyword is a hack saying "OK compiler, forget that optimisation", which was presumably a few minutes' work to implement, whereas the correct fix, providing MMIO intrinsics in the associated library, was a lot of work.

Why should you want intrinsics here? Intrinsics let you actually spell out what's possible and what isn't. On some targets we can actually do a 1-byte, 2-byte, or 4-byte write. Those are distinct operations and the hardware knows, so e.g. maybe some device expects a 4-byte RGBA write, and if you emit four 1-byte writes that's very confusing and maybe it doesn't work, so don't do that. On some targets bit-level writes are available: you can say OK, MMIO write to bit 4 of address 0x1234, and it will write a single bit. If you only have volatile there's no way to know what happens or what it means.
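
A rough sketch of what width-explicit accessors could look like. The names `mmio_write32`, `mmio_read32` and `mmio_write8` are hypothetical, and they are still built on `volatile` casts underneath rather than on true compiler intrinsics, so this only illustrates the interface, where the access width is spelled out at each call site:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical accessors: the width is part of the name, so a 4-byte
 * register write cannot silently be written as four 1-byte stores
 * in the source. */
static inline void mmio_write32(volatile void *addr, uint32_t value) {
    assert(addr != NULL);
    *(volatile uint32_t *)addr = value;
}

static inline uint32_t mmio_read32(const volatile void *addr) {
    assert(addr != NULL);
    return *(const volatile uint32_t *)addr;
}

static inline void mmio_write8(volatile void *addr, uint8_t value) {
    assert(addr != NULL);
    *(volatile uint8_t *)addr = value;
}
```

On real hardware `addr` would be a suitably aligned device address taken from the datasheet. ISO C does not itself promise that each call becomes exactly one bus access of that width, although mainstream compilers generally emit one for an aligned volatile scalar.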

reply
rcxdude
1 hour ago
Yeah, it's also cleaner to be able to mark particular reads and writes as having side effects as opposed to having it be a property of the variable.
reply
saagarjha
1 hour ago
> The reason for the hack is that very early C compilers just always spill, so you can write MMIO driver code by setting a pointer to point at the MMIO hardware and it actually works because every time you change x the CPU instruction performs a memory write.

Source?

reply
mananaysiempre
2 hours ago
And it makes sense as long as you allow the concept of unsequenced operations at all (admittedly it’s somewhat rare; e.g. in Scheme such things are defined to still occur in sequence, but which specific sequence is unspecified and potentially different each time). The “volatile” annotation marks your variable as being an MMIO register or something of that nature, something that could change at any point for reasons outside of the compiler’s control. Naturally, this means all of the hazards of concurrent modification are potentially there.

That said, your “common parlance” definition of “data race” is not the definition used by the C standard, so your last sentence is at best misleading in a discussion of standard C.

> The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.

(Here “conflicting” and “happens before” are defined in the preceding text.)

reply
tsimionescu
1 hour ago
Your first paragraph makes it sound as if the compiler will actually generate two reads of the value of some register, which might lead to unexpected effects at runtime for certain special registers.

However, this is not at all what UB means in C (or C++). The compiler is free to optimize away the entire block of code where this printf() sequence occurs, by the logic that it would be UB if the program were to ever reach it.

For example, the following program:

  int y = rand();
  if (y != 8) {
    volatile int x;
    printf("%d: %d", x, x);
  } else {
    printf("y is 8");
  }
Can be optimized to always print "y is 8" by a perfectly standard-compliant compiler.
reply
mananaysiempre
28 minutes ago
> Your first paragraph makes it sound as if the compiler will actually generate two reads of the value of some register, which might lead to unexpected effects at runtime for certain special registers.

I don’t see how. I was trying to explain why it’s reasonable for a volatile read to be a side effect, after which the C rule on unsequenced side effects applies, yielding UB as you say.

reply
shakna
1 hour ago
"volatile" tells the compiler it is _not_ safe to optimise away any read or write, so it can't just optimise that section away at all.

> An object that has volatile-qualified type may be modified in ways unknown to the implementation or have other unknown side effects. Therefore any expression referring to such an object shall be evaluated strictly according to the rules of the abstract machine, as described in 5.1.2.3. Furthermore, at every sequence point the value last stored in the object shall agree with that prescribed by the abstract machine, except as modified by the unknown factors mentioned previously.

A compliant compiler is only free to optimise code away where it can determine there are no side-effects. But 5.1.2.3 has this to say about volatile:

> Accessing a volatile object, modifying an object, modifying a file, or calling a function that does any of those operations are all side effects.

reply
rcxdude
1 hour ago
Yes, but undefined behaviour is undefined behaviour, and that behaviour can legally be that the code is not emitted at all, volatile (or any other side effect) or not. (Compilers do reason about undefined behaviour when optimising, so this isn't necessarily a completely theoretical argument, though I don't know which of 'don't optimise volatile' or 'do assume undefined behaviour is impossible and remove code that definitely invokes it' would 'win' in a compiler's actual logic, or whether there's any current compiler that would flag this as unconditionally undefined behaviour in the first place.)
reply
shakna
58 minutes ago
Volatile wins.

GCC calls that out [0] - volatile means things in memory may not be what they appear to be, and that there are asynchronous things happening, so something that may not appear to be possible, may become so, because volatile is a side-effect.

So about the only optimisation allowed to happen, is combining multiple references.

Clang is similar:

> The compiler does not optimize out any accesses to variables declared volatile. The number of volatile reads and writes will be exactly as they appear in the C/C++ code, no more and no less and in the same order.

[0] https://www.gnu.org/software/c-intro-and-ref/manual/html_nod...

reply
poizan42
23 minutes ago
That's cool and all if you are writing GCC or Clang dialect C, but it doesn't change the fact that it is UB in the C standard.
reply
u8080
55 minutes ago
When a compiler decides something is UB, aka "the result of this code is not defined and could be anything", it selects the most performant version of undefined behavior: doing nothing, by optimizing the code away.
reply
shakna
43 minutes ago
The compiler is not free to remove accesses to something marked volatile - it's defined as a side-effect.

Volatile means something else may be acting here. Something else may install anything into the register at any time - and every time you access.

The compiler is required to preserve the order of accesses. In almost every C compiler, today, there are almost no optimisations the moment a volatile is introduced, for this reason.

reply
tsimionescu
28 minutes ago
If code has undefined behavior, the entire execution path that leads to that UB has no assigned semantics in the C model. So there are no volatile accesses in this code according to the C abstract machine - the entire execution path is UB, so it can be assumed it doesn't happen at all.
reply
saagarjha
1 hour ago
Sure it can. That code path has unconditional UB and thus it is not valid.
reply
shakna
1 hour ago
Only if there would be no side-effects. Which there are.
reply
saagarjha
31 minutes ago
No this is irrelevant for making this decision
reply
shakna
22 minutes ago
I've mentioned elsewhere the standards, and compilers as well, disagreeing with you here.

But feel free to run it against the various compilers through godbolt. [0] They won't optimise the branch away. Accesses to a volatile must be preserved, in the order in which they appear. No optimisation, UB or otherwise, is allowed to impede that, because an access is a side-effect.

[0] https://godbolt.org/z/85cGhq3Ta

reply
saagarjha
6 minutes ago
That they won’t is at most a courtesy to you; they are not required to do this.
reply
nilamo
20 minutes ago
This looks like a long back and forth, that can easily be solved by a minute or two on godbolt...
reply
aw1621107
7 minutes ago
> that can easily be solved by a minute or two on godbolt...

Unfortunately it's not that simple when it comes to UB. If the snippet in question does in fact exhibit UB then there's no guarantee whatever Godbolt shows will generalize to other programs/versions/compilers/environments/etc.

reply
nilamo
28 seconds ago
That's very funny to me.

A) x is always removed.

B) no, it's never removed if volatile.

But neither person can prove what a compiler will actually do, despite claiming they'll always act a certain way given 5 lines of code.

reply
saagarjha
5 minutes ago
No, compilers will often choose to not optimize on UB.
reply
simonask
3 hours ago
I think the article's point is that you don't actually have to get weird at all to run into UB.

Lots of people mistakenly think that C and C++ are "really flexible" because they let you do "what you want". The truth of the matter is that almost every fancy, powerful thing you think you can do is an absolute minefield of UB.

reply
kzrdude
1 hour ago
My go-to example of "UB is everywhere" is this one:

    int increment(int x) {
        return x + 1;
    }
Which is UB for certain values of x.
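
For contrast, a minimal sketch of an increment that is defined for every input. `checked_increment` is a hypothetical name; the only value of `x` for which `x + 1` overflows is `INT_MAX`, so that case is rejected before the addition happens:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: stores x + 1 in *out and returns true, or
 * returns false (leaving *out untouched) if x + 1 would overflow. */
bool checked_increment(int x, int *out) {
    assert(out != NULL);
    if (x == INT_MAX) {
        return false;   /* x + 1 would be signed overflow, which is UB */
    }
    *out = x + 1;
    return true;
}
```

The price is a changed signature: the caller now has to decide what to do when the function returns false.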
reply
CodeArtisan
1 hour ago
C23 removed the whole stuff about indeterminate value and trap representation. Underflow/overflow being silent or not is implementation defined.
reply
saagarjha
1 hour ago
Signed overflow is just undefined.
reply
jstimpfle
2 hours ago
I would agree that C is "really flexible", but I would say it's primarily flexible because it lets you cast say from a void pointer to a typed pointer without requiring much boilerplate. It's also flexible because it lets you control memory layout and resource management patterns quite closely.

If you want to be standards correct, yes you have to know the standard well. True. And you can always slip, and learn another gotcha. Also true. But it's still extremely flexible.

reply
crote
1 hour ago
The problem is that a lot of the flexibility introduced by UB doesn't serve the developer.

Take signed integer overflow, for example. Making it UB might've made sense in the 1970s when PDP-1 owners would've started a fight over having to do an expensive check on every single addition. But it's 2026 now. Everyone settled on two's complement, and with speculative execution the check is basically free anyways. Leaving it UB serves no practical purpose, other than letting the compiler developer skip having to add a check for obscure weird legacy architectures. Literally all it does is serve as a footgun allowing over-eager optimizations to blow up your program.

Although often a source of bugs, C's low-level memory management is indeed a great source of flexibility with lots of useful applications. It's all the other weird little UB things which are the problem. As the article title already states: writing C means you are constantly making use of UB without even realizing it - and that's a problem.

reply
ablob
1 hour ago
If we're talking two's complement, then it's true that it's not undefined. Having to emit checks, though, is where I beg to differ. A check is only useful if you want to actually change the behavior when it happens; otherwise it is useless. Furthermore, it might be "essentially free" from a branch prediction point of view, but lo and behold, caches exist. You would pollute both the instruction cache with those instructions _and_ the branch prediction cache. So it doesn't follow at all that there is no cost.

In the end small things do add up, and if you're adding many little things "because it doesn't cost much nowadays" you will end up with slow software and not have one specific bottleneck to look at. I do agree that having the option for checked operations is nice (see C#), but I have needed this behavior (branching on overflow) exactly once so far.
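
Opt-in checked arithmetic of that kind is available in C today as a compiler extension. The sketch below assumes GCC or Clang, which provide `__builtin_add_overflow` (C23 standardises a similar `ckd_add` in `<stdckdint.h>`); the wrapper name `add_checked` is hypothetical:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical wrapper around the GCC/Clang builtin. Returns true and
 * stores a + b in *sum when the exact result fits in an int; returns
 * false when it does not. */
bool add_checked(int a, int b, int *sum) {
    assert(sum != NULL);
    /* The builtin itself returns true when the result overflowed. */
    return !__builtin_add_overflow(a, b, sum);
}
```

Only the call sites that use the wrapper pay for the check, so the cost stays where the programmer asked for it.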

reply
saagarjha
1 hour ago
Signed overflow checks are typically not free, unfortunately; they have a cost of about 5% or thereabouts.
reply
simonask
1 hour ago
It's not flexible in practice, because knowing the standard isn't optional. If you make the choice to not follow the standard, you're making the choice to write fundamentally broken software. Sometimes with catastrophic consequences.
reply
jstimpfle
1 hour ago
I'm making the choice to pass pointers as void to get low-friction polymorphism. I'm making the choice to control the memory layout of my data structures, including the levels and types of indirection. I'm making the choice to control my own memory allocators and closely control lifetimes, closely control (almost) everything that happens in the system.

That has nothing to do with not following the standard.

reply
saagarjha
1 hour ago
Be that as it may, you’re not following the standard.
reply
3form
2 hours ago
At which point it feels like some sort of high-level assembly-like language, which is simple enough to compile efficiently and stay cross-platform, with some primitives for calls, jumps, etc., could find a nice niche.

Maybe this already exists, even? A stripped down version of C? A more advanced LLVM IR? I feel like this is a problem that could use a resolution, just maybe not with enough of a scale for anyone to bother, vs. learning C, assembly of given architecture, or one of the new and fancy compiled languages.

reply
simonask
1 hour ago
Well, Zig is aiming to be a "saner C", and mostly succeeding so far. I hope they make it to production.

Rust is a somewhat more thorough attempt to actually course-correct.

reply
berti
1 hour ago
As an example of a possible side effect here, reading a register from a microcontroller peripheral may well reset it, and that's exactly the kind of thing you use volatile for.
reply
sethev
2 hours ago
Yes, there is a data race there. The value of a volatile can be changed by something outside the current thread. That’s what volatile means and why it exists.

Edit: thread=thread of execution. I’m not making a point about thread safety within a program.

reply
mananaysiempre
2 hours ago
Not from the standard’s point of view. The traditional (in some circles) use of volatile for atomic variables was not sanctioned by the C11/C++11 thread model; if you want an atomic, write atomic, not volatile, or be aware of your dependency on a compiler (like MSVC) that explicitly amends the language definition so as to allow cross-thread access to volatile variables.
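
A minimal sketch of "write atomic, not volatile" using C11 `<stdatomic.h>`. The names `ready`, `payload`, `publish` and `try_consume` are hypothetical; the release store paired with the acquire load is what lets a consumer thread read `payload` without a data race:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

static atomic_bool ready = false;   /* the flag is atomic, not volatile */
static int payload;                 /* plain data guarded by the flag */

/* Producer side: write the data, then publish it with a release store. */
void publish(int value) {
    payload = value;
    atomic_store_explicit(&ready, true, memory_order_release);
}

/* Consumer side: an acquire load that observes true also makes the
 * earlier write to payload visible. */
bool try_consume(int *out) {
    assert(out != NULL);
    if (!atomic_load_explicit(&ready, memory_order_acquire)) {
        return false;
    }
    *out = payload;
    return true;
}
```

The sketch assumes a single `publish` call; in a real program the two functions would run on different threads.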
reply
sethev
2 hours ago
Thread was a poor choice of word. Outside the control of the program is a better way to put it. Like memory mapped io.
reply
trissylegs
1 hour ago
It can also represent a register where the read itself has an effect. Reading a memory-mapped register can have side effects. For example, a memory-mapped I/O read on a UART will fetch the next byte to be read.
reply
frollogaston
44 minutes ago
Was going to say the same thing until I saw this comment. volatile is defined the way I'd expect, plus it's a strange code example.
reply
jstimpfle
2 hours ago
Not sure why you're being downvoted. That's completely right. The example is silly. The code is obviously bad, doesn't matter if it's UB or not.

I'm also not convinced (yet) that the example really is UB: I agree reading a volatile is "a side effect" in some sense, and GP cited a paragraph that says just that. But GP doesn't clearly quote that it's a side effect on the object (or how a side effect on an object is defined). Reading an object doesn't mutate it after all.

But whatever the language lawyers think, the code is obviously broken, with an obvious fix, so I'm not so interested in what its semantics should be. Here is the fix:

    volatile int x;
    // ...
    int val = x;  // volatile read
    printf("%x %d\n", val, val);
reply
crote
1 hour ago
The problem is that the function call as a whole is UB. Having the original example compile to the equivalent of

  volatile int x;
  int a = x;
  int b = x;
  printf("%x %d\n", a, b);
is just as valid as

  volatile int x;
  int a = x;
  int b = x;
  printf("%x %d\n", b, a);
, and neither needs to have the same output as your proposed fix.

C could've specified something like "arguments are evaluated left-to-right" or "if two arguments have the same expression, the expression is [only evaluated once]/[always evaluated twice]". But it didn't, so the developer is left gingerly navigating a minefield every time they use volatile.
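
Until the language specifies an order, the programmer can impose one. A minimal sketch with hypothetical helper names: each read sits in its own statement, so the two volatile accesses are sequenced in source order and exactly two reads take place:

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: performs exactly two reads of *reg, in order. */
void read_twice(const volatile int *reg, int *first, int *second) {
    assert(reg != NULL && first != NULL && second != NULL);
    *first = *reg;    /* the end of this statement is a sequence point */
    *second = *reg;   /* so this read is sequenced after the one above */
}

/* Prints the first read in decimal and the second in hex. */
void print_decimal_and_hex(const volatile int *reg) {
    int dec, hex;
    read_twice(reg, &dec, &hex);
    printf("%d in hex is 0x%x.\n", dec, (unsigned)hex);
}
```

If the register can change between the reads, the two printed values may still differ; the difference is that this is now ordinary, defined behaviour.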

reply
indigo945
1 hour ago
Not only is "arguments are evaluated left-to-right" less easy to formalize than you think, it would also make all C code run slower, because the compiler would no longer be able to interleave computations for more efficient pipelining. The same goes for "expression is [only evaluated once]/[always evaluated twice]".

Of course the developer is navigating a minefield every time they use volatile, that's why it's called "volatile" - an English word otherwise only commonly used in chemistry, where it means "stuff that wants to go boom".

reply
imtringued
37 minutes ago
Your argument makes no sense since the developer is expected to perform manual sequencing. Correctly written UB free code cannot be interleaved either.

All you've achieved is that the standard C function call syntax can no longer be used as is.

reply
RobotToaster
1 hour ago
With volatile it could be changed by an interrupt service routine between reads, so it makes sense.
reply
rramadass
1 hour ago
This has got nothing to do with data races etc. but everything to do with "Sequence Points and Single Update Rule" which is well described in C language specification.

See my comment here - https://news.ycombinator.com/item?id=48205760

reply
imtringued
43 minutes ago
Memory mapped IO sends a read request to a peripheral, which is allowed to have side effects in the background and to return a different value on each read. You can think of it as a synchronous RPC request.

The lack of argument sequencing feels utterly petty however.

reply
beeforpork
4 hours ago
The UB in unaligned pointers is even worse: an unaligned pointer in itself is UB, not only an access through it. So even implicitly casting a void*v to an int*i (like 'i=v' in C or 'f(v)' when f() accepts an int*) is UB if the cast pointer is not aligned to int.

It is important to understand that this is a C level problem: if you have UB in your C program, then your C program is broken, i.e., it is formally invalid and wrong, because it is against the C language spec. UB is not on the HW, it has nothing to do with crashes or faults. That cast from void* to int* most likely corresponds to no code on the HW at all -- types are in C only, not on the HW, so a cast is a reinterpretation at C level -- and no HW will crash on that cast (because there is not even code for it). You may think that an integer value in a register must be fine, right? No, because it's not about pointers actually being integers in registers on your HW, but your C program is broken by definition if the cast pointer is unaligned.
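
The usual way to stay inside the language here is to never form the misaligned `int*` at all and copy the bytes instead. A minimal sketch, with `load_u32_unaligned` as a hypothetical name; the caller must still guarantee that four readable bytes exist at `src`:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reads a 32-bit value (in native byte order)
 * from an address that carries no alignment guarantee. */
uint32_t load_u32_unaligned(const void *src) {
    assert(src != NULL);
    uint32_t value;
    /* Byte-wise copy: no uint32_t pointer to src is ever created. */
    memcpy(&value, src, sizeof value);
    return value;
}
```

Optimising compilers commonly turn a fixed-size `memcpy` like this into a single load on targets that allow unaligned access, so the defined version need not be slower.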

reply
thomashabets2
3 hours ago
Author here.

> an unaligned pointer in itself is UB

Yup. Per the "Actually, it was UB even before that" section in the post.

> UB is not on the HW, it has nothing to do with crashes or faults

Yeah. I tried to convey this too, but I'm also addressing the people who say "but it's demonstrably fine", by giving examples. Because it's not.

reply
account42
3 hours ago
Which is totally fine and expected for any decent programmer. Casting pointers is clearly here be dragons territory.
reply
simonask
3 hours ago
Many, many programmers come to C (and C++) with a lower-level understanding that actually gets in the way here. They understand that all types "are" just bytes and that all pointers "are" just register-sized integer addresses, because that's how the hardware works and has worked for decades.

It's perfectly reasonable to expect any load through `int*` to just load 4 bytes from memory, done and done. They get surprised that it is far from the whole story, and the result is UB.

Meanwhile, the actual computers we have been using for decades have no problems actually just loading 4 bytes through any arbitrary pointer with zero overhead. But no.

reply
lelanthran
3 hours ago
> They understand that all types "are" just bytes and that all pointers "are" just register-sized integer addresses, because that's how the hardware works and has worked for decades.

I'd clarify this with "They understand that all values are just bytes".

> Meanwhile, the actual computers we have been using for decades have no problems actually just loading 4 bytes through any arbitrary pointer with zero overhead.

It's partly the standard's fault here - rather than saying "We don't know how vendors will implement this, so we shall leave it as implementation-defined", they say "We don't know how vendors will implement this, so we will leave it as undefined".

A clear majority of the UB problems with C could be fixed if the standards committee slowly moved all UB into IB. It's not that there isn't any progress (signed two's complement is coming, after all), it's that there is (I believe) much pushback from compiler authors (who dominate the standards committee) who don't want to make UB into IB.

reply
saagarjha
59 minutes ago
Turning undefined behavior into implementation defined behavior is rarely a fix, though.
reply
lelanthran
44 minutes ago
It's a fix that removes the most pointy part of UB.

"Going past the end of the array results in addressing arbitrary values" I can live with. "Going past the end of an array results in anything happening" is a hard sell.

reply
saagarjha
32 minutes ago
I think it’s a really easy sell, actually: if you go past the end of the array far enough you end up accessing the stack which includes parts of the program like “where does this function return to” or “what is the index used to perform this access” or “there is no page mapped there”. None of these are arbitrary values.
reply
benj111
2 hours ago
> It's partly the standard's fault here - rather than saying "We don't know how vendors will implement this, so we shall leave it as implementation-defined", they say "We don't know how vendors will implement this, so we will leave it as undefined".

I'd agree to a point. I still think it's unreasonable for compiler writers to get all lawyery about precise terminology. After all "implementation defined" could still be subject to the same lawyeriness (we implemented it, ergo we define it).

To me this is an issue of culture. We need to push back against the view that UB means anything can happen, therefore the compiler can do anything.

reply
fc417fc802
1 hour ago
But it's genuinely useful. In all seriousness, are you sure you aren't perhaps just using the wrong language? At this point UB and leveraging it for optimization are core parts of the most performant C implementations.

That said, I think there are many cases where compilers could make a better effort to link UB they're optimizing against to UB that appears in the code as originally authored and emit a diagnostic or even error out. But at least we've got ubsan and friends so it seems like things are within reason if not optimal.

reply
benj111
1 hour ago
> are you sure you aren't perhaps just using the wrong language

Well I think there is a tension here. C is the language for microcontrollers and the language for high performance.

In ye olden days both groups' interests were aligned, because speed in C was about working with the machine. Now that UB has been hijacked for speed, on the microcontroller I'm working on, where I know an int will overflow and I rely on that, the overflow is UB and so may be optimised out, so I then have to think about what the compiler may do.

I wouldn't say C is the wrong language. I would say there are wrong compilers though.
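
For counters that are expected to wrap, one way to sidestep the issue is to do the arithmetic in an unsigned type, where wraparound is defined by the standard. A minimal sketch with hypothetical names for a free-running tick counter (GCC and Clang also offer `-fwrapv` to make signed overflow wrap, but that is a compiler flag rather than ISO C):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical free-running tick counter that is expected to wrap. */
void tick_advance(uint32_t *tick) {
    assert(tick != NULL);
    *tick += 1u;   /* the stored result wraps modulo 2^32 by definition */
}

/* Elapsed ticks between two samples; correct across one wraparound. */
uint32_t ticks_elapsed(uint32_t start, uint32_t now) {
    return now - start;   /* reduced modulo 2^32 when converted back */
}
```

Because the behaviour is defined, the compiler has no licence to assume the wrap cannot happen.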

reply
circuit10
1 hour ago
This series was a good explanation for me of why treating UB this way is genuinely useful: https://blog.llvm.org/2011/05/what-every-c-programmer-should...

Being able to assume certain things don't happen is powerful when you're writing optimisations; not doing that would have a real performance cost.

reply
da-alex
1 hour ago
> Meanwhile, the actual computers we have been using for decades have no problems actually just loading 4 bytes through any arbitrary pointer with zero overhead. But no.

Not if those 4 bytes span a cacheline boundary, that will most likely result in 1/2 throughput compared to loading values inside a single cacheline. And if it causes cache-misses it takes up twice the L2 or L3 bandwidth.

Even worse, if the int spans two pages, it will need two TLB lookups. If it's a hot variable and the only thing you use from those pages, it even uses up an additional TLB entry, that could otherwise be used for better perf elsewhere, etc.

And if you're on embedded (and many C programs are), Cortex-M CPUs either can't handle unaligned accesses (M0, M0+) or take 2-3 times as long (split the load into 2x2 byte or 1x2 + 2x1 byte)

reply
pjc50
3 hours ago
Except ARM32. ARM64 doesn't guarantee it to be valid in all cases either.
reply
stilley2
1 hour ago
Does that mean that if I have a struct with #pragma pack(push, 1) I can't use pointers to any members that don't happen to be aligned?
reply
saagarjha
1 hour ago
This is a non-standard extension, so your compiler may provide stronger guarantees.
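
A small sketch of the pattern that compilers supporting `#pragma pack` (GCC, Clang and MSVC among them) are generally documented to handle: go through the struct rather than through a plain pointer to the member. The `packet` layout and the function name are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#pragma pack(push, 1)
struct packet {          /* hypothetical wire layout */
    uint8_t  tag;
    uint32_t length;     /* lands at offset 1, so it is misaligned */
};
#pragma pack(pop)

uint32_t packet_length(const struct packet *p) {
    assert(p != NULL);
    /* Access via the struct type: the compiler knows the member may be
     * misaligned and emits a suitable load. Storing &p->length in a
     * plain uint32_t* and dereferencing it would lose that knowledge. */
    return p->length;
}
```

Whether a bare pointer to the member may be used is exactly the part left to the extension's own documentation.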
reply
imtringued
27 minutes ago
The problem with C UB is that originally it meant the compiler has the freedom to map your code to the hardware in spite of machine instructions differing slightly between one another. The same C program may exhibit different behaviour depending on which architecture it is running on.

This type of UB is fine and nobody really complains about hardware differences leading to bugs.

However, over time aggressive readings of UB evolved C into an implicit "Design by Contract" language where the constraints have become invisible. This creates a similar problem to RAII, where the implicit destructor calls are invisible.

When you dereference a pointer in C, the compiler adds an implicit non-nullable constraint to the function signature. When you pass in a possibly nullable pointer into the function, rather than seeing an error that there is no check or assertion, the compiler silently propagates the non-nullable constraint onto the pointer. When the compiler has proven the constraints to be invalid, it marks the function as unreachable. Calls to unreachable functions make the calling function unreachable as well.
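
One way to make that contract visible again is to write it down. A minimal sketch with a hypothetical `config` type: one function states the non-null requirement with an assertion, the other accepts a nullable pointer and checks it before any dereference:

```c
#include <assert.h>
#include <stddef.h>

struct config { int verbosity; };   /* hypothetical type */

/* Contract stated out loud: callers must pass a non-null pointer. */
int config_verbosity(const struct config *cfg) {
    assert(cfg != NULL);
    return cfg->verbosity;
}

/* Nullable variant: the check comes before the dereference. With the
 * order reversed, the compiler could treat the check as dead code. */
int config_verbosity_or(const struct config *cfg, int fallback) {
    if (cfg == NULL) {
        return fallback;
    }
    return cfg->verbosity;
}
```

Note that `assert` compiles to nothing when `NDEBUG` is defined, so the first function documents the constraint rather than enforcing it in release builds.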

reply
tovej
3 hours ago
But that seems obvious. You can't load an integer from an unaligned address.

It's not only C-level, is it? There's no (guarantee across architectures for) machine code for that either.

reply
codeflo
3 hours ago
[-]
> You can't load an integer from an unaligned address.

You can, and the results are machine specific, clearly defined and well-documented. Ancient ARM raises an exception, modern ARM and x86 can do it with a performance penalty. It's only the C or C++ layer that is allowed to translate the code into arbitrary garbage, not the CPU.

reply
saagarjha
58 minutes ago
[-]
There’s usually not a performance penalty on modern hardware
reply
matheusmoreira
3 hours ago
[-]
Sure you can. In many architectures it works just fine. Works perfectly in x86_64, for example. It's just a little slower.
reply
tovej
2 hours ago
[-]
In many architectures does not mean you can. The standard is supposed to cover all architectures.
reply
crote
1 hour ago
[-]
That's why we write C instead of assembly, isn't it?

You could also mandate that a compiler for architectures without unaligned access either has to prove that the access is going to be aligned or insert a wrapper to turn the unaligned access into two aligned ones.

Just pretending the issue doesn't exist at all and making it the programmer's problem by leaving it as UB in the spec is a choice.

reply
matheusmoreira
1 hour ago
[-]
If some architecture traps on unaligned access, then the compiler can and should simply generate the correct code so that it loads the integer piece by piece instead. Load multiple integers and shift and mask away the irrelevant bits, done. This is exactly what modern architectures already do in hardware. Works, it's just a little slower.
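
For illustration, this is roughly what "load it piece by piece" looks like when written by hand. The function name is made up, and the byte order (little-endian) is an assumption of the sketch, not something the language picks for you.

```c
#include <stdint.h>

/* Assemble a 32-bit little-endian value from four single-byte loads.
 * Byte loads have no alignment requirement, so p may be any address
 * that has at least four readable bytes behind it. */
uint32_t load_u32_le(const unsigned char *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Because only `unsigned char` accesses are involved, this is well defined at any offset, for example `load_u32_le(buf + 1)`.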

This is exactly what the compilers do if you use a packed structure to access unaligned data. Works everywhere, as expected. Compilers have always known what to do, they just weren't doing it. C standard says no.

The fact is the standard is garbage and the first thing every C programmer should learn is that they can and should ignore it. There is never any reason to wonder what the standard is supposed to do. The only thing that matters is what compilers actually do.

reply
bluGill
1 hour ago
[-]
The pointer might be something you forced. The compiler needs to do the right thing but if you set the pointer to an unaligned address because you have information on the hardware you can get this undefined situation with nothing the compiler can do about it.
reply
matheusmoreira
1 hour ago
[-]
Any reason the hardware pointer can't be accessed via the packed structure?

https://news.ycombinator.com/item?id=48205371

reply
saagarjha
57 minutes ago
[-]
The same reason you probably aren’t adding manual alignment fixes to your code?
reply
matheusmoreira
39 minutes ago
[-]
No reason at all, then. Because I am manually dealing with alignment in my code.

Wrote a lisp, its bytes type supports reading and writing integers at arbitrary locations within the buffer. Test suite exercises aligned and unaligned memory access for every C integer type. Also wrote my own mem* functions, dealing with alignment in those was certainly a fun exercise. It wasn't necessary, I just wanted the performance benefits.

reply
bluGill
1 hour ago
[-]
However, you certainly can do that. The point of unaligned is that the hardware can't load it from a single memory location in one access. It needs two accesses. And in that time, the value of one of the two addresses that the hardware has to load can change.

I would hope you're not so stupid as to design hardware that relies on this, but the fact is it certainly is possible for someone to do that. And if you do that, there is nothing that the compiler or the standard can do. It can't be done correctly

reply
matheusmoreira
49 minutes ago
[-]
Yeah, the unaligned accesses aren't going to be atomic unless the hardware supports it.

> And in that time, the value of one of the two addresses that the hardware has to load can change.

You mean volatile addresses that could spontaneously change in the middle of the reads? Like memory mapped I/O addresses?

I would expect these to have stricter access requirements than arbitrary general purpose memory locations.

> I would hope you're not so stupid as to design hardware that relies on this

You and me both.

> And if you do that, there is nothing that the compiler or the standard can do. It can't be done correctly

Anything that does that is broken and terrible anyway. It really shouldn't contaminate language design. It's the sort of thing that compilers should be adding attributes for, rather than constraining the language to the point nothing works correctly and making us use attributes on everything to restore some sane baseline behavior.

reply
bluGill
28 minutes ago
[-]
> Anything that does that is broken and terrible anyway

which is why it is undefined behaviour. the optimizer writers have told me consistently that if they can assume you're not doing this thing that's stupid anyway, they can make my code faster. And since I'm not doing that stupid thing anyway, I want my code to be faster.

reply
matheusmoreira
20 minutes ago
[-]
Unaligned memory access isn't really stupid though. Not in the general case. Not to the point where it should give the compiler free rein to crash things or introduce security holes. It should just introduce a performance regression instead, which is a tractable problem. Just measure it and fix it by making things aligned.

Compilers can add some custom attributes that encode whatever semantics the badly designed hardware requires. This lets it freely break incorrect code in the small sections that are actually handling those special variables, while allowing the rest of the language to make sense.

reply
da-alex
1 hour ago
[-]
But if it's a pointer, the compiler doesn't know the alignment at compile time. Should the compiler insert an alignment check of every pointer access?
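
A sketch of what such a check looks like by hand, next to the usual way of avoiding it. Both function names are hypothetical. The check relies on the pointer-to-integer conversion, which is implementation-defined; on mainstream flat-address platforms the low bits reflect alignment. The `memcpy` version needs no check at all, and compilers typically lower it to a plain load where the hardware allows that.

```c
#include <stdint.h>
#include <string.h>

/* Nonzero if p is suitably aligned for uint32_t. Assumes the
 * implementation-defined pointer-to-uintptr_t conversion keeps the
 * address bits, which holds on common platforms. */
int is_aligned_u32(const void *p) {
    return (uintptr_t)p % _Alignof(uint32_t) == 0;
}

/* Alignment-agnostic load: memcpy copies bytes, so p may be any
 * address with four readable bytes behind it. */
uint32_t load_u32(const void *p) {
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```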
reply
matheusmoreira
1 hour ago
[-]
Compilers could add support for an unaligned attribute that we can apply to pointers. I'd prefer that to wrapping everything in a packed structure which is quite unsightly.

Would have been better if correct behavior was the default while pointer alignment requirements were opt in, just like vector stuff. Nothing we can do about it now.

I would hope the compiler is smart enough to figure out which accesses are aligned and unaligned on its own.

reply
mbel
3 hours ago
[-]
Unless your code targets some exotic architecture, like idk x86.
reply
cataphract
1 hour ago
[-]
Not really. Wait until the compiler starts vectorizing your code and using instructions requiring alignment (like the ones with A or NT in the mnemonic).
reply
saagarjha
56 minutes ago
[-]
Usually the compiler will probably not generate those
reply
pjc50
3 hours ago
[-]
You missed the point: the pointer existing as a value of that type at all is UB, even if you never try to access anything through it and no corresponding machine code is ever emitted.
reply
tovej
2 hours ago
[-]
Yes? I agree with that. I don't really see the issue there. The computer will allocate data in aligned addresses, so you would have to be doing something weird to begin with to access unaligned pointers. And aligned access is always better anyway. I guess packed structs are a thing if you're really byte golfing. Maybe compressed network data would also make sense.

But then I would assume you are aware of unaligned pointers, and have a sane way to parse that data, rather than read individual parts of it from a raw pointer.

I am curious, what would be a legitimate reason for an unaligned pointer to int?

reply
amiga386
17 minutes ago
[-]
Can anyone explain why this is undefined behaviour? UBSan calls it "indirect call of a function through a function pointer of the wrong type"

    struct foo {int i;};
    int func(struct foo *x) {return x->i;}
    int main() {
        int (*funcptr)(void*) = (int (*)(void*)) &func;
        struct foo foo = { 42 };
        return funcptr(&foo);
    }
While this is all kosher per the language lawyers:

    struct foo {int i;};
    int func(void *x) {return ((struct foo *)x)->i;}
    int main() {
        int (*funcptr)(void*) = &func;
        struct foo foo = { 42 };
        return funcptr(&foo);
    }
reply
j16sdiz
1 minute ago
[-]
Whether two function pointer types are compatible in practice depends on the machine-specific calling convention.

I guess enumerating all the possibilities just doesn't look right? It would make the standard too long and complex?
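
For what it's worth, C11 6.3.2.3p8 does allow converting a function pointer to another function pointer type and back; what is undefined is calling through a type that is not compatible with the function's real type. A sketch of the round trip, where `generic_fn` and `call_as_foo_fn` are names invented for the example:

```c
struct foo { int i; };

int func(struct foo *x) { return x->i; }

/* Generic storage type for "some function pointer". */
typedef void (*generic_fn)(void);

int call_as_foo_fn(generic_fn f, struct foo *arg) {
    /* Convert back to the function's real type before calling;
     * calling f directly through generic_fn would be UB. */
    int (*typed)(struct foo *) = (int (*)(struct foo *))f;
    return typed(arg);
}
```

The caller stores the pointer with `(generic_fn)func` and the call site restores the original signature, so the call itself always goes through the correct type.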

reply
tomp
13 minutes ago
[-]
Casting to a pointer of incompatible type is UB. The exception is casting to char*.
reply
quelsolaar
4 hours ago
[-]
The 5 stages of learning about UB in C:

-Denial: "I know what signed overflow does on my machine."

-Anger: "This compiler is trash! why doesn't it just do what I say!?"

-Bargaining: "I'm submitting this proposal to wg14 to fix C..."

-Depression: "Can you rely on C code for anything?"

-Acceptance: "Just don't write UB."

reply
matheusmoreira
3 hours ago
[-]
What stage is the "just make the compiler define the undefined" stage?

Unaligned access? Packed structs. Compiler will magically generate the correct code, as if it had always known how to do it right all along! Because it has, in fact, always known how to do it right. It just didn't.

Strict aliasing? Union type punning. Literally documented to work in any compiler that matters, despite the holy C standard never saying so. Alternatively, just disable it straight up: -fno-strict-aliasing. Enjoy reinterpreting memory as you see fit. You might hit some sharp edges here and there but they sure as hell aren't gonna be coming from the compiler.
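
A small sketch of union type punning, relying on the behaviour GCC documents for reading a union member other than the one last written. The function name is hypothetical, and the expected bit pattern assumes an IEEE 754 single-precision `float`.

```c
#include <stdint.h>

/* The sketch only makes sense if float is 32 bits wide. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Reinterpret the bits of a float as a 32-bit integer via a union. */
uint32_t float_bits(float f) {
    union { float f; uint32_t u; } pun;
    pun.f = f;
    return pun.u;
}
```

On an IEEE 754 platform `float_bits(1.0f)` is `0x3f800000`.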

Overflow? Just make it defined: -fwrapv. Replace +, -, * with __builtin_*_overflow while you're at it, and you even get explicit error checking for free. Nice functional interface. Generates efficient code too.
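
A minimal sketch of the builtin approach. `__builtin_add_overflow` is a GCC/Clang extension, not standard C; the wrapper name `checked_add` is made up for the example.

```c
#include <limits.h>
#include <stdbool.h>

/* Checked signed addition using the GCC/Clang builtin.
 * Returns true if the sum fits in an int. On overflow it returns
 * false and *out holds the wrapped result. */
bool checked_add(int a, int b, int *out) {
    return !__builtin_add_overflow(a, b, out);
}
```

The caller gets an explicit success flag instead of UB: `checked_add(INT_MAX, 1, &r)` returns false rather than overflowing.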

The "acceptance" stage is really "nobody sane actually cares about the C standard". The standard is garbage, only the compilers matter. And it turns out that compilers have plenty of extremely useful functions that let you side step most if not all of this. People just don't use this because they want to write "portable" "standard" C. The real acceptance is to break out of that mindset.

Somehow I built an entire lisp interpreter in freestanding C that actually managed to pass UBSan just by following the above logic. I was actually surprised at first: I expected it to crash and burn, but it didn't. So if I can do it, then anyone can do it too.

reply
quelsolaar
6 minutes ago
[-]
A lot of the central UB cannot be defined, because a definition would rely on detection. In order to have well-defined behaviour (by the standard or the compiler), the implementation first needs to detect that the behaviour is triggered, and this is often very tricky or expensive. It's easy to define that a program should halt if it writes outside an array, but detecting whether it does can be both slow and hard to implement. There are implementations that do, but they are rarely used outside of debugging.

A better way to think about UB is as a contract between developer and implementation, so that the implementations can more easily reason about the code. How would you optimize:

(x * 2) / 2

An optimizer can optimize this out for a signed integer, because it doesn't have to consider overflow, but with an unsigned integer it cannot. UB is a big reason why C is the most power efficient high level language.
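
To make the difference concrete, here is the expression in both flavours (function names invented for the example). Unsigned arithmetic wraps by definition, so the result really does differ from `x` for large inputs and the compiler has to keep the arithmetic.

```c
#include <limits.h>

/* Signed: overflow in x * 2 is UB, so the compiler may assume it
 * never happens and simplify the whole expression to x. */
int signed_roundtrip(int x) {
    return (x * 2) / 2;
}

/* Unsigned: x * 2 wraps modulo 2^N, so for large x the result is
 * not x and the expression cannot be folded away. */
unsigned unsigned_roundtrip(unsigned x) {
    return (x * 2) / 2;
}
```

For example, `unsigned_roundtrip(UINT_MAX)` yields `UINT_MAX / 2`, not `UINT_MAX`.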

reply
gpderetta
3 hours ago
[-]
> Unaligned access? Packed structs.

Packed structs are dangerous. You can do unaligned accesses through a packed type, but once you take the address of your misaligned int field, then you are back into UB territory. Very annoying in C++ when you try to pass a misaligned field through what happens to be generic code that takes a const reference, as it will trigger a compiler warning. Unary operator+ is your friend.

reply
matheusmoreira
2 hours ago
[-]
> but once you take the address of your misaligned int field

Gotta work with the structure directly by taking the address of the packed structure itself.

  struct uu64 {
      u64 value;
  } __attribute__((packed));

  struct uu64 unaligned;
  struct uu64 *address = &unaligned;

  address->value; // this works

  u64 *broken = &address->value; // this doesn't
Taking the address of the field inside the structure essentially casts away the alignment information that was explicitly added to stop the compiler from screwing things up. So it should not be done.

Mercifully, both gcc and clang emit address-of-packed-member warnings if it's done. So the packed structures are effectively turning silently broken nonsense code into sensible warnings. Major win.
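
A self-contained sketch of the same idea, using the GCC/Clang `packed` attribute (a compiler extension). The struct and accessor names are hypothetical. The `value` member sits at offset 1, so it is misaligned, but it is only ever touched through the packed struct, never through a plain `uint64_t *`.

```c
#include <stddef.h>
#include <stdint.h>

struct __attribute__((packed)) record {
    uint8_t  tag;    /* offset 0 */
    uint64_t value;  /* offset 1: misaligned for a uint64_t */
};

/* Access goes through the packed struct, so the compiler knows the
 * member may be unaligned and generates suitable code. */
uint64_t record_value(const struct record *r) { return r->value; }
void record_set(struct record *r, uint64_t v) { r->value = v; }
```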

reply
lelanthran
3 hours ago
[-]
> What stage is the "just make the compiler define the undefined" stage?

It can be left as implementation defined, which means that the compiler can't simply do arbitrary things, it needs to document what it would do.

Take, for example, signed-integer overflow: currently a compiler can simply refuse to emit the code in one spot while emitting it in another spot in the same compilation unit! Making it IB means that the compiler vendor will be forced to define what happens when a signed-integer overflows, rather than just saying, as they do now, "you cannot do that, and if you do we can ignore it, correct it, replace it or simply travel back in time and corrupt your program".

> Somehow I built an entire lisp interpreter in freestanding C that actually managed to pass UBSan just by following the above logic. I was actually surprised at first: I expected it to crash and burn, but it didn't. So if I can do it, then anyone can do it too.

Same here; I built a few non-trivial things that passed the first attempt at tooling (valgrind, UBsan with tests, fuzzing, etc) with no UB issues found.

reply
matheusmoreira
2 hours ago
[-]
Completely agree. It can, and I think it's extremely annoying that it wasn't.

So we have the next best thing: builtins and flags. So long as those cover all the undefined behavior there is, we can live with it. Compiler gets to be "conformant" and we get to do useful things without the compiler folding the code into itself and inside out.

reply
thomashabets2
3 hours ago
[-]
Author here.

> -Acceptance: "Just don't write UB."

The point of my article is that this is not possible. This cannot be our end state, as long as humans are the ones writing the code. No human can avoid writing UB in C/C++.

reply
jart
2 hours ago
[-]
It's honestly not that difficult to be rigorous. The things you mentioned in the blog post are pretty obvious forms of degenerate practices once you get used to seeing them. The best way to make your argument would be to bring up pointer overflow being UB.

What's great about undefined behavior is that the C language doesn't require you to care. You can play fast and loose as much as you want. You can even use implicit types and yolo your app, writing C that more closely resembles JavaScript, just like how traditional K&R C devs did back in the day under an ILP32 model. Then you add the rigor later if you care about it.

For most stuff, like an experiment, we obviously don't care, but when I do, I can usually one shot a file without any UB (which I check by reading the assembly output after building it with UBSan) except there's just one thing that I usually can't eliminate, which is the compiler generating code that checks for pointer overflow. Because that's just such a ridiculous concept on modern machines which have a 56 bit address space. Maybe it mattered when coding for platforms like i8086. I've seen almost no code that cares about this. I have to sometimes, in my C library. It's important that functions like memchr() for example don't say `for (char *p = data, *e = data + size; p<e; ...` and instead say `for (size_t i = 0; i < size; ++i) ...data[i]...`. But these are just the skills you get with mastery, which is what makes it fun.

Oh speaking of which, another fun thing everyone misses is the pitfalls of vectorization. You have to venture off into UB land in order to get better performance. But readahead can get you into trouble if you're trying to scan something like a string that's at the end of a memory page, where the subsequent page isn't mapped.

My other favorite thing is designing code in such a way that the stack frame of any given function never exceeds 4096 bytes, and using alloca in a bounded way that pokes pages if it must be exceeded.

If you want to have a fun time experiencing why the trickiness of UB rules are the way they are, try writing your own malloc() function that uses shorts and having it be on the stack, so you can have dynamic memory in a signal handler.
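
A sketch of the index-based loop shape mentioned for memchr(), under the name `my_memchr` to make clear it is not any real libc's implementation. The loop never forms an end pointer up front; it only computes addresses of elements it actually reads.

```c
#include <stddef.h>

/* Index-based scan: no data + n end pointer is ever computed, only
 * addresses of bytes that are actually accessed. */
void *my_memchr(const void *data, int c, size_t n) {
    const unsigned char *p = data;
    for (size_t i = 0; i < n; ++i) {
        if (p[i] == (unsigned char)c)
            return (void *)(p + i);
    }
    return NULL;
}
```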
reply
thomashabets2
17 minutes ago
[-]
> It's honestly not that difficult to be rigorous.

Ok, let's try it. I pointed GPT 5.5 at the smallest part of cosmopolitan I could find in two seconds, net/finger. 299 lines.

describesyn.c:66: q + 13 constructs a pointer that can point well beyond the array plus one element.

C23 6.5.6p9:

> If the pointer operand and the result do not point to elements of the same array object or one past the last element of the array object, the behavior is undefined
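
A sketch of how a bounds check can be phrased without ever forming the out-of-range pointer. `has_room` is a hypothetical helper, not code from the project discussed; it compares lengths instead of computing `q + n` and comparing pointers.

```c
#include <stddef.h>

/* Nonzero if at least n bytes remain in [q, end).
 * Requires q and end to point into (or one past) the same array,
 * with q <= end, so that end - q is itself well defined. */
int has_room(const char *q, const char *end, size_t n) {
    return (size_t)(end - q) >= n;
}
```

With this shape, a test such as `has_room(q, end, 13)` is fine even when fewer than 13 bytes remain, whereas `q + 13 <= end` would already be UB in that case.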

Now… you may be trolling, but I do feel like this disproves your assertion. Not you, not me, not Theo de Raadt, can avoid UB.

> the compiler generating code that checks for pointer overflow.

Do you need to check for that specifically? What pointer are you constructing that is not either pointing at a valid object correctly aligned (not UB), or exactly one past the element of an array?

Do you mean for the latter, in case you have an array that ends on the maximum expressible pointer address?

I'm a bit unclear on what you mean by "pointer overflow". From mentioning 56 bit address spaces I'm guessing you mean like the pointer wrapped, not what I pointed to in cosmopolitan, above?

Ok, to be clear that it's not just that one type, if you forgive that one:

net/http/base32.c:64: read sc[0] even if sl=0. I assume this is never called with sl=0, so could be fine.

net/http/ssh.c:355: pointer address underflow? Should that be `e - lp`?

net/http/ssh.c:209/229: double destroy of key. can this code path have non-null members, meaning double free? Looks like it, since line 207 does the parsing and checks that parse worked.

net/http/ssh.c:123: uses memset, which assumes that it sets member variable pointers to NULL (per my post, depending on that means depending on UB), and later these pointers are given to free(), so that's UB.
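
For comparison, a sketch of initialization that does not depend on all-bits-zero being a null pointer, and a destroy that tolerates being called twice. The `struct key` here is hypothetical and only illustrates the pattern; it is not the struct from the file above.

```c
#include <stdlib.h>

struct key {               /* hypothetical struct for illustration */
    char *name;
    unsigned char *blob;
};

/* Initializing with {0} sets pointer members to null pointers,
 * whatever bit pattern a null pointer has on the platform. */
struct key key_new(void) {
    struct key k = {0};
    return k;
}

void key_destroy(struct key *k) {
    free(k->name);         /* free(NULL) is a no-op */
    free(k->blob);
    k->name = NULL;        /* makes a second destroy harmless */
    k->blob = NULL;
}
```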

I won't look deeper into net/http, but presenting just the possibly incorrect remaining comments from jippity:

  - ssh.c:211 and parsecidr.c:44: length-taking APIs use unbounded strstr() / strchr(), so explicit n with non-NUL-terminated input can read beyond the buffer.

  - tokenbucket.c:77 and tokenbucket.c:92: x >> (32 - c) is UB for c == 0 and for out-of-range c.

  - isacceptablehost.c:68: long numeric host labels can overflow signed int b before the function eventually rejects/accepts the host.
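
On the shift item: assuming `x >> (32 - c)` is one half of a 32-bit rotate (the snippet alone doesn't confirm that), the usual way to keep every shift count in range looks like this. `rotl32` is a name chosen for the sketch.

```c
#include <stdint.h>

/* Rotate left with both shift counts kept in 0..31, so neither
 * shift is ever by 32 or more. For c == 0 this returns x. */
uint32_t rotl32(uint32_t x, unsigned c) {
    c &= 31;
    return (x << c) | (x >> ((32 - c) & 31));
}
```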
reply
frollogaston
42 minutes ago
[-]
"Just don't write UB" sounds like still part of the bargaining stage at best
reply
im3w1l
3 hours ago
[-]
In C, acceptance is "I will write UB and it will eventually lead to something bad happening"
reply
Ygg2
4 hours ago
[-]
> -Acceptance: "Just don't write UB."

Just switch to a saner language.

And before I get attacked for being a Rust shill, I meant Java :P

The bar is so low it's floating near the center of the Earth.

reply
dns_snek
3 hours ago
[-]
> And before I get attacked for being a Rust shill, I meant Java :P

If all you want is C but less insane then the obvious answer here is Zig.

reply
simonask
3 hours ago
[-]
Zig is cool, but it is not even close to being ready for prime-time. It will be pre-1.0 for a while, and major breaking changes are still happening.
reply
dns_snek
3 hours ago
[-]
Sure, maybe don't bet your entire company on mountains of Zig code just yet, but aside from the breaking changes it's been perfectly usable and suitable for every project I've ever wanted to work on.
reply
AgentME
2 hours ago
[-]
If someone is switching from C because it's too easy to trigger undefined behavior, picking one of the few other not memory safe languages is missing the point.
reply
psychoslave
3 hours ago
[-]
If all somebody wants is a programming language saner than C/C++ in these matters, there are plenty of off-the-shelf options to pick from.

If all somebody wants is a turnkey replacement for the C/C++ ecosystem, then there is nothing like that in the world that I’m aware of.

reply
p2detar
4 hours ago
[-]
> Just switch to a saner language.

And where's the fun in that?

reply
psychoslave
3 hours ago
[-]
That’s a matter of taste. Being reminded at every move that what you express depends on some technical detail is great when you love technical details and have all the leisure time to pay attention to them. It is going to be hell, compared to sound defaults, for someone who wants to focus on delivering higher-order features and functionality that will most likely work just fine.

Undefined behaviour means "we couldn’t settle on a best default trade-off with fine-tuning as an option, so we left everyone in the unknown".

reply
ErroneousBosh
3 hours ago
[-]
Okay, so Java compiles to machine code now?

Because the last time I looked it appeared to need some godawful slow bytecode interpreter that took up thousands of kilobytes of RAM.

reply
elch
3 hours ago
[-]
If you don't like JIT/JVM there's GraalVM Native Image.

https://www.graalvm.org/latest/reference-manual/native-image...

In the past you could use e.g. Excelsior JET.

reply
ErroneousBosh
1 hour ago
[-]
Great, can you fit it into 768 bytes of flash and 64 bytes of RAM?
reply
crote
45 minutes ago
[-]
It isn't 1970 anymore. You can get 32-bit ARM MCUs with tens of kilobytes of flash and multiple kilobytes of RAM for less than 10 cents.

We've long since reached a point where chips are cheap enough to be disposable. They are included in paper transit tickets and price tags. There is basically no market left where your volume is small enough that custom application-specific ICs aren't an option, but your volume is large enough that the cost of a few additional kilobytes of memory isn't massively outweighed by the developer time saved.

Want several megabytes of RAM and flash to run Java? That's the price of a cup of coffee!

reply
ErroneousBosh
10 minutes ago
[-]
> It isn't 1970 anymore. You can get 32-bit ARM MCUs with tens of kilobytes of flash and multiple kilobytes of RAM for less than 10 cents.

Do they run at single-digit nA current draw?

reply
pjc50
3 hours ago
[-]
Java has been jitted for .. decades?
reply
Hendrikto
2 hours ago
[-]
You know what JIT means, right? It means that it is not compiled from the start and indeed runs on a bytecode interpreter until the JIT compiler kicks in.
reply
fc417fc802
1 hour ago
[-]
The java JIT has produced sufficiently fast code for all but the most demanding of HPC applications for going on 20 years. I realize keeping up with new developments can be difficult but the out of date java performance memes are entirely ridiculous by now.

Meanwhile half the world appears to run on cpython of all things.

reply
1718627440
2 hours ago
[-]
> -Denial: "I know what signed overflow does on my machine."

Or you just don't skip the introductory pages that tell you what the language philosophy of C is, and why there is UB. Yes, UB can be a struggle, but the first four steps are entirely unnecessary. It means that you do not actually understand the core concepts of the very same language you are using, which is kinda stupid.

reply
whizzter
1 hour ago
[-]
I think the issue has been that the line between de-jure and de-facto behaviours has shifted over the years as compiler optimizations suddenly began relying on de-jure interpretations of UB to increase performance while ignoring de-facto usage of the language.

When that started happening, people became alarmed (oMG UB iS TeH BAD!), and since some old UB machines still had industry support (of organisations that actually participated in ISO meetings instead of arguing online) there was never any movement on defining de-facto usage as de-jure and the alarmist position became the default.

Personally I think the industry would've benefited from a Boring C (as described by DJB) push by people that would've created a public parallel "de-jure" standard that would've had a chance to be adopted by compiler creators.

reply
1718627440
1 hour ago
[-]
> I think the issue has been that the line between de-jure and de-facto behaviours has shifted over the years as compiler optimizations suddenly began relying on de-jure interpretations of UB to increase performance while ignoring de-facto usage of the language.

I guess I am too young, and also too much a purist, because I start from the impression of what the language is, not what the implementations happen to do.

> Personally I think the industry would've benefited from a Boring C (as described by DJB) push by people that would've created a public parallel "de-jure" standard that would've had a chance to be adopted by compiler creators.

-O0

reply
greysphere
4 hours ago
[-]
The examples aren't really undefined behavior. They are examples that could become UB based on input/circumstances. Which if you are going to be that generous, every function call is UB because it could exceed stack space. Which is basically true in any language (up to the equivalent def of UB in that language). I feel like c has enough actual rough edges that deserve attention that sensationalism like this muddies folks attention (particularly novices) and can end up doing more harm than good.
reply
guerby
4 hours ago
[-]
Ada 83 has no UB on call stack overflow, from the reference manual :

http://archive.adaic.com/standards/83lrm/html/lrm-11-01.html

"STORAGE_ERROR This exception is raised in any of the following situations: (...) or during the execution of a subprogram call, if storage is not sufficient."

reply
veltas
4 hours ago
[-]
So it's just as useful as when your stack area ends with a page that will segfault on access, or your CPU will raise an interrupt if stack pointer goes beyond a particular address?

It's not safe though because throwing an exception, panicking, etc, is still a denial of service. It's just more deterministic than silently overwriting the heap instead. If the program is critical then you need to be able to statically prove the full size of the stack, which you can do with C and C++ with the right tools and restrictions.

reply
simonask
3 hours ago
[-]
Deterministic, well-defined behavior is inherently safer than undefined behavior. It allows you to diagnose the problem and fix it. UB emphatically does not, and I don't dare to think of how many millions of person-hours are wasted every year dealing with the results.
reply
bregma
1 hour ago
[-]
A segfault is considered safe if you're talking about functional safety because it results in a return to a defined safe state (RTDSS).

If a segfault leads to some other state you do not deem "safe", such as a single program gating access to a valuable asset with a default fail state of "allow", you just have a fundamental design flaw in your system. The safety problem is you or your AI agent, not the segfault.

reply
eru
4 hours ago
[-]
That's not true at all.

First, you can define what happens when stack space is exceeded. Second not all programs need an arbitrary amount of stack space, some only need a constant amount that can be calculated ahead of time. (And some languages don't use a stack at all in their implementations.)

Your language could also offer tools to probe how much stack space you have left, and make guarantees based on that. Or they could let you install some handlers for what to do when you run out of stack space.

reply
pjc50
4 hours ago
[-]
UB based on input can be an exploit vector.
reply
layer8
4 hours ago
[-]
Unvalidated input can always be an exploit vector.
reply
Ygg2
4 hours ago
[-]
Except in C, validation of user input can in itself be an exploit vector.
reply
layer8
3 hours ago
[-]
That’s true in other languages as well. Any programmatic task can end up being an exploit vector.
reply
pjc50
3 hours ago
[-]
No? That's the whole point of formal verification?

You can even kind of retrofit this to C. The classic example is "sel4". You just need a set of proofs that the code doesn't trigger UB. This ends up being much larger and more complicated than the C itself.

reply
rocketrascal
49 minutes ago
[-]
You can fail to verify something which you actually wanted to verify (i.e. you made a proof of something else instead of the thing that mattered). See WPA2 KRACK as an example.
reply
greybeard69
4 hours ago
[-]
Turtles all the way down.
reply
stevenhuang
4 hours ago
[-]
The examples are unequivocally UB. Full stop.

How to think of this properly is that when you have UB, you are no longer under the auspices of a language standard. Things may work fine for a time, indefinitely even. But what happens instead is you unknowingly become subject to whimsies of your toolchain (swap/upgrade compilers), architecture, or runtime (libc version differences).

You end up building a foundation on quicksand. That's the danger of UB.

reply
flohofwoe
4 hours ago
[-]
> The examples are unequivocally UB. Full stop.

Tbh, already the first example (unaligned pointer access) is bogus and the C standard should be fixed (in the end the list of UB in the C standard is entirely "made up" and should be adapted to modern hardware, a lot of UB was important 30 years ago to allow optimizations on ancient CPUs, but a lot of those hardware restrictions are long gone).

In the end it's the CPU and not the compiler which decides whether an unaligned access is a problem or not. On most modern CPUs unaligned load/stores are no problem at all (not even a performance penalty unless you straddle a cache line). There's no point in restricting the entire C standard because of the behaviour of a few esoteric CPUs that are stuck in the past.

PS: we also need to stop with the "what if there is a CPU that..." discussions. The C standard should follow the current hardware, and not care about 40 year old CPUs or theoretical future CPU architectures. If esoteric CPUs need to be supported, compilers can do that with non-standard extensions.

reply
account42
3 hours ago
[-]
Not having unaligned access in the language allows the compiler to assume that, for basic types where the alignment is at least the size, if two addresses are different then they don't alias and writes to one can't change the result of reads from the other. That's a very useful assumption to be able to make for optimization - much more useful than yolocasting pointers in a way that could get you unaligned ones.
reply
flohofwoe
1 hour ago
[-]
> if two addresses are different ...

Eh, if the compiler knows that two addresses are different at compile time, it also knows how big the difference is.

reply
saagarjha
52 minutes ago
[-]
Usually this is not the case.
reply
stevenhuang
4 hours ago
[-]
I agree. I meant to elaborate more on how to think of UB.

For most C software on x86_64, UB is "fine" with very strong bunny ears. But it is preferable for one to, shall we say, write UB intentionally rather than accidentally and unknowingly. Having an awareness of all the minefields lends more respect for the dangers of C code; it makes one question literally everything, and that would hopefully result in more correct code, more often.

On that note, on some RISC-V cores unaligned access can turn a single load into hundreds of instructions.

I think the problem is just that C is under specified for what we expect a language to provide in the modern age. It is still a great language, but the edges are sharp.

reply
leni536
3 hours ago
[-]
Undefined means that ISO C doesn't define the behavior. An implementation is free to do so.
reply
simonask
3 hours ago
[-]
If they do, that is no longer an implementation of C. It is a dialect of C, and there are many (GNU C being the most popular), but there are real drawbacks to using dialects.

This is in contrast to the other category that exists, which is "implementation-defined".

reply
flohofwoe
1 hour ago
[-]
The thing is that the actual compiler behaviour matters more for real-world projects than what the C standard says. E.g. the C standard was always retroactive, it merely tried to rein in wildly different compiler behaviour at the time when the standard was new. It mostly succeeded, but still the most useful C and C++ compiler features are living in non-standard extensions.
reply
1718627440
2 hours ago
[-]
> If they do, that is no longer an implementation of C.

This is plain wrong. Undefined behaviour means the C standard specifies no restriction on the behaviour of the program, which is what the implementation chooses to emit. An implementation can very well choose to emit any program it pleases, including programs that encrypt your harddisk, but also programs that stick to well defined rules.

reply
simonask
2 hours ago
[-]
Sure, but the point is that code written against such a compiler is not C and is not portable. It is written in a dialect of C, and that comes with drawbacks.

Writing C (or any language) means adhering to the standard, because that's the definition of the language.

reply
rocketrascal
30 minutes ago
[-]
You can't make any useful software in "Portable C" - or any portable language for that matter.

Side effects matter, and they are always non-portable/implementation defined/dependent on the hardware.

What printf() actually does is implementation defined - what does "printing" mean, does a console even exist? Maybe a user expects it to show graphical ascii/utf8 glyphs on an LCD display? Well, not every computer has that, so now what?

reply
skydhash
25 minutes ago
[-]
Maybe it’s a generation thing. Languages like ML and Lisp have many implementations, while newer languages like Perl and Python are steered by a single organization. It’s way easier for the latter to have a single source of truth.

The C standard reminds me of Posix. You have a rough guideline if you ever wanted to port a program, but you actually have to learn the new compiler and its actual behavior before doing so.

reply
IshKebab
4 hours ago
[-]
There are still modern CPUs that don't support misaligned access. It would be insane for C to mandate that misaligned accesses are supported.

However I do agree that just saying "the behaviour is undefined" is an unhelpful cop-out. They could easily say something like "non-atomic misaligned accesses either succeed or trap" or something like that.

> In the end it's the CPU and not the compiler which decides whether an unaligned access is a problem or not.

Not just the CPU - memory decides as well. MMIO devices often don't support misaligned accesses.

reply
1718627440
2 hours ago
[-]
> They could easily say something like "non-atomic misaligned accesses either succeed or trap" or something like that.

That means that the compiler must emit the read, even if the value is already known or never used, as it might trap. There is a reason for the UB!

reply
IshKebab
1 hour ago
[-]
No it doesn't. Compilers are only required to emit the read for volatile types. If the type is non-volatile, misaligned, and can be optimised out then it would be perfectly fine to omit it (that would be the "succeed" option).
reply
1718627440
49 minutes ago
[-]
If a trap is observable behaviour, then the compiler either needs to add code, that checks for the condition and then traps explicitly or it needs to actually perform the read. Currently it can be optimized out, because it is UB.
reply
thayne
4 hours ago
[-]
On hardware that doesn't support it, misaligned loads could be compiled to multiple loads and shifts. Probably not great for performance, and it doesn't work if you need it to be atomic, but it isn't impossible.
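
A hand-written sketch of that idea, assuming a little-endian layout. `load_u32_le` is a hypothetical helper name for illustration, not what any particular compiler emits:

```c
#include <stdint.h>

/* Read a 32-bit little-endian value from an address of any alignment.
   Only byte-sized loads are performed, so no alignment requirement applies. */
uint32_t load_u32_le(const unsigned char *p) {
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

The four byte loads are not a single atomic access, which is the limitation mentioned above.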
reply
gizmo686
3 hours ago
[-]
That still requires detecting when a misaligned load happens.
reply
IshKebab
1 hour ago
[-]
That is only really possible if you know the pointer is misaligned at compile time (which does happen, e.g. for packed structs). The examples in the article are for runtime misalignment. It would be crazy to generate code so that every function checked if every access was aligned at runtime.

(Note the normal way to handle that if the hardware doesn't actually support it is for the access to trap and then the OS or firmware emulates it.)

reply
account42
3 hours ago
[-]
For x86 SSE there are aligned instructions that will trap on unaligned access.
reply
account42
3 hours ago
[-]
Yes, this article is pretty much the definition of FUD.
reply
bestouff
5 hours ago
[-]
The problem of UB is not really that it may crash in some architecture. The real problem is that the compiler expects UB code to NOT happen, so if you write UB code anyway the compiler (and especially the optimizer) is allowed to translate that to anything that's convenient for its happy path. And sometimes that "anything" can be really unexpected (like removing big chunks of code).
reply
inkysigma
4 hours ago
[-]
One example along this path is that every function must either terminate or have a side effect. I don't think this one has bitten me yet but I could completely see how you accidentally write some kind of infinite loop or recursion and the function gets deleted. Also, bonus points for tail recursion so this bug might only show up with a higher optimization level if during debug nothing hit the infinite loop.
reply
account42
3 hours ago
[-]
Infinite loop without side effects == program stuck and not responding on user input and not outputting anything. That's not something a useful program will ever want to do.
reply
Certhas
2 hours ago
[-]
Not true, C++ made it so trivial infinite loops are not UB because it turns out they do have legitimate uses.

https://lists.isocpp.org/std-proposals/2020/05/1322.php

https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2024/p28...

reply
account42
2 hours ago
[-]
Yes, the C++ committee has been making some stupid decisions lately. This is not the only one.

Low level platform-specific code that needs to hot spin until an interrupt happens can use assembly for that part which it will need to do for the interrupt handler anyway.

reply
zarzavat
2 hours ago
[-]
reply
account42
2 hours ago
[-]
This is already UB without an infinite loop.
reply
xigoi
2 hours ago
[-]
The problem is when you accidentally write an infinite loop. In a different language, you run the code, see that it gets stuck and fix it. In C, the compiler may delete the function, making it hard to realize what is happening.
reply
account42
2 hours ago
[-]
This is not a problem that C or C++ programmers actually encounter, ever.
reply
1718627440
2 hours ago
[-]
Note, that this is not true for C.
reply
1718627440
2 hours ago
[-]
That's only true in C++ though, not in C.
reply
dzaima
2 hours ago
[-]
C does allow unconditional infinite loops (e.g. "while (1) { }" isn't UB) but still is UB if the controlling expression isn't constant (e.g. "while (two < 10) { }" is UB if two is a variable less than 10)
reply
1718627440
2 hours ago
[-]
Yes, that is a problem, but this is also the most useful feature and reason for UB. People who suggest just defining it or making it unspecified miss that the compiler being able to remove whole parts of a program is the point. When I write code that is UB for certain inputs, it is because I do not intend the program to have any behaviour for these inputs. I do want the compiler to optimize those away or do anything that follows from the behaviour of the other defined cases. It is deeply satisfying to add some conditions triggering log strings and see that they do not occur in the binary, because they can be only reached via UB.
reply
eru
4 hours ago
[-]
Yes, a crash is about the most benign UB: at least it's highly visible.

In worse scenarios, your programme will silently continue with garbage, or format your hard disk or give attackers the key to the kingdom.

reply
rando1234
3 hours ago
[-]
The point in the article that 'It's not about optimisations' really got my attention. I've previously done some work where we wrote an analysis pass under the assumption that it executed last in the transformation pipeline and this was needed for correctness. The assumption was that since no further optimisations happened it was safe. Now I'm not so sure...
reply
account42
3 hours ago
[-]
That's a feature, not a problem.
reply
anilakar
5 hours ago
[-]
Removing code paths that the programmer has explicitly laid out in the source code should be made a hard compile error unless the operation has been tagged with an attribute (anyone who wants to add the unsafe keyword to C? ).

Another commenter suggested using LLMs, but I disagree. Having clangd emit warning squiggles for unchecked operations (like signed addition) would be a good start.

reply
flohofwoe
4 hours ago
[-]
> Removing code paths that the programmer has explicitly laid out in the source code should be made a hard compile error unless the operation has been tagged with an attribute (anyone who wants to add the unsafe keyword to C? ).

Dead code elimination is essential for performance, especially when using templates (this is basically what enables the fabled "zero cost abstraction" because complex template code may generate a lot of 'inactive' code which needs to be removed by the optimizer).

The actual issue is that the compiler is free to eliminate code paths after UB, but that's also not trivial to fix (and some optimizations are actually enabled by manually injecting UB (like `__builtin_unreachable()` which can make a measurable difference in the right places).
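
A minimal sketch of that kind of manually injected UB, assuming GCC or Clang (`__builtin_unreachable()` is an extension of those compilers). The function `apply` and its contract are hypothetical:

```c
/* Hypothetical example: the caller promises that op is always 0 or 1. */
int apply(int op, int a, int b) {
    switch (op) {
    case 0: return a + b;
    case 1: return a - b;
    default:
        /* GCC/Clang extension: reaching this point is undefined behaviour,
           so the optimizer may drop the range check on op entirely. */
        __builtin_unreachable();
    }
}
```

Calling `apply` with any other `op` breaks the promise and is UB, which is exactly the trade being made.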

reply
peterfirefly
21 minutes ago
[-]
> free to eliminate code paths after UB

before.

reply
1718627440
2 hours ago
[-]
> The actual issue is that the compiler is free to eliminate code paths after UB

Note that the compiler can also omit code paths before UB, as UB is a property of the whole program, not just of a single statement.

reply
amoss
5 hours ago
[-]
Dead code elimination is run multiple times, including after other optimizations. So code that is not initially dead may become dead after propagating other information. Converting dead code into an error condition would make most generic code that is specialized for a particular context illegal.
reply
gpderetta
2 hours ago
[-]
Consider:

   enum op_t{ add, mul };
   int exec(op_t op, int a, int b) {
       if(op == add) { return a+b; }
       if(op == mul) { return a*b; }
   }

   c = exec(add, a,b);
Should the compiler be prevented from inlining exec and constant-propagating op and removing the mul branch? What about if a and b are constants and the addition itself is optimized away?
reply
4gotunameagain
5 hours ago
[-]
This is trickier than it initially seems. Using preprocessor directives to include or exclude swaths of code is a very common thing, and implementing a compiler error as you described would break the building of countless C codebases.
reply
nullpwr
14 minutes ago
[-]
Excellent post. But it's addressed to the wrong people.

The problem lies with compilers, not with the language and its specification, or with the creators of the C programming language.

Anyone can write a compiler that transforms all undefined behaviors (UB) into defined behaviors (DB). And your compiler will be used by people, including me.

reply
parasti
3 hours ago
[-]
I have never in my 20 years of writing C heard so much about undefined behavior as I have in the past 6 months on Hacker News. It has never entered the conversation. You write the code. If it doesn't work, you debug it and apply a fix or a workaround. Why does the idea of undefined behavior in C get to the front page so consistently?
reply
simonask
2 hours ago
[-]
Excuse me, what? I was writing both C and C++ 20 years ago, and UB was a huge part of the conversation (and the curriculum) back then as well.

There were a few high-profile "scandals" around GCC 3.2 (IIRC) because the compiler finally started much more aggressively using UB in optimizations, which was a reason that lots of people stayed on GCC 2.95 for a very long time. GCC 3.2 came out in 2002.

reply
parasti
2 hours ago
[-]
Started in 2005. Never ever did anyone complain about UB in my years of writing C code and patching other people's C code. I knew it exists - as a spec quirk. (Admittedly, never wrote a compiler and never used anything except gcc and clang.)
reply
bregma
1 hour ago
[-]

    There are more things in heaven and earth, Horatio,
    Than are dreamt of in your philosophy
You've probably been churning out possibly malformed code for years. Now you're becoming aware of your shortcomings. This is usually considered the transition from intermediate- to senior-level programmer.
reply
Etheryte
3 hours ago
[-]
Because the production environment might be a completely different architecture, these details matter a lot. Works on my machine is not useful if your actual target is a small embedded system on top of a cell tower in the middle of nowhere. Granted, most people don't work on stuff like that, I imagine the vast majority of devs here are web developers, but even still it's an interesting discussion even if you haven't run into it yourself. Maybe even more so in that case.
reply
spacedcowboy
2 hours ago
[-]
Um, as an embedded developer, you don't develop the code to run on your machine, you develop it to run on the same target as you expect to deploy to, sitting on your desk next to you.

I have lots of my code running day-in, day-out on literally hundreds of millions of machines. The approach to "getting it working" is exactly OP's.

I'll admit to being pretty defensive and anal in checking values and return-codes (more so than most, I suspect), and I'm a firm believer in KISS principles in software engineering ("solving hard problems with complicated code is easy, solving them with simple, understandable algorithms is the hard bit") but generally there's no real difference in approach to the code I write to work on my workstation, and the code I write to work in the field.

reply
dmpk2k
2 hours ago
[-]
Embedded developers often suffer under archaic toolchains. There's plenty of reasons for that, but one of them is UB: a newer version of the compiler can completely change an embedded program's behaviour.
reply
spacedcowboy
1 hour ago
[-]
Where I was it was quite the opposite. The bloody compiler guys kept on updating the compiler, and we were required to use the OS-delivered one. Since we were often using pre-release OS's, the toolchain could change every week.

It did make you write robust and defensive code, though...

reply
summa_tech
1 hour ago
[-]
Hacker News is still skewed towards people interested in programming languages (as opposed to actually programming). Probably some sort of Y-combinator Lisp heritage. There's also a persistent minority of CS grads who think that developing / using new programming languages is the most fascinating thing in the world, and some of them hold on to that thought.

It's reasonable that such people would also be interested in design aspects of languages, and UB in C is in that field. Though I would argue that a lot of it was originally accommodating old CPU architectures without compromising performance too badly, and about as much a "design choice" as wheels being round...

reply
dminik
1 hour ago
[-]
If only it was that easy: https://silentsblog.com/2025/04/23/gta-san-andreas-win11-24h...

The real answer is that proponents of languages like C seem to completely disregard the dangers/difficulty of hitting/difficulty of fixing UB. Proponents of languages like Rust overstate it instead. Pointless wars/drama is fun to read and gets clicks.

reply
kzrdude
1 hour ago
[-]
After the rise of Rust, it has gained more visibility? But some people were interested in C in this way long ago too, I used to hang out in some godforsaken irc channel where people competed in out-pedanticing each other over the C standard.

I trust your historical C usage was more productive than that..

reply
sethev
2 hours ago
[-]
I wonder if it’s just the colorful metaphors and an opportunity to bring out examples of surprising behavior. Plus it’s a topic that can always stir up debates.
reply
aldanor
2 hours ago
[-]
If there's no UBs then what will we programmers do, there won't be enough to debug and fix?
reply
jakobnissen
2 hours ago
[-]
I would guess that the continued success of Rust has shown that we don’t have to live with the user-hostility of C in order to write system programs. Therefore, people are understandably growing less and less patient with C and its unending bullshit.

Although I haven’t noticed a spike the last 6 months, just a slowly increasing realization that C isn’t fit for humans and should go the way of asbestos: Don’t use it for anything new, and remove it where it already exists, unless doing so would be too expensive or disruptive.

reply
benj111
2 hours ago
[-]
I don't think C is hostile. C has UB for good reason. The problem is UB has been hijacked by the compiler writers for performance gains.

Personally I like C because you should have a good idea of what it's going to do. Other languages feel like a black box, and I start having to fight them far too often. But I say that as a hacker of low level stuff, not as someone who's paid and working on higher level stuff, so that is probably a niche view.

reply
rramadass
1 hour ago
[-]
Because most of the people who post/write these articles do not actually know the C language specification nor understand its design.

Understanding three important concepts properly in C allows one to easily identify what can/cannot result in UB viz. 1) Expressions 2) Statements 3) Sequence Points and "Single Update Rule". It is not that hard at all.

I wrote about it here with links to further reading provided - https://news.ycombinator.com/item?id=48144734

reply
benj111
2 hours ago
[-]
1. It's been talked about for much longer than that.

2. You don't really appreciate the issue. Signed integer overflow is undefined. If you check for that overflow after the fact the compiler can, and demonstrably has pretended that the overflow can't happen and optimised away your overflow check.

You may not even come across that failure mode to know to 'fix' it. And good luck finding the issue unless you know about UB and what the compiler can and will do in such situations.
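
A sketch of a check that sidesteps the problem by testing before the addition instead of after it. `add_would_overflow` is a hypothetical helper name:

```c
#include <limits.h>
#include <stdbool.h>

/* Returns true if a + b would overflow int. The test runs before the
   addition, so no overflowing operation is ever evaluated. */
bool add_would_overflow(int a, int b) {
    if (b > 0) return a > INT_MAX - b;
    if (b < 0) return a < INT_MIN - b;
    return false;
}
```

Because every subexpression stays in range, the compiler has no UB-based reason to remove the check.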

reply
account42
3 hours ago
[-]
There are a lot of Rust/whatever hipsters here that have defined their whole identity around hating C and C++.
reply
virtualritz
1 hour ago
[-]
Like the author of the article, I've written C/C++ for 30 years. Mostly close-to-the-metal code around computer graphics. Actually: wrote.

After switching to Rust five years ago I agree with all the Rust hipsters as far as disliking those languages go.

I just don't talk about it a lot. If every Rust person I know that was a C/C++ developer before was as outspoken about what they think of the latter, you'd see that these people are a majority.

We're just old hands who like to use stuff that works. And most of us don't get attached to code or languages.

It's also difficult to admit to yourself that you were never in command of a language as far as UB/other footguns go, as much as you thought. Or ever, for your entire career. For me that self-realization about C/C++ (enabled by Rust) was a turning point.

Lately you can read about the dichotomy re. AI use.

I.e. developers who define themselves through what they build/ideas embracing LLMs for what they can do.

I.e.: I am what I build.

Whereas developers for whom software engineering is a craft that defines them hate them openly.

I.e.: I am how I build.

Now this seems to suggest to me that maybe Rust developers who openly hate C/C++ squarely belong to the latter group whereas the silent ones belong to the former. It's builders vs programmers. Just different world views.

Also you can not dislike something and still not speak about it because you decided to not care.

reply
hnarn
2 hours ago
[-]
Ironically, by stereotyping ”Rust hipsters” you are painting yourself out as a stereotype as well. Knee-jerk comments like yours add nothing to the discussion. Rust exists for a reason, it solves real problems, but it’s not suitable for everything. These are indisputable facts and by discarding every mention of Rust as coming from ”hipsters” with no understanding, you are doing the exact same thing that you would accuse them of. ”Use Rust for everything” and ”Rust is useless for everything” are equally vapid and meaningless statements designed for nothing but trolling and showing ignorance.
reply
keyle
2 hours ago
[-]
Computers used to be cool; now they're dangerous.

Every company keeps harping on about safety and being exposed (being in the news): so the narrative against 'unsafe' is up the wazoo.

The new world is basically a bunch of city dwellers who haven't seen raw nature and you show them a lawn mower, they freak out. Blades that spin?!?!?! Madness!!

reply
pjc50
2 hours ago
[-]
If everything is going to be dependent on computers, it's probably important that they work and remain under their owner's control rather than whichever NK or Chinese hacker group gets to them first.

Can't talk about C without CVE.

reply
keyle
2 hours ago
[-]
Yeah, npm, all the yaml state machines, & now MCP Gemini --yolo entered the chat.

If you think C is the problem, you'll come to the eventual conclusion that humans are the problems, and greed. Don't hate the player, hate the game etc.

C was invented so you don't have to write assembly. It wasn't invented to expose devices to billions of other devices.

reply
jb1991
1 hour ago
[-]
Some of the C++ code in this article has not been idiomatic in over a decade, and would be considered a code smell today. The language has evolved into quite a different language than when it was first created. As soon as I saw all of those raw pointers and direct pointer access, it was clear that at least part of this article should be taken with a grain of salt.

The other obvious issue with the overall perspective is that C and C++ are being thrown together directly as if somehow they’re nearly the same language, but they are really very far apart nowadays.

reply
debugnik
1 hour ago
[-]
I was about to call out that the code is supposed to be C and not C++, but I double checked and I realised it actually says std::atomic<int>, not atomic_int!
reply
jb1991
1 hour ago
[-]
Exactly, this is very old C++ on display in this article. It’s certainly not as safe as a language like Rust, but quite a lot of undefined behavior and things that let you shoot yourself in the foot have been changed over the last 10 years.

Most C++ today will be immediately obvious and not accidentally mixed up with C.

reply
JonChesterfield
38 minutes ago
[-]
Well, you can't write malloc in conforming C, which hurts rather more than remembering to write bitcast as memcpy on char pointers.

Doesn't matter though because you aren't writing standards conforming C. You're writing whatever dialect your compilers support, and that's probably (modulo bugs) much better behaved than the spec suggests.

Or you're writing C++ and way more exposed to the adversarial-and-benevolent compiler experience.

The type aliasing rules are the only ones that routinely cause me much annoyance in C and there's always a workaround, whether it's the launder intrinsic used to implement C++, the may_alias attribute or in extremis dropping into asm. So they're a nuisance not a blocker.
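
For reference, the "bitcast as memcpy" idiom looks roughly like the sketch below. `float_bits` is a hypothetical helper name, and the size assumption is checked at compile time:

```c
#include <stdint.h>
#include <string.h>

/* Assumption: float is 32 bits wide, as on common platforms. */
_Static_assert(sizeof(float) == sizeof(uint32_t), "float must be 32 bits");

/* Reinterpret the bits of a float as a 32-bit unsigned integer.
   memcpy copies the object representation, so no aliasing rule is broken. */
uint32_t float_bits(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return u;
}
```

Mainstream compilers typically turn the `memcpy` into a plain register move at normal optimization levels.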

reply
debugnik
5 hours ago
[-]
As much as I agree with the intro, these examples aren't good and the overall article is just a veil for pushing LLM coding.
reply
gblargg
2 hours ago
[-]
Agreed. One after another these are standard things you avoid when writing portable code (or don't need, like accessing the object at address 0). They come across like from someone who wants to write whatever they want and have it work the same on everything. To make it into a language that allows this would remove its advantage of being able to write to the platform when you want to.
reply
boxed
4 hours ago
[-]
Not good how? Are they TRUE? If so that's super bad.
reply
IshKebab
4 hours ago
[-]
They are true but I agree it's not a great article. C has an unending list of UB and given the title I was expecting a more comprehensive survey, but they actually just picked a few that are both fairly well known and not very interesting.
reply
thomashabets2
3 hours ago
[-]
Author here.

As I stated:

> The following is not an attempt at enumerating all the UB in the world. It’s merely making the case that UB is everywhere, and if nobody can do it right, how is it even fair to blame the programmer? My point is that ALL nontrivial C/C++ code has UB.

It's about that point, not about how to avoid it. Because you can't.

reply
HelloNurse
3 hours ago
[-]
Some of the examples are somewhat formally true in theory and bullshit in practice; some are quite hallucinatory.

  - Creating a potentially troublesome misaligned int pointer is a precisely localized and completely explicit user mistake, not something that just happens because it's C.
  - Passing signed char to character classification functions that expect an unsigned char (disguised as an int) is a very specific dumb user error. The C standard could specify that all negative inputs, including EOF and invalid signed char values, are classified as not belonging to the character class, but I doubt the current undefined behaviour in isxdigit() etc. implementations ever went beyond accepting invalid inputs.
  - Casting floating point values to integer values in general requires taking care of whether the FP values are small enough to be represented and what to do with NaN and Inf values: not the language's responsibility. C offers a toolbox of tests, not ready-made application specific error handling.
  - Expecting C to handle "address zero" in physical memory in ways that conflict with NULL in source code denotes a complete lack of understanding of what a program is. Where stuff in an executable is loaded in memory, in the rare cases when it matters, can surely be affected with platform specific extensions, possibly at the level of linker commands with nothing appearing in the C source code.
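
For the floating-point bullet, the kind of range test C leaves to the programmer might look like the sketch below. `double_to_int` is a hypothetical helper, and the bounds assume `int` is narrow enough (for example 32 bits) that `INT_MIN - 1.0` and `INT_MAX + 1.0` are exactly representable as `double`:

```c
#include <limits.h>
#include <stdbool.h>

/* Convert d to int only if the truncated value is representable.
   NaN fails both comparisons, so it is rejected along with Inf. */
bool double_to_int(double d, int *out) {
    if (!(d > (double)INT_MIN - 1.0 && d < (double)INT_MAX + 1.0))
        return false;
    *out = (int)d;  /* in range, so the conversion is well defined */
    return true;
}
```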
reply
thomashabets2
3 hours ago
[-]
Author here.

So I see your counter points are all "so just don't do that, then".

And the point of my post is that this particular "just don't do that, then" has never been achieved by humans.

If there's no example of a program without these bugs in a language, then I do think it's fair to blame the language. A knife with 16 blades and no handle.

> Expecting C to handle "address zero" in physical memory in ways that conflict with NULL in source code denotes a complete lack of understanding of what a program is.

Like the post says, it's rare that programmers actually want a pointer to memory address zero. But in my experience most programmers who even encounter that have this "complete lack of understanding", as you put it.

reply
HelloNurse
2 hours ago
[-]
"Just don't do that" is the correct approach to errors, even when they are easy to overlook and the programming language provides many opportunities for mistakes.

For example, you seem to underestimate how wrong placing negative values in a signed char is: ordinary character encodings do not use negative codes, so either those negative values are not characters and they have no business being treated as such, or something strange and experimental is going on.
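
The usual defensive idiom, as a sketch: convert each `char` to `unsigned char` before handing it to a `<ctype.h>` function. `count_xdigits` is a hypothetical helper name:

```c
#include <ctype.h>
#include <stddef.h>

/* Count hexadecimal digits in a string. Each char is converted to
   unsigned char first, so isxdigit() only ever sees values in the
   range of unsigned char, even where plain char is signed. */
size_t count_xdigits(const char *s) {
    size_t n = 0;
    for (; *s != '\0'; s++) {
        if (isxdigit((unsigned char)*s))
            n++;
    }
    return n;
}
```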

reply
dminik
1 hour ago
[-]
Just don't fall bro. It's that easy. No railings required.
reply
maple3142
3 hours ago
[-]
Is this a correct understanding of UB in C? A program P has a set of inputs A that do not trigger UB, and a complementary set of inputs B that do trigger UB. A correct compiler compiles P into an executable P'. For all inputs in A, P' should behave the same as P. However, for any input in B, there are absolutely no requirements on the behavior of P'.
reply
simonask
3 hours ago
[-]
Intuitively yes - the program will be compiled as if B-inputs are never passed to the program, and that can include eliminating code that tries to detect B-inputs.
reply
mbrock
2 hours ago
[-]
This is a description of an imaginary compiler, evoked by the ANSI/ISO standards documents, which has never existed and will never exist. To understand what the program will do, you just have to understand the compiler behavior on your target platforms. A helpful intuition pump is: imagine the ANSI/ISO specifications simply do not exist; now what? Well, you just continue your engineering practice, the way you would for any of the myriad languages that never even had a post hoc standards document.
reply
simonask
1 hour ago
[-]
> just

That word is carrying a lot of weight here. Compilers are unbelievably complex these days, and it's impossible for any one human to fully understand the entire compilation process, including the effects of any arbitrary combination of compiler flags.

Any assumptions you have about what the compiler does in the face of UB will collapse on the next patch release of that compiler, or the moment somebody changes the compiler flags, or the moment somebody tries to compile the code for a slightly different OS, not to mention architecture.

There is no other way to understand what C compilers do than reading the standard.

reply
mbrock
1 hour ago
[-]
Yet the standard does not tell you what the compilers do.

Linux works on a wide variety of platforms. It also relies on those platforms behaving predictably with respect to what the standard leaves undefined.

This description of ISO UB as a totally insane wonderland of random, malevolent semantics just doesn't describe reality.

reply
saagarjha
46 minutes ago
[-]
Up until the compilers do something to your code that you don’t understand.
reply
mbrock
38 minutes ago
[-]
yeah then I have to learn how it works and what it assumes and how I can control it and maybe switch to a more well behaved compiler if it's truly insane
reply
JonChesterfield
35 minutes ago
[-]
Not imaginary. Eliding checks on nullptr and integer overflow were both implemented, shipped, miscompiled the linux kernel and grew flags to disable them. I expect there are more if one goes looking.
reply
mbrock
19 minutes ago
[-]
Well yeah that just means some aspects of the imaginary compiler were in some configurations approximated by some historical compiler versions and were in some cases rejected by the community (which cares about sane semantics even for behavior left undefined by ANSI/ISO) and in some cases left in as defaults but made trivially configurable for anyone who wants to define the undefined behavior.
reply
Retr0id
1 hour ago
[-]
GCC -O1 and clang -O1 will both optimize this function under the assumption that inputs that cause signed integer overflow are never passed:

    int will_overflow(int a, int b) {
        int sum = a + b;
        if (b > 0 && sum < a)
            return 1;
        return 0;
    }
reply
mbrock
1 hour ago
[-]
Right, good example, and both GCC and Clang offer well understood parameters for deciding, per compilation unit, what behavior you want for signed overflow (-fwrapv, -fno-strict-overflow, etc), so in reality it's quite far from spooky arbitrary nasal demons.
reply
skydhash
6 minutes ago
[-]
Wouldn’t it be better to check both inputs before against the max value of that type instead of actually doing the overflow?
reply
Retr0id
4 minutes ago
[-]
There are lots of better ways of doing this, but knowing why this one is bad/wrong requires the mental model described upthread.

(But also, what you describe would be incorrect, since two <MAX values can add to a value that is >MAX, and overflow)
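
One of those better ways, as a sketch that assumes GCC or Clang: `__builtin_add_overflow` is an extension of those compilers, and `checked_add` is a hypothetical wrapper name:

```c
#include <stdbool.h>

/* Store a + b in *sum and return true if the addition did not overflow.
   __builtin_add_overflow (GCC/Clang extension) computes the result as if
   in infinite precision and reports whether it fits in *sum's type. */
bool checked_add(int a, int b, int *sum) {
    return !__builtin_add_overflow(a, b, sum);
}
```

No signed overflow is ever evaluated in the C sense, so there is nothing for the optimizer to assume away.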

reply
1718627440
2 hours ago
[-]
Yes, that's a good summary.
reply
danborn26
1 hour ago
[-]
The scariest part is how many production systems rely on undefined behavior without anyone knowing until a compiler update breaks everything.
reply
rom1v
3 hours ago
[-]
A concrete example of undefined behavior caused by an unaligned pointer: https://pzemtsov.github.io/2016/11/06/bug-story-alignment-on...
reply
gblargg
2 hours ago
[-]
Specifically on x86 where it's assumed that won't cause problems.
reply
__0x01
5 hours ago
[-]
> A problem with this is that in order to confirm the findings, you’ll need an expert human. But generally expert humans are busy doing other things.

The article suggests using LLMs to identify and fix UB. However as per the above, I think the issue is that we need more expert humans.

LLM generated code will eventually contain UB.

EDIT: added "eventually"

reply
flohofwoe
4 hours ago
[-]
It would already help a lot when the C and C++ standards start to clean up the list of Undefined Behaviour (e.g. there's a lot of nonsense UB currently in the C standard which could easily become Defined Behaviour - like the "file doesn't end in a new-line character" thing):

https://gist.github.com/Earnestly/7c903f481ff9d29a3dd1

reply
layer8
4 hours ago
[-]
The easy cases like you cite are also those that don’t cause problems in practice. I’m not sure that would help all that much, other than to slightly reduce internet criticism.
reply
talkin
3 hours ago
[-]
Fixing easy cases makes the list shorter, so enables more focus on harder cases.

And it also signals that you actually do want to improve, just a little bit of boy scout rule goes a long way.

reply
gpderetta
2 hours ago
[-]
The issue is that the list is infinite (anything not specified is UB), so actually removing any finite amount of UB from the list won't make it shorter.

(only slightly tongue-in-cheek, I do believe that removing silly things is worthwhile).

reply
1718627440
2 hours ago
[-]
The list of UB categories and rules is not infinite. The list of UB programs is, as is the list of all non UB programs.
reply
gpderetta
47 minutes ago
[-]
It is not obvious to me that the list of categories is not infinite (unless the final category is "everything else" of course)
reply
thomashabets2
2 hours ago
[-]
Author here.

> The article suggests using LLMs to identify and fix UB. However as per the above, I think the issue is that we need more expert humans.

Yup. But the point of the article is that even expert humans cannot do this alone. And as I wrote, LLM+junior won't suffice either. We need LLM+senior experts.

And it's a problem that we have way more existing UB than expert capacity.

Now, will LLMs and experts both miss UB in some cases? Of course. There's no 100% solution. But LLMs, I claim, will find orders of magnitude more, with a low false-positive rate, than any expert. Even if these expert humans (like in the OpenBSD case for the two bugs I found, one of which was UB) are given more than three decades to do it.

I didn't even use the best model, complex code target, or time. I just wanted to choose a target that has a high chance of having very good experts already having audited it.

reply
eru
4 hours ago
[-]
Our LLM-powered coding assistants are pretty good at doing lots of busywork that doesn't require all that much smarts. So they can supervise running our UB checks, like Valgrind, and making the linters happy.
reply
lelanthran
4 hours ago
[-]
> LLM generated code will eventually contain UB.

Yes.

Even in languages other than C (i.e. you will get behaviour that nothing in the input specified).

When LLMs generate code, all languages have UB.

reply
eru
4 hours ago
[-]
That's a bit silly.

UB means literally no restrictions. So if your standard says 'you have to crash with an error message' that's already no longer UB.

reply
lelanthran
4 hours ago
[-]
> So if your standard says 'you have to crash with an error message' that's already no longer UB.

Sure. For crashes. But when you instruct an LLM to do something, the output is probabilistic, so you may get behaviour that is unexpected and/or unwanted.

Like storing security tokens in code. Or nuking the production database.

reply
rurban
4 hours ago
[-]
Very bad advice. Of course good new LLMs know about UB, but you still need to use ubsan (i.e. -fsanitize=undefined), and not your LLM.
reply
formerly_proven
4 hours ago
[-]
Coding agents write unsound Rust any day, too. unsafe impl Send … is much easier than fixing a bad design and it might even work momentarily.
reply
wyldfire
41 minutes ago
[-]
Maybe we should criminalize writing articles about Undefined Behavior that have a "So what do we do now?" subheader but omit any mention of UBSan.
reply
mjs01
3 hours ago
[-]
Integer promotion seems to be the source of many signed integer overflow UB. Why does C have it? Does integer promotion ever have a good part?
reply
saagarjha
43 minutes ago
[-]
Yes, it simplifies a lot of code that would otherwise be littered with casts.
reply
peterfirefly
5 minutes ago
[-]
Could be fixed by having a nicer casting syntax (like Rust) or by not having so damn many scalar types that are used in practice.

"Explicit casts only" worked fine in Modula-2, which doesn't have as many scalar types.

reply
keyle
2 hours ago
[-]
When talking UB, putting C and C++ in the same basket is basically like comparing drunk driving a car and riding a bicycle sober... Both means of transport, very different experience.
reply
lelanthran
3 hours ago
[-]
I read through this in detail... Is it just me, or are these things that are invoked by intentionally bypassing the typing?

I mean, you have to go out of your way and use a cast to get the UB in the first example.

For the `isxdigit` implementation, using a parameter to index into an array without a length check is pretty suspect already. I don't think any of my code actually indexes an array without checking the length in some way.

For the float -> int conversion, converting a float to an int without picking a conversion does not make sense in the first place - math.h has rounding and ceiling functions.
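
Even after picking a conversion, the value still has to fit in the target type. A minimal sketch of a range check before a truncating cast, assuming a two's-complement `int` (so `-(float)INT_MIN` is an exact power of two); `float_to_int_checked` is a made-up name:

```c
#include <limits.h>

/* Hypothetical helper: truncates f toward zero into *out only when the
 * result fits in int. Returns 1 on success, 0 for NaN or out-of-range input. */
int float_to_int_checked(float f, int *out) {
    /* (float)INT_MAX would round up to 2^31 for 32-bit int, so compare
     * against -(float)INT_MIN with a strict "<". NaN fails both tests. */
    if (f >= (float)INT_MIN && f < -(float)INT_MIN) {
        *out = (int)f;   /* in range, so the conversion is defined */
        return 1;
    }
    return 0;
}
```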

> For all you know the compiler has no internal way to even express your intention here.

I'm human, not a compiler, and even I cannot tell what the intention is behind trying to call NULL as a function. What exactly is expected to happen?

> Because the argument needs to be a pointer, and the NULL macro may be misinterpreted as an integer zero.

I don't think this is true for C. The NULL macro is defined to be a pointer in the C standard, AFAIK. Just because comparisons with zero are allowed, does not imply that the standard implicitly promotes NULL to `int`.

I think only the final one is of note (the 24-bit shift assigned to a uint64_t).

reply
account42
2 hours ago
[-]
> I don't think this is true for C. The NULL macro is defined to be a pointer in the C standard, AFAIK. Just because comparisons with zero are allowed, does not imply that the standard implicitly promotes NULL to `int`.

Probably confusion with C++ where NULL is 0 which is a special case that can be implicitly cast to both integers and pointers, unlike non-zero constants. C doesn't need this because it doesn't require explicit casts from void pointers to others.

reply
elnatro
1 hour ago
[-]
Is there a way to avoid undefined behavior in C then? Could we write a new C compiler that adds some checks and fixes (e.g. raise documented exceptions) to each undefined behavior?
reply
peterfirefly
3 minutes ago
[-]
ubsan.

Doesn't catch all of it.

reply
saagarjha
42 minutes ago
[-]
Not all of them but there are many tools that can try to define behavior for this code to help shake them out of your codebase.
reply
u1hcw9nx
1 hour ago
[-]
That post is just a hyperbolic rhetorical piece, not even a good technical shade. There are plenty of tools that restrict C to a defined-behavior subset. HN is just not aware of them. NASA, aerospace, and the car industry are big customers of such static analyzers and compilers.

Good open source ones:

Frama-C

IKOS (from NASA)

reply
elnatro
22 minutes ago
[-]
It’s been a while since I programmed in C. Thank you for these resources.
reply
codeflo
3 hours ago
[-]
> The compiler, and really the underlying hardware too, is playing a game of telephone with your UB intentions.

The part about hardware is wrong BTW. In all the cases about null pointers and out-of-bounds access and integer overflow and whatnot, the hardware semantics are clearly defined, and the assembler code does exactly what is written. The way modern compilers act on your code makes C less safe than assembler in that sense.

reply
thomashabets2
2 hours ago
[-]
Author here

> The part about hardware is wrong BTW

Could you be more specific? I think by "wrong" you may mean "not actually relevant to UB", and you're right about that. If that's what you mean then that part is not for you. It's for the "but it's demonstrably fine" crowd.

> the hardware semantics are clearly defined

Yup. The article means to dive from the C abstract machine to illustrate how your defined intentions (in your head), written as UB C, get translated into defined hardware behavior that you did not intend.

I'm not saying the CPU has UB, and I wonder what part made you think I did.

That's what I mean by a game of telephone. The UB parts get interpreted as real instructions by the hardware, and it will definitely do those things. But what are those things? It's not the things you intended, and any "common sense" reading of the C code is irrelevant, because the C representation of your intentions was UB.

reply
weinzierl
5 hours ago
[-]
A fun one that'd fit the list would be sequence point violations like

    i = i++
reply
radiospiel
5 hours ago
[-]
Fun, sure, but also GCC and Clang will both warn with -Wall (-Wsequence-point / -Wunsequenced).
reply
leni536
3 hours ago
[-]
Only in C, that one is defined in C++.

edit: I'm not sure it's even undefined in C.

reply
account42
3 hours ago
[-]
This would also be a code smell even if it was well defined.
reply
bvrmn
3 hours ago
[-]
I really like Zig's approach to UB, especially that alignment is part of the type, and all those wordy builtins for conversions. Staring at them makes you think about what you're doing wrong with your data model when it now requires 3 lines of casting expressions.
reply
akiarie
3 hours ago
[-]
C is still, by far, the simplest language that we have.

Although many newer languages are safer (with the exclusion of Rust, primarily by being slower) the same kinds of issues that are there in C are there in these languages, their effects are just harder to see.

People complain about C as though they know how to fix it.

reply
simonask
3 hours ago
[-]
C is not a simple language in the sense that writing software in C is simple, and I think that's the only useful way to understand the word "simple" in this context.

Brainfuck is "simple" by any other definition as well, but that's not a useful quality.

reply
spacedcowboy
2 hours ago
[-]
C is a far simpler language than, for example, Swift. Its cognitive load in order to actually write something is pretty small - even the authors state that their book about C is intentionally slim because the concepts to understand are not that many.

That doesn't mean that C is a safer language than Swift, or a less-capable language than Swift. But in terms of "easy to understand along the happy-path", it's a lot easier to get going in C.

Swift, for example, bakes a whole load of CS-degree-level ideas and concepts into the basic language with its optionals, unwrapping, type-inference, async/await, existential types, ... ... ... . C doesn't do any of that. There are (many!) more footguns in C, but the language is less complex as a result.

Brainfuck is not at all simple, from that point of view. This is a valid Brainfuck program:

>+++++++++[<++++++++>-]<.>+++++++[<++++>-]<+.+++++++..+++.[-]>++++++++[<++++>-]<. >+++++++++++[<+++++>-]<.>++++++++[<+++>-]<.+++.------.--------.[-]>++++++++[<++++ >-]<+.[-]++++++++++.

This is the equivalent C program

    #include <stdio.h>
    int main() { printf("Hello world!\n"); }

One of these is far simpler than the other.

[edit: changed to make the examples do the same thing]

reply
simonask
1 hour ago
[-]
The point I'm getting at is that your definition of "simple" (a word that should be banned among programmers) is not useful, if it is even meaningful.

The brainfuck example is "simpler": Only 8 kinds of tokens! Not really useful, though.

The cognitive load of _actually delivering software_ written in C is immensely greater than doing so with Swift, or Rust, or Python, or Java, even Zig, despite all of those leveraging much heavier machinery in order to deliver a friendlier abstract model for you to program against.

The tragedy of C is that, in addition to only delivering very baseline abstraction tools, it also adds its own set of seemingly arbitrary rules and requirements that come from nowhere but the C standard. Fictitious limitations to suit a bygone era. The abstract model of C is fine in some places, but definitely not fine in other places, and my hypothesis is that most UB in practice comes from a mismatch between programmer intuitions and C's idiosyncrasies.

reply
spacedcowboy
1 hour ago
[-]
Calling something "simple" to use and learn is a valid use of the word, sorry. Not going to stop doing that.

> The cognitive load of _actually delivering software_ written in C is immensely greater than doing so with Swift, or Rust, or Python, or Java, even Zig, despite all of those leveraging much heavier machinery in order to deliver a friendlier abstract model for you to program against

Sorry, I couldn't disagree more.

I find the simplicity of C to be elegant. You know the rules; it's like the entire C language is the 1-page summary of the encyclopaedia of C++ or Swift or Java, or (insert more-modern language here). The key to working well in C is in defining modular code with well-understood interfaces. I've got 40 years of programming in C so far, and the nightmare stories ran out after the first few years. Programming discipline is a thing.

Similarly, ObjC is a far superior, much simpler, object-oriented language than C++, there's about 15 different things over C, and you know the language. Template metaprogramming. Phooey! You'll still have to learn object-orientated programming semantics, but it's a "simple" language.

BTW: If you think the brainfuck language example is in any way easier to understand than the C one, I think you might need medication. /j

reply
saagarjha
40 minutes ago
[-]
> I've got 40 years of programming in C so far, and the nightmare stories ran out after the first few years.

You need to find something more interesting to do ;)

reply
dns_snek
3 hours ago
[-]
Can you elaborate what do you think C has in terms of simplicity that Zig doesn't, and which "same kinds of issues" do you think it has?

I'm not an expert in either language but my anecdotal experience disagrees with this - writing Zig has been far simpler and less error-prone than writing C.

reply
fjfaase
3 hours ago
[-]
Is comparing a signed integer with an unsigned integer UB? I recently wrote some code and compiled it with gcc to x86_64 (without optimization) that returned an incorrect answer.
reply
Karliss
2 hours ago
[-]
No UB, but the integer promotions rules apply.

When comparing signed and unsigned integers of same size the signed one will be converted to unsigned. In a reasonably configured project compiler will warn about it.

In case of integers smaller than int, promotion to int happens first.

In case of signed and unsigned integers of different size, the smaller one will be converted to bigger one.
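
A minimal sketch of the same-size case, assuming `int` and `unsigned` have the same width; `naive_less` and `safe_less` are names made up for illustration:

```c
/* Naive comparison: the usual arithmetic conversions turn the int into
 * unsigned, so a negative value becomes a very large positive one. */
int naive_less(int s, unsigned u) {
    return s < u;
}

/* Sign-aware comparison: a negative int is smaller than any unsigned value;
 * otherwise the cast to unsigned preserves the value and is safe to compare. */
int safe_less(int s, unsigned u) {
    return s < 0 || (unsigned)s < u;
}
```

With these definitions, `naive_less(-1, 1u)` yields 0 while `safe_less(-1, 1u)` yields 1. Nothing here is UB, the naive version is just a well-defined wrong answer.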

reply
benchloftbrunch
2 hours ago
[-]
It's not UB. Integer promotion applies, the signed int is implicitly coerced to unsigned (or the other way around - don't remember which.)
reply
y42
2 hours ago
[-]
shameless plug, it's part of the Nerd Encyclopedia: it's also called "nasal demons".

https://nickyreinert.de/2023/2023-05-16-nerd-enzyklop%C3%A4d...

reply
ricardobeat
2 hours ago
[-]
I’ve been heavily invested in https://c3-lang.org/ the past couple months. How does it look from this perspective to someone with C experience?
reply
synergy20
1 hour ago
[-]
If C is more UB-unsafe than it seems, what is the solution here?
reply
justmarc
2 hours ago
[-]
The art is actually making sure it all stays defined behavior
reply
raluk
5 hours ago
[-]
In C / C++ there are two kinds of undefined behaviour. One is where the standard explicitly says something is UB. The other is everything else that is not in the standard.
reply
wiseowise
5 hours ago
[-]
reply
thaumasiotes
5 hours ago
[-]
Technically, that's only one kind, because it's written in the standard that anything not mentioned in the standard is undefined behavior.
reply
cepepe
5 hours ago
[-]
One kind, but two different classes of undefined behaviour.
reply
alper
2 hours ago
[-]
Isn't the article mostly saying that SPARC sucks?
reply
NooneAtAll3
57 minutes ago
[-]
feels like https://xkcd.com/1499/

the only people complaining about being able to do awful things are people that do awful things

reply
gritzko
54 minutes ago
[-]
- a metal bar always sinks

- unless you are trying to sink it in mercury. then it floats

- unless it is an uranium bar

- go sink uranium bars in mercury yourself

reply
mbrock
4 hours ago
[-]
most languages don't even HAVE a specification so in most languages literally EVERYTHING is undefined behavior
reply
oersted
4 hours ago
[-]
UB doesn't mean that it is not specified (actually it is often very well specified), it means that compilers can and do assume that such code patterns will not be present. Those cases may not be considered and can lead to unexpected behaviour.

Additionally, some (most?) UB is intentionally UB so that optimisers are free to do fancy tricks assuming that certain cases will never happen. Indeed, this is required for high performance. If they do happen, again, it can lead to unexpected behaviour.

PS: Most languages that don't have a specification declare their primary implementation to be specification-as-code. Rust is an example of that, and it does still have UB: the cases that the compiler assumes will not happen.

reply
mbrock
3 hours ago
[-]
undefined behavior is the behavior of code patterns "for which this International Standard imposes no requirements" and the behavior is in fact almost always predictable and agreed upon by compiler vendors and the users of the language, which is why you are able to use programs that rely on undefined behavior probably every single second you are using the computer

edit: for example I'm typing this into Safari which means probably every key press and event is going through JSC JIT compiled functions—which have, structurally and necessarily and intentionally, COMPLETELY undefined behavior according to the spec—and yet it miraculously works, perfectly, because the spec doesn't really matter

reply
saagarjha
38 minutes ago
[-]
It matters when your JSC JIT is full of security holes
reply
mbrock
38 minutes ago
[-]
ok what's the alternative?
reply
saagarjha
35 minutes ago
[-]
Removing the undefined behavior
reply
mbrock
23 minutes ago
[-]
you mean removing the JIT?
reply
saagarjha
7 minutes ago
[-]
No
reply
benj111
2 hours ago
[-]
The issue for me with posts like this is that it misses the issue.

Unaligned pointer accesses are UB because different systems handle it differently. This 'should' be to allow the program to be portable by doing what the system normally does.

Instead it's been hijacked by compiler writers, with the logic that "X is UB, therefore can't happen, therefore can be optimised away."

    int c = abs(a) + abs(b);
    if (a > c) // overflow

Is UB because some system might do overflow differently. In practice every system wraps around.

That should be a valid check, instead it gets optimised away because it 'can't' happen.

C gives you enough rope to hang yourself. The compiler writers don't trust you to use the rope properly.

reply
veltas
5 hours ago
[-]
From the ANSI C standard:

  3.16 undefined behavior: Behavior, upon use of a nonportable or erroneous program construct, of erroneous data, or of indeterminately valued objects, for which this International Standard imposes no requirements.  Permissible undefined behavior ranges from ignoring the situation completely with unpredictable results, to behaving during translation or program execution in a documented manner characteristic of the environment (with or without the issuance of a diagnostic message).
Is it just me or did compiler writers apply overly legalistic interpretation to the "no requirements" part in this paragraph? The intent here is extremely clear, that undefined behavior means you're doing something not intended or specified by the language, but that the consequence of this should be somewhat bounded or as expected for the target machine. This is closer to our old school understanding of UB.

By 'bounded', this obviously ignores the security consequences of e.g. buffer overflows, but just because UB can be exploited doesn't mean it's appropriate for e.g. the compiler to exploit it too, that clearly violates the intent of this paragraph.

reply
dataflow
5 hours ago
[-]
> but that the consequence of this should be somewhat bounded or as expected for the target machine.

Aren't "unpredictable results" and "no requirements" contrary to the idea that the behavior would be "somewhat bounded"?

reply
veltas
5 hours ago
[-]
Notice though "ignoring the situation" thru "documented manner characteristic of the environment". Even though truly you can read this in an uncharitable way, you could also try and understand the intent of this paragraph, and I think reading it for its intents is always the best way to interpret a language standard when the wording is ambiguous or soft, especially if you're writing a compiler.

I don't think you could sincerely argue that this definition intends to allow the compiler to totally rewrite your code because of one guaranteed UB detected on line 5, just that it would be good to print a diagnostic if it can be detected, and if not to do what's "characteristic of the environment". Does that make sense?

reply
gpderetta
5 hours ago
[-]
Ex falso quodlibet.

Bounding UB would be a nice idea, or at least prohibiting time-traveling UB (and there is an effort in that direction). But properly specifying it is actually hard.

reply
account42
2 hours ago
[-]
Prohibiting "time-travelling" UB would be horrible as that's a very important mechanism for dead code elimination.
reply
dzaima
1 hour ago
[-]
Even if you forbid "time travel", you can still technically optimize many things as if time travel happened anyway - e.g. want to time-travel back to before some memory store? just pretend that the store happened, but then afterwards the previous value was stored back (and no other threads happen to see the intermediate value)!

Only things you need to worry about then are things with actual observable side-effects - volatile, printf and similar - and C23 does note that all observable behavior should happen even if UB follows, and compilers can't generally optimize function calls anyway (e.g. on systems on which you can define custom printf callbacks, you could put an exit(0) in such, and thus make it incorrect to optimize out a printf ever).

reply
cracki
4 hours ago
[-]
Reading for intent is pragmatic.

Reading adversarially is what people do who are looking for ways that something can be abused, from an offensive or defensive position.

Personally I am tired of the entire topic.

reply
veltas
4 hours ago
[-]
What's bad is when your compiler writers and most of the people involved in standardisation are reading it adversarially.
reply
account42
2 hours ago
[-]
It's bad when compiler writers want to optimize correct code as much as possible, which is something their actual customers keep asking for?
reply
veltas
1 hour ago
[-]
When would optimizing correct code be harmed by not abusing UB (beyond its original intent, e.g. array access should be without overhead of checking for overflow)?
reply
thomashabets2
2 hours ago
[-]
Author here.

I touched on this in the "it's not about optimizations" section. It's not that the compiler is out to get you. It's that you told it to do something it cannot express.

It's like if you slipped in a word in French, and not being programmed for French, it misheard the word as a false friend in English. The compiler had no way to represent the French word in its parse tree.

So no, it's not overly legalistic. Like if the compiler knows that this hardware can do unaligned memory access, but not atomic unaligned access, should it check for alignment in std::atomic<int> ptr but not in int ptr? Probably not, right?

reply
veltas
1 hour ago
[-]
It's not that your article specifically discusses this aspect, but I think it's an important part of the conversation that's being overlooked by commentators, that we've twisted the original intent of UB and made unnecessary work for ourselves. There's been too much scaremongering about UB that's gone beyond the real concerns. If you only fear UB and don't understand it then you are worse off for trying to write safe C or C++.
reply
1718627440
1 hour ago
[-]
The behaviour is bounded by the capability of your machine. It is unlikely that your desktop computer launches a nuclear missile, unless you worked for it to be able to do that.
reply
lelanthran
4 hours ago
[-]
> Is it just me or did compiler writers apply overly legalistic interpretation to the "no requirements" part in this paragraph?

I've (fruitlessly) had this discussion on HN before - super-aggressive optimisations for diminishing rewards are the norm in modern compilers.

In old C compilers, dereferencing NULL was reliable - the code that dereferenced NULL will always be emitted. Now, dereferencing NULL is not reliable, because the compiler may remove that and the program may fail in ways not anticipated (i.e, no access is attempted to memory location 0).

The compiler authors are on the standards committee, and they tend to push for more cases of UB being added rather than removing what UB there is right now (for example, by replacing it with Implementation Defined Behaviour).

reply
my-next-account
5 hours ago
[-]
Hello, it's me. I'm not afraid of UB.
reply
saagarjha
38 minutes ago
[-]
You should be!
reply
my-next-account
4 hours ago
[-]
To be honest, miscompilations because of UB are exceedingly rare, and we do a lot of weird shit in our code.
reply
fithisux
5 hours ago
[-]
UB can also have impact in logical cohesion of codebase.
reply
cracki
5 hours ago
[-]
We know. This is not news.
reply
boxed
4 hours ago
[-]
It seems to be to many many programmers who keep using C++
reply
VimEscapeArtist
4 hours ago
[-]
Wait until he discovers PowerShell ;D
reply
logicchains
5 hours ago
[-]
The concept of undefined behaviour is also a very useful lens for understanding LLM-based coding. Anything you don't explicitly specify is undefined behavior, so if you don't want the LLM to potentially pick a ridiculous implementation for some aspect of an application, make sure to explicitly specify how it should be implemented.
reply
SanjayMehta
3 hours ago
[-]
I used to teach C programming and one time I got anonymous feedback: "when this instructor doesn't know the answer he says "it's compiler dependent.""

Shrug.

reply
jraph
5 hours ago
[-]
Yet another push to use LLMs after casting fear. Now it should be illegal not to use LLMs. A good start of the day.

(I hope casting fear is not UB)

reply
wg0
5 hours ago
[-]
The irony is unmistakable.
reply
stevenhuang
4 hours ago
[-]
There is nothing ironic in letting an llm have a pass at identifying potential UB and other correctness issues in C code.

I say this as an experienced C developer.

reply
wg0
4 hours ago
[-]
It is ironic because the behaviour of an LLM itself is UB. Guaranteed.
reply
raverbashing
5 hours ago
[-]
> (I hope casting fear is not UB)

I'm sure that's UB in C

In C++ just use <reinterpret_cast>

reply
nokeya
5 hours ago
[-]
Ok, and?
reply
wg0
5 hours ago
[-]
"Rewrite everything in Rust. OMG universe is written in Rust so memory safe with zero allocations"
reply
stackghost
5 hours ago
[-]
Anyone who uses the construction "C/C++" doesn't write modern C++, and probably isn't very familiar with the recent revisions despite TFA's claims of writing it every day for decades.

Far from being just "C with classes", modern C++ is very different than C. The language is huge and complex, for sure, but nobody is forced to use all of it.

No HN comment can possibly cover all the use cases of C++ but in general, unless you have a very good reason not to:

- eschewing boomer loops in favor of ranges

- using RAII with smart pointers

- move semantics

- using STL containers instead of raw arrays

- borrowing using spans and string views

These things go a long way towards, shall we say, "safe-ish" code without UB. It is not memory-safe enforced at the language level, like Rust, but the upshot is you never need to deal with the Rust community :^)

reply
veltas
5 hours ago
[-]
Although some people, like Bjarne Stroustrup, object to the term C/C++, it's a bit like Richard Stallman objecting to the term "Linux". The fact is it can mean "C or C++", and I wouldn't assume the author thinks they're the same, but they're talking about both of them together in the same sentence. This seems reasonable given this is about undefined behavior, and it's trivial to accidentally write UB-inducing code in C++ even with modern style (although I'd say you should catch most trivial cases with e.g. ubsan, and a lot of bad cases would be avoided with e.g. ranges, so I think the article is exaggerating the issue).
reply
stackghost
5 hours ago
[-]
Well, the author explicitly refers to "C/C++" as one language:

>After all, C/C++ is not a memory safe language.

reply
thomashabets2
2 hours ago
[-]
That is a typo, that I think I introduced when I went back to clarify that it applies to C++ too.

Will fix it.

reply
rectang
5 hours ago
[-]
> the upshot is you never need to deal with the Rust community

In the end, everything comes down to culture war.

reply
stackghost
5 hours ago
[-]
Perhaps we should rewrite our culture in Rust.
reply
thomashabets2
2 hours ago
[-]
Author here.

In the context of UB discussion, the arguments apply equally to C and C++.

How would you write that?

I entirely agree with all your points that C and C++ are completely different languages at this point. And yet I wanted to write this post about something that is true for both.

reply
SpaceNugget
5 hours ago
[-]
I totally agree that modern c++ is pretty robust if you are both a well seasoned developer and only stick to a very blessed subset of its features and avoid the historical baggage.

However, that's obviously not the point? Ignoring the idea that people can/should just "git gud" and write perfect code in a language with lots of old traps, you can't control how everyone else writes their code, even on your own team once it gets big enough. And there will always be junior devs stumbling into the bear traps of c/c++ (even if the rest of the codebase is all modern c++). So no matter how many great new features get added to C++, until (never) they start taking away the bad ones, the danger inherent to writing in that language doesn't go away.

Also, safe != non-UB. TFA isn't so much about memory safety anyway.

reply
flohofwoe
5 hours ago
[-]
"C/C++" is still a useful term for the common C/C++ subset :)

As far as stdlib usage is concerned: that's just your opinion. The stdlib has a lot of footguns and terrible design decisions too, e.g. std::vector pulling in 20k lines of code into each compilation unit is simply bizarre.

Also:

- eschewing boomer loops in favor of ranges

Those "boomer loops" compile infinitely faster than the new ranges stuff (and they are arguably more readable too): https://aras-p.info/blog/2018/12/28/Modern-C-Lamentations/

- borrowing using spans and string views

Those are just as unsafe as raw pointers. It's not really "borrowing" when the referenced data can disappear while the "borrow" is active.

reply
m-schuetz
5 hours ago
[-]
C/C++ is a perfectly fine term for C or C-style C++. The languages can be very close, and personally I prefer C-style C++ miles over some of the half-baked modern nonsense. I mean, I do use C++23 since it has some great additions, but I'm ditching like 90% of the stuff that only adds complexity without much benefit.
reply
dmitrygr
5 hours ago
[-]
I stopped reading about here:

    > bool parse_packet(const uint8_t* bytes) {
    >   const int* magic_intp = (const int*)bytes;   // UB!
Author, if you are reading this, please cite the spec section explaining that this is UB. Dereferencing the produced pointer may be UB, but casting itself is not, since uint8_t is ~ char and char* can be cast to and from any type.

You might try to argue that uint8_t is not necessarily char. While it is true that implementations of C can exist where CHAR_BIT > 8, those do not have uint8_t defined (as per spec), so if you have uint8_t, then it is "unsigned char", which makes this cast perfectly safe and defined as far as I can tell. Of course CHAR_BIT is required to be >= 8, so if it is not > 8, it is exactly 8. (In any case, whether uint8_t is literally a typedef of unsigned char is implementation-defined and not actually relevant to whether the cast itself is valid -- it is)

reply
raphlinus
5 hours ago
[-]
The issue is not type punning (itself a very common source of UB), but the fact that the `bytes` pointer might not be int-aligned. The spec is clear that the creation (not just the dereferencing) of an unaligned pointer is UB, see 6.3.2.3 paragraph 7 of the C11 (draft) spec.

Of course, this exchange just demonstrates the larger point, that even a world-class expert in low level programming can easily make mistakes in spotting potential UB.
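A minimal sketch of how the alignment question can be sidestepped entirely, assuming the goal is only to compare the first four bytes against a magic number. The name `parse_packet_memcpy` and the value `EXAMPLE_MAGIC` are hypothetical, and the comparison is done in host byte order, which a real protocol parser would likely not want:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical magic value, chosen only for this illustration. */
#define EXAMPLE_MAGIC 0x12345678u

/* Copies the first four bytes into a properly aligned local object.
 * No uint32_t* is ever formed from `bytes`, so 6.3.2.3p7 never applies.
 * Assumes the caller guarantees at least sizeof(uint32_t) readable bytes. */
bool parse_packet_memcpy(const uint8_t *bytes) {
    assert(bytes != NULL);
    uint32_t magic;
    memcpy(&magic, bytes, sizeof magic);
    return magic == EXAMPLE_MAGIC;
}
```

Mainstream optimizing compilers typically turn a fixed-size `memcpy` like this into a single load where the target allows it, so the defined version usually costs nothing.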

reply
flohofwoe
4 hours ago
[-]
> Of course, this exchange just demonstrates the larger point, that even a world-class expert in low level programming can easily make mistakes in spotting potential UB.

A "world-class expert in low level programming" knows that unaligned memory accesses are no problem anymore on most modern CPUs, and that this particular UB in the C standard is bogus and needs to be fixed ;)

reply
formerly_proven
4 hours ago
[-]
… it’s only UB if the pointer is actually misaligned. It’s not possible to tell from these two lines whether that’s the case.
reply
gritzko
5 hours ago
[-]
C of course is ancient. It remembers the Cambrian explosion of CPU architectures, twelve-bit bytes and everything like that. I wonder if it is possible to codify some pragmatic subset of it that works nicely on currently available CPUs. Cause the author of the piece goes back in time to prove his point (SPARCs and Alphas).
reply
dmitrygr
5 hours ago
[-]
Fun story: even the latest C spec doesn’t require CHAR_BIT == 8, but it does now codify 2s complement int representation. (IIRC)
reply
eru
4 hours ago
[-]
For unsigned ints, or also for signed ints?
reply
account42
2 hours ago
[-]
Two's complement is a representation specifically for signed integers.
reply
dmitrygr
4 hours ago
[-]
For signed. Unsigned overflow was defined for a while now.
reply
gblargg
2 hours ago
[-]
And unsigned negation is two's complement negation as well (-u = 0-u).
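A small sketch of that identity in C. The function names are made up for illustration; the behavior relies only on unsigned arithmetic being reduced modulo `UINT_MAX + 1`:

```c
#include <assert.h>
#include <limits.h>

/* Unary minus on an unsigned int is fully defined: the result is the
 * mathematical negation reduced modulo UINT_MAX + 1, i.e. 0 - u. */
unsigned negate_unsigned(unsigned u) {
    return -u;
}

/* Spot-checks the identity for a few values. */
void demo_unsigned_negation(void) {
    assert(negate_unsigned(0u) == 0u);
    assert(negate_unsigned(1u) == UINT_MAX);
    assert(negate_unsigned(5u) == 0u - 5u);
    assert(negate_unsigned(5u) + 5u == 0u);
}
```

Some compilers warn about unary minus applied to an unsigned operand, but that is a diagnostic about likely mistakes, not a sign of undefined behavior.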
reply
dmitrygr
5 hours ago
[-]
That cast is valid. Spec does not guarantee same bit sequence for resulting pointer and source pointer. But as the cast is explicitly allowed, it is not UB. Compiler is free to round the pointer down. Or up. Or even sideways. All ok. Dereferencing it — indeed not ok. But the cast is explicitly allowed and not UB.

Pointer casts changing pointer bit sequences are common on weird platforms (e.g. some TI DSPs, PIC, and aarch64+PAC). And it is valid as per spec. Pointer assignment is not required to be the same as memcpy-ing the pointer onto a pointer to another type.

You misunderstood the spec. No promises are made that that cast copies the pointer bit for bit (and thus creates an invalid pointer). Therefore, your objection to invalid pointers is null and void. :)

reply
raphlinus
5 hours ago
[-]
I'm not assuming anything about bit representations. In this case, the spec language is quite clear and unambiguous.

6.3.2.3 paragraph 7: A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned (footnote 68) for the referenced type, the behavior is undefined. Otherwise, when converted back again, the result shall compare equal to the original pointer. When a pointer to an object is converted to a pointer to a character type, the result points to the lowest addressed byte of the object. Successive increments of the result, up to the size of the object, yield pointers to the remaining bytes of the object.

This is a subsection of section 6.3, which describes conversions, including both implicit conversions and those from a cast operation. This language is not saying anything about bit representations or dereferencing.

I happen to be wearing my undefined behavior shirt at the moment, which lends me an extra layer of authority. I'm at RustWeek in Utrecht, and it's one of my favorite shirts to wear at Rust conferences. But let's say for the sake of argument that you are right and I am indeed misunderstanding the spec. Then the logical conclusion is that it's very difficult for even experienced programmers to agree on basic interpretations of what is and what isn't UB in C.

reply
dmitrygr
4 hours ago
[-]
I do not see there a promise that the cast will produce an invalid pointer, nor anything prohibiting the compiler from rounding the pointer down, thus producing a valid one. “Converted” does not require bit copy. I don’t see how this interpretation is against any section of the spec.
reply
dwattttt
3 hours ago
[-]
I also do not see any requirement in the quoted text that the casted pointer be dereferenced before noting "the behavior is undefined".

In practice performing a cast doesn't really do much until you dereference, but without a carve out in the spec, it does really mean "the behavior is undefined".

reply
cyclopeanutopia
4 hours ago
[-]
> Otherwise, when converted back again, the result shall compare equal to the original pointer.

Doesn't this part exclude the possibility of rounding down?

reply
pjc50
3 hours ago
[-]
> rounding the pointer down, thus producing a valid one

A "valid" pointer to the wrong object?

reply
thomashabets2
2 hours ago
[-]
Author here.

> A pointer to an object type may be converted to a pointer to a different object type. If the resulting pointer is not correctly aligned (footnote 71) for the referenced type, the behavior is undefined.

C23 6.3.2.3p7.
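A hedged sketch of staying on the defined side of that paragraph by checking alignment before converting. The helper names are hypothetical. The check goes through `uintptr_t`, whose mapping from pointers is implementation-defined, so this assumes the usual flat address space where the integer value reflects the address:

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Sanity check for the assumption above, not a guarantee from the standard. */
static_assert(sizeof(uintptr_t) >= sizeof(void *), "uintptr_t too small");

/* Returns nonzero if p is suitably aligned for int on this implementation. */
int is_int_aligned(const void *p) {
    return ((uintptr_t)p % alignof(int)) == 0;
}

/* Converts only when the result would be correctly aligned, so the
 * conversion itself is never undefined. Returns NULL otherwise. */
const int *as_int_ptr_or_null(const uint8_t *bytes) {
    if (bytes == NULL || !is_int_aligned(bytes)) {
        return NULL;
    }
    return (const int *)(const void *)bytes;
}
```

This only addresses the conversion. Reading through the returned pointer is still subject to the effective type rules, so copying with `memcpy` remains the safer way to actually read the value.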

reply
stevenhuang
5 hours ago
[-]
Byte and int have different alignment requirements. It is UB the moment you make such a pointer.

Great way to demonstrate the point of the article.

reply
gritzko
4 hours ago
[-]
That better be marked "historical". At least, Lemire says:

On recent Intel and 64-bit ARM processors, data alignment does not make processing a lot faster. It is a micro-optimization. Data alignment for speed is a myth. // https://lemire.me/blog/2012/05/31/data-alignment-for-speed-m...

(while in the olden days, a program may crash on unaligned access, esp on RISC)

reply
eru
4 hours ago
[-]
Don't mix up what processors do with what the C standard allows you to get away with.
reply
flohofwoe
4 hours ago
[-]
...and don't mix up the C standard with what actually existing compilers allow you to get away with ;) In the end the standard is merely a set of guidelines. What matters is how compiler toolchains behave in the real world, and breaking code which does unaligned memory accesses by 'UB exploitation' would be quite insane.
reply
dmitrygr
4 hours ago
[-]
Without memcpy, there is no guarantee that that line produces an invalid pointer.

I don’t see what spec part would prohibit that cast from validly compiling to

   BIC r3, r0, #3
The spec only guarantees a round trip through char* for pointers that are properly aligned for the type. This doesn't break that.
reply
reinhash
3 hours ago
[-]
Rust.
reply
grougnax
4 hours ago
[-]
Use Rust!
reply
liamd1988
5 hours ago
[-]
When using C, keep using char* and don't mess with int*.
reply
momo26
5 hours ago
[-]
Debugging in C is soooo hard. When I was writing Malloc Lab in my systems course, there were countless undefined-behavior and out-of-range errors :(
reply
flohofwoe
5 hours ago
[-]
Yet, debugging memory corruption issues in C and C++ code with modern compiler toolchains and memory debugging tools is infinitely easier than 25 years ago.

(e.g. just compiling with address sanitizer and using static analyzers catch pretty much all of the 'trivial' memory corruption issues).

reply
bullen
1 hour ago
[-]
Everything in Java is defined behaviour, you need a VM with GC to remain sane.

Everything else is a waste of time!

reply