What came first: the CNAME or the A record?
259 points
7 hours ago
| 25 comments
| blog.cloudflare.com
steve1977
5 hours ago
[-]
I don't find the wording in the RFC to be that ambiguous actually.

> The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.

The "possibly preface" (sic!) to me is obviously to be understood as "if there are any CNAME RRs, the answer to the query is to be prefaced by those CNAME RRs" and not "you can preface the query with the CNAME RRs or you can place them wherever you want".

reply
mrmattyboy
4 hours ago
[-]
I agree this doesn't seem too ambiguous - it's "you may do this.." and they said "or we may do the reverse". If I say you could prefix something, the alternative isn't that you can suffix it.

But also.. the programmers working on the software running one of the most important (end-user) DNS servers in the world:

1. Changes logic in how CNAME responses are formed

2. I assume some tests at least broke that meant they needed to be "fixed up" (y'know - "when a CNAME is queried, I expect this response")

3. No one saw these changes in test behavior and thought "I wonder if this order is important", or "We should research this more", or "Are other DNS servers changing the order?", or "This should be flagged for a very gradual release".

4. Ends up in the test environment for, what, a month... nothing using getaddrinfo from glibc was used to test this environment, and no one noticed that it was broken

Cloudflare seem to be getting into the swing of breaking things and then being transparent about it. But this really reads as a fun "did you know", not a "we broke things again - please still use us".

There's no real RCA except to blame an RFC - but honestly, for a large-scale operation like theirs, this seems like a very big thing to slip through the cracks.

I would make a joke about South Park's oil "I'm sorry".. but they don't even seem to be

reply
black3r
55 minutes ago
[-]
> 4. Ends up in the test environment for, what, a month... nothing using getaddrinfo from glibc was used to test this environment, and no one noticed that it was broken

"Testing environment" sounds to me like a real network real user devices are used with (like the network used inside CloudFlare offices). That's what I would do if I was developing a DNS server anyway, other than unit tests (which obviously wouldn't catch this unless they were explicitly written for this case) and maybe integration/end-to-end tests, which might be running in Alpine Linux containers and as such using musl. If that's indeed the case, I can easily imagine how noone noticed anything was broken. First look at this line:

> Most DNS clients don’t have this issue. For example, systemd-resolved first parses the records into an ordered set:

Now think about what real end user devices are using: Windows/macOS/iOS obviously aren't using glibc and Android also has its own C library even though it's Linux-based, and they all probably fall under the "Most DNS clients don't have this issue.".

That leaves GNU/Linux, where we could reasonably expect most software to use glibc for resolving queries, so presumably anyone using Linux on their laptop would catch this, right? Except most distributions started using systemd-resolved (the most notable exception is Debian, but not many people use that on desktops/laptops), which is a local caching DNS resolver, and as such acts as a middleman between glibc software and the network-configured DNS server, so it would resolve queries against 1.1.1.1 correctly and then return the results from its cache ordered by its own ordering algorithm.

reply
jrochkind1
1 hour ago
[-]
> I assume some tests at least broke that meant they needed to be "fixed up"

OP said:

"However, we did not have any tests asserting the behavior remains consistent due to the ambiguous language in the RFC."

One could guess it's something like -- back when we wrote the tests, years ago, whoever did it missed that this was required, not helped by the fact that the spec preceded RFC 2119's standardization of the all-caps "MUST"/"SHOULD" etc. language, which would have helped us translate specs to tests more completely.

reply
bpt3
3 hours ago
[-]
> Ends up in the test environment for, what, a month... nothing using getaddrinfo from glibc was used to test this environment, and no one noticed that it was broken

This is the part that is shocking to me. How is getaddrinfo not called in any unit or system tests?

reply
zinekeller
33 minutes ago
[-]
As black3r mentioned (https://news.ycombinator.com/item?id=46686096), it is likely rearranged by systemd, therefore only non-systemd glibc distributions are affected.

I would hazard a guess that their test environment has both the systemd variant and the Unbound variant (Unbound technically does not reorder them, but instead reconstructs the chain according to the RFC's "CNAME restart" logic because it is a recursive resolver itself), but not plain glibc pointed directly at an upstream via resolv.conf (presumably because who would run that in this day and age? This is sadly only a half-joke, because few people fall into this category).

reply
SAI_Peregrinus
2 hours ago
[-]
Probably Alpine containers, so musl's version instead of glibc's.
reply
inopinatus
4 hours ago
[-]
The article makes it very clear that the ambiguity arises in another phrase: “difference in ordering of the RRs in the answer section is not significant”, which is applied to an example; the problem with examples is that they are illustrative, viz. generalisable, and thus may permit reordering everywhere; and in any case, whether they should or shouldn’t becomes a matter of pragmatic context.

Which goes to show, one person’s “obvious understanding” is another’s “did they even read the entire document”.

All of which also serves to highlight the value of normative language, but that came later.

reply
the_mitsuhiko
3 hours ago
[-]
> I don't find the wording in the RFC to be that ambiguous actually.

You might not find it ambiguous but it is ambiguous and there were attempts to fix it. You can find a warmed up discussion about this topic here: https://mailarchive.ietf.org/arch/msg/dnsop/2USkYvbnSIQ8s2vf...

reply
a7b3fa
4 hours ago
[-]
I agree with you, and I also think that their interpretation of example 6.2.1 in the RFC is somewhat nonsensical. It states that “The difference in ordering of the RRs in the answer section is not significant.” But from the RFC, very clearly this comment is relevant only to that particular example; it is comparing two responses and saying that in this case, the different ordering has no semantic effect.

And perhaps this is somewhat pedantic, but they also write that “RFC 1034 section 3.6 defines Resource Record Sets (RRsets) as collections of records with the same name, type, and class.” But looking at the RFC, it never defines such a term; it does say that within a “set” of RRs “associated with a particular name” the order doesn’t matter. But even if the RFC had said “associated with a particular combination of name, type, and class”, I don’t see how that could have introduced ambiguity. It specifies an exception to a general rule, so obviously if the exception doesn’t apply, then the general rule must be followed.

Anyway, Cloudflare probably know their DNS better than I do, but I did not find the article especially persuasive; I think the ambiguity is actually just a misreading, and that the RFC does require a particular ordering of CNAME records.

(ETA:) Although admittedly, while the RFC does say that CNAMEs must come before As in the answer, I don’t necessarily see any clear rule about how CNAME chains must be ordered; the RFC just says “Domain names in RRs which point at another name should always point at the primary name and not the alias ... Of course, by the robustness principle, domain software should not fail when presented with CNAME chains or loops; CNAME chains should be followed”. So actually I guess I do agree that there is some ambiguity about the responses containing CNAME chains.

reply
taeric
4 hours ago
[-]
Isn't this literally noted in the article? The article even points out that the RFC is from before normative words were standardized for hard requirements.
reply
devman0
4 hours ago
[-]
Even if 'possibly preface' is interpreted to mean CNAME RRsets should appear first, there is still a broken reliance by some resolvers on the order of CNAME RRsets when there is more than one CNAME in the chain. This expectation of ordering is not promised by the relevant RFCs.
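
Not the article's code, just a minimal sketch (hypothetical names and data) of why chain order matters to a single-pass client in the spirit of glibc's getanswer_r, which only accepts a CNAME that renames the name it is currently chasing:

    # Single-pass walk over an answer section: follow a CNAME only if it
    # renames the name currently being resolved, collect matching A records.
    def strict_walk(qname, answer):
        current = qname
        addresses = []
        for name, rtype, rdata in answer:      # records in wire order
            if rtype == "CNAME" and name == current:
                current = rdata                # follow the alias
            elif rtype == "A" and name == current:
                addresses.append(rdata)
        return addresses

    chain = [
        ("www.example.com.", "CNAME", "cdn.example.net."),
        ("cdn.example.net.", "CNAME", "edge.example.org."),
        ("edge.example.org.", "A", "192.0.2.1"),
    ]
    print(strict_walk("www.example.com.", chain))                           # ['192.0.2.1']
    print(strict_walk("www.example.com.", [chain[1], chain[0], chain[2]]))  # []

Both responses contain the same RRsets, all CNAMEs before the address record, yet the second one yields no usable addresses to this kind of client.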
reply
paulddraper
5 hours ago
[-]
100%

I just commented the same.

It's pretty clear that the "possibly" refers to the presence of the CNAME RRs, not the ordering.

reply
Dylan16807
3 hours ago
[-]
The context makes it less clear, but even if we pretend that part is crystal, a comment that stops there is missing the point of the article. All CNAMEs at the start isn't enough. The order of the CNAMEs can cause problems despite perfect RFC compliance.
reply
andrewshadura
3 hours ago
[-]
To me, this reads exactly the opposite.
reply
patrickmay
6 hours ago
[-]
A great example of Hyrum's Law:

"With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody."

combined with failure to follow Postel's Law:

"Be conservative in what you send, be liberal in what you accept."

reply
mmastrac
5 hours ago
[-]
Postel's law is considered more and more harmful as the industry has evolved.
reply
CodesInChaos
5 hours ago
[-]
That depends on how Postel's law is interpreted.

What's reasonable is: "Set reserved fields to 0 when writing and ignore them when reading." (I heard that was the original example). Or "Ignore unknown JSON keys" as a modern equivalent.

What's harmful is: Accept an ill-defined superset of the valid syntax and interpret it in undocumented ways.
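
A small illustration of the reasonable reading (hypothetical field names): only documented keys are honoured and everything else is ignored, while malformed JSON is still rejected:

    import json

    KNOWN_KEYS = {"name", "ttl"}               # the documented fields

    def parse_record(payload):
        data = json.loads(payload)             # invalid JSON still fails loudly
        return {k: v for k, v in data.items() if k in KNOWN_KEYS}

    print(parse_record('{"name": "example.com", "ttl": 300, "added_later": true}'))
    # {'name': 'example.com', 'ttl': 300}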

reply
tuetuopay
3 hours ago
[-]
Funny, I never read the original example. And in my book, it is harmful, and even worse in JSON, since it's the best way to have a typo somewhere go unnoticed for a long time.
reply
sweetjuly
1 hour ago
[-]
The original example is very common in ISAs at least. Both ARMv8 and RISC-V (likely others too, but I don't have as much experience with them) have the idea of requiring software to treat reserved bits as if they were zero for both reading and writing. ARMv8 calls this RES0, and a hardware implementation is constrained to either being write-ignore for the field (e.g. the read is hardwired to zero) or returning the last successful write.

This is useful as it allows the ISA to remain compatible with code which is unaware of future extensions which define new functionality for these bits so long as the zero value means "keep the old behavior". For example, a system register may have an EnableNewFeature bit, and older software will end up just writing zero to that field (which preserves the old functionality). This avoids needing to define a new system register for every new feature.
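
A toy illustration (made-up register layout, not an actual ARM or RISC-V register) of what that discipline looks like from the software side:

    # Hypothetical 32-bit control register: bit 0 is a defined Enable bit,
    # bits 1-31 are reserved (RES0). Old software masks the reserved bits off
    # when reading and writes them as zero, so a later revision can define
    # EnableNewFeature in bit 1 with zero meaning "keep the old behavior".
    ENABLE = 1 << 0
    DEFINED_MASK = ENABLE                      # everything else is RES0

    def read_enable(raw):
        return raw & ENABLE                    # ignore whatever RES0 reads as

    def write_value(enable):
        return (ENABLE if enable else 0) & DEFINED_MASK   # RES0 written as 0

    print(read_enable(0xFFFFFFF1), hex(write_value(True)))   # 1 0x1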

reply
treve
4 hours ago
[-]
Good modern protocols will explicitly define extension points, so 'ignoring unknown JSON keys' is in-spec rather than something an implementer is assumed to do.
reply
yxhuvud
3 hours ago
[-]
I disagree. I find accepting extra random bytes in places to be just as harmful. I prefer APIs that push back and tell me what I did wrong when I mess up.
reply
n2d4
5 hours ago
[-]
Very much so. A better law would be to be conservative in both sending and accepting, as it turns out that if you are liberal in what you accept, senders will choose to disobey Postel's law and be liberal in what they send, too.
reply
mikestorrent
3 hours ago
[-]
It's an oscillation. It goes in cycles. Things formalize upward until you've reinvented XML, SOAP and WSDLs; then a new younger generation comes in and says "all that stuff is boring and tedious, here's this generation's version of duck typing", followed by another ten years of tacking strong types onto that.

MCP seems to be a new round of the cycle beginning again.

reply
Gigachad
2 hours ago
[-]
The modern view seems to be you should just immediately abort if the spec isn't being complied with since it's possibly someone trying to exploit the system with malformed data.
reply
esafak
5 hours ago
[-]
I think it is okay to accept liberally as long as you combine it with warnings for a while to give offenders a chance to fix it.
reply
hdjrudni
5 hours ago
[-]
"Warnings" are like the most difficult thing to 'send' though. If an app or service doesn't outright fail, warnings can be ignored. Even if not ignored... how do you properly inform? A compiler can spit out warnings to your terminal, sure. Test-runners can log warnings. An RPC service? There's no standard I'm aware of. And DNS! Probably even worse. "Yeah, your RRs are out of order but I sorted them for you." where would you put that?
reply
esafak
5 hours ago
[-]
> how do you properly inform?

Through the appropriate channels; in-band and out-of-band.

reply
immibis
2 hours ago
[-]
a content-less tautology
reply
diarrhea
4 hours ago
[-]
Randomly fail or (increasingly) delay a random subset of all requests.
reply
Melonai
3 hours ago
[-]
That sounds awful and will send administrators on a wild goose chase throughout their stack to find the issue, without many clues except that this thing is failing at seemingly random times. (I myself would suspect something related to network connectivity, maybe requests are timing out? This idea would lead me in the completely wrong direction.)

It also does not give any way to actually see a warning message; where would we even put it? I know for a fact that if my glibc DNS resolver started spitting out errors into /var/log/god_knows_what, I would take days to find it. At best the resolver could return some kind of errno, with perror giving us a message like "The DNS response has not been correctly formatted", and then hope that the message is caught and forwarded through whatever is wrapping the C library, hopefully into our stderr. And there's so many ways even that could fail.

reply
SahAssar
1 hour ago
[-]
So we arrive at the logical conclusion: you send errors in Morse code, encoded as seconds/minutes of failures/successes. Any reasonable person would be able to recognize Morse when seeing the patterns on an observability graph.

Start with milliseconds, move on to seconds and so on as the unwanted behavior continues.

reply
psnehanshu
5 hours ago
[-]
Warnings are ignored. It's much better to fail fast.
reply
dotancohen
5 hours ago
[-]
The Python community was famously divided on that matter, wrt Python 3. Now that it is over, most people on the "accept liberally" side of the fence have jumped sides.
reply
ajross
4 hours ago
[-]
That's true, but sort of misses the spirit of Hyrum's law (which is that the world is filled with obscure edge cases).

In this case the broken resolver was the one in the GNU C Library, hardly an obscure situation!

The news here is sort of buried in the story. Basically Cloudflare just didn't test this. Literally every datacenter in the world was going to fail on this change, probably including their own.

reply
black3r
50 minutes ago
[-]
> Literally every datacenter in the world was going to fail on this change

I would expect most datacenters to use their own local recursive caching DNS servers instead of relying on 1.1.1.1 to minimize latency.

reply
chrisweekly
3 hours ago
[-]
Obligatory xkcd for Hyrum's Law: https://xkcd.com/1172
reply
bwblabs
2 hours ago
[-]
I will hijack this post to point out that CloudFlare really doesn't understand RFC1034: their DNS authoritative interface only blocks A and AAAA if there is a CNAME defined, e.g. see this:

  $ echo "A AAAA CAA CNAME DS HTTPS LOC MX NS TXT" | sed -r 's/ /\n/g' | sed -r 's/^/rfc1034.wlbd.nl /g' | xargs dig +norec +noall +question +answer +authority @coco.ns.cloudflare.com
  ;rfc1034.wlbd.nl.  IN A
  rfc1034.wlbd.nl. 300 IN CNAME www.example.org.
  ;rfc1034.wlbd.nl.  IN AAAA
  rfc1034.wlbd.nl. 300 IN CNAME www.example.org.
  ;rfc1034.wlbd.nl.  IN CAA
  rfc1034.wlbd.nl. 300 IN CAA 0 issue "really"
  ;rfc1034.wlbd.nl.  IN CNAME
  rfc1034.wlbd.nl. 300 IN CNAME www.example.org.
  ;rfc1034.wlbd.nl.  IN DS
  rfc1034.wlbd.nl. 300 IN DS 0 13 2 21A21D53B97D44AD49676B9476F312BA3CEDB11DDC3EC8D9C7AC6BAC A84271AE
  ;rfc1034.wlbd.nl.  IN HTTPS
  rfc1034.wlbd.nl. 300 IN HTTPS 1 . alpn="h3"
  ;rfc1034.wlbd.nl.  IN LOC
  rfc1034.wlbd.nl. 300 IN LOC 0 0 0.000 N 0 0 0.000 E 0.00m 0.00m 0.00m 0.00m
  ;rfc1034.wlbd.nl.  IN MX
  rfc1034.wlbd.nl. 300 IN MX 0 .
  ;rfc1034.wlbd.nl.  IN NS
  rfc1034.wlbd.nl. 300 IN NS rfc1034.wlbd.nl.
  ;rfc1034.wlbd.nl.  IN TXT
  rfc1034.wlbd.nl. 300 IN TXT "Check my cool label serving TXT and a CNAME, in violation with RFC1034"
The result is that DNS resolvers (including CloudFlare Public DNS) will have a cache-dependent result if you query e.g. a TXT record (depending on whether the CNAME is cached). At internet.nl (https://github.com/internetstandards/) we found out because some people claimed to have a TXT DMARC record while also CNAMEing that name (which results in cache-dependent results; and since internet.nl uses RFC 9156 QName Minimisation, it first resolves A, therefore caches the CNAME and will never see the TXT). People configure things similar to the https://mxtoolbox.com/dmarc/dmarc-setup-cname instructions (which I find in conflict with RFC1034).
reply
ZoneZealot
2 hours ago
[-]
> People configure things similar to https://mxtoolbox.com/dmarc/dmarc-setup-cname instructions (which I find in conflict with RFC1034).

I don't think they're advising anyone to create both a CNAME and a TXT at the same label - but it certainly looks like that from the weird screenshot at step 5 (which doesn't match the text).

I think it's mistakenly a mish-mash of two different guides, one for 'how to use a CNAME to point to a third party DMARC service entirely' and one for 'how to host the DMARC record yourself' (irrespective of where the RUA goes).

reply
bwblabs
2 hours ago
[-]
I'm not sure, but we're seeing this specifically with _dmarc CNAMEing to '.hosted.dmarc-report.com' together with a TXT record type; also see this discussion with users asking for this at deSEC: https://talk.desec.io/t/cannot-create-cname-and-txt-record-f...

My main point, however, was that it's really not okay that CloudFlare allows setting up other record types (e.g. TXT, but basically any) next to a CNAME.

reply
NelsonMinar
5 hours ago
[-]
It's remarkable that the ordinary DNS lookup function in glibc doesn't work if the records aren't in the right order. It's amazing to me we went 20+ years without that causing more problems. My guess is most people publishing DNS records just sort of knew that the order mattered in practice, maybe figuring it out in early testing.
reply
pixl97
5 hours ago
[-]
I think it's more a matter of server-side ordering: there were not that many DNS server implementations out there, and the ones that didn't keep the order quickly changed their behavior because of interop.

CNAMES are a huge pain in the ass (as noted by DJB https://cr.yp.to/djbdns/notes.html)

reply
silverwind
5 hours ago
[-]
It's more likely because the internet runs on a very small number of authoritative server implementations which all implement this ordering quirk.
reply
immibis
2 hours ago
[-]
This is a recursive resolver quirk
reply
zinekeller
24 minutes ago
[-]
... that was perpetuated by BIND.

(Yes, there are other recursive resolver implementations, but they look at BIND as the reference implementation and absent any contravention to the RFC or intentional design-level decisions, they would follow BIND's mechanism.)

reply
fweimer
1 hour ago
[-]
The last time this came up, people said that it was important to filter out unrelated address records in the answer section (with names to which the CNAME chain starting at the question name does not lead). Without the ordering constraint (or a rather low limit on the number of CNAMEs in a response), this needs a robust data structure for looking up DNS names. Most in-process stub resolvers (including the glibc one) do not implement a DNS cache, so they presently do not have a need to implement such a data structure. This is why eliminating the ordering constraint while preserving record filtering is not a simple code change.
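
A toy sketch (not glibc code, hypothetical names and data) of what order-independent filtering would need: index the CNAMEs by owner name, chase the chain from the question name, and keep only addresses owned by names on that chain:

    # Build a name -> target map for the CNAMEs, follow the chain from the
    # question name, then keep only A records owned by a name on the chain.
    # This is the extra lookup structure a cache-less stub resolver avoids
    # today by relying on the records arriving in chain order.
    def filter_addresses(qname, answer):
        aliases = {n: t for n, rtype, t in answer if rtype == "CNAME"}
        chain, current = {qname}, qname
        for _ in range(16):                    # crude guard against CNAME loops
            if current not in aliases:
                break
            current = aliases[current]
            chain.add(current)
        return [r for n, rtype, r in answer if rtype == "A" and n in chain]

    answer = [
        ("edge.example.org.", "A", "192.0.2.1"),
        ("www.example.com.", "CNAME", "cdn.example.net."),
        ("unrelated.example.", "A", "203.0.113.9"),   # dropped by the filter
        ("cdn.example.net.", "CNAME", "edge.example.org."),
    ]
    print(filter_addresses("www.example.com.", answer))   # ['192.0.2.1']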
reply
seiferteric
5 hours ago
[-]
Now that I have seemingly taken on managing DNS at my current company, I have seen several inadequacies of DNS that I was not aware of before. The main one is that if an upstream DNS server returns SERVFAIL, there is really no distinction between whether the server you are querying has failed or whether the actual authoritative server upstream is broken (I am aware of EDEs, but they don't really solve this). So clients querying a broken domain will retry each of their configured DNS servers, and our caching layer (Unbound) will also retry each of its upstreams, etc. This results in a bunch of pointless upstream queries, like an amplification attack. I also have issues with the search path doing stupid queries that return NXDOMAIN, like badname.company.com, badname.company.othername.com... etc.
reply
indigodaddy
3 hours ago
[-]
Re: your SERVFAIL observation, oh man did I run into this exact issue about a year or so ago when it came up for a particular zone. All I was doing was troubleshooting it on the caching server. It took me a day or two to actually look at the auth server and find out that the issue actually originated there.
reply
m3047
3 hours ago
[-]
DNS is a wire protocol, payload specification, and application protocol. For all of that, I personally wonder whether its enduring success isn't that it's remarkably underspecified when you get to the corner cases.

There's also so much of it, and it mostly works, most of the time. This creates a hysteresis loop in human judgement of efficacy: even a blind chicken gets corn if it's standing in it. Cisco bought cisco., but (a decade ago, when I had access to the firehose) on any given day belkin. would be in the top 10 TLDs if you looked at the NXDOMAIN traffic. Clients don't opportunistically try TCP (which they shouldn't, according to the specification...), but we have DoT (...but should in practice). My ISPs reverse DNS implementation is so bad that qname minimization breaks... but "nobody should be using qname minimization for reverse DNS", and "Spamhaus is breaking the law by casting shades at qname minimization".

"4096 ought to be enough for anybody" (no, frags are bad. see TCP above). There is only ever one request in a TCP connection... hey, what are these two bytes which are in front of the payload in my TCP connection? People who want to believe that their proprietary headers will be preserved if they forward an application protocol through an arbitrary number of intermediate proxy / forwarders (because that's way easier than running real DNS at the segment edge and logging client information at the application level).

Tangential, but: "But there's more to it, because people doing these things typically describe how it works for them (not how it doesn't work) and onlookers who don't pay close attention conclude "it works"." http://consulting.m3047.net/dubai-letters/dnstap-vs-pcap.htm...

reply
forinti
5 hours ago
[-]
> While in our interpretation the RFCs do not require CNAMEs to appear in any particular order, it’s clear that at least some widely-deployed DNS clients rely on it. As some systems using these clients might be updated infrequently, or never updated at all, we believe it’s best to require CNAME records to appear in-order before any other records.

That's the only reasonable conclusion, really.

reply
hdjrudni
5 hours ago
[-]
And I'm glad they came to it. Even if everyone else is wrong (I'm not saying they are) sometimes you just have to play along.
reply
linsomniac
1 hour ago
[-]
>While in our interpretation the RFCs do not require CNAMEs to appear in any particular order

That seems like some doubling-down BS to me, since they earlier say "It's ambiguous because it doesn't use MUST or SHOULD, which were introduced a decade after the DNS RFC." The RFC says:

>The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.

How do you get to interpreting that, in the face of "MUST" being defined a decade later, as "I guess I can append the CNAME to the answer"?

Holding onto "we still think the RFC allows it" is a problem. The world is a lot better if you can just admit to your mistakes and move on. I try to model this at home and at work, because trying to "language lawyer" your way out of being wrong makes the world a worse place.

reply
sebastianmestre
5 hours ago
[-]
I kind of wish they'd start sending records in randomized order to take out all the broken implementations that depend on such a fragile property.
reply
wolttam
3 hours ago
[-]
Is the property of an answer being ordered in the order that resolutions were performed to construct it /that/ fragile?

Randomization within the final answer RRSet is fine (and maybe even preferred in a lot of cases)

reply
t0mas88
2 hours ago
[-]
Well, Cisco had their switches get into a boot loop; that sounds very broken...
reply
teddyh
1 hour ago
[-]
Cloudflare is well known for breaking DNS standards, and also then writing a new RFC to justify their broken behavior, and getting IETF to approve it. (The existence of RFC 8482 is a disgrace to everyone involved.)

> To prevent any future incidents or confusion, we have written a proposal in the form of an Internet-Draft to be discussed at the IETF

Of course.

reply
tuetuopay
5 hours ago
[-]
Many rightfully interpret the RFC as "CNAMEs have to be before A", but the issue persists between CNAMEs in the chain, as noted in the article. If a record in the middle of the chain expires, glibc would still fail if the "middle" record were to be inserted between the other CNAMEs and the A records.

It’s always DNS.

reply
mdavid626
4 hours ago
[-]
I would expect that DNS servers at this scale, like 1.1.1.1, have integration tests running real resolvers, like the one in glibc. How come this issue was discovered only in production?
reply
t0mas88
2 hours ago
[-]
This case would only happen if a CNAME chain first expired from the cache in the wrong order and then subsequently was queried via glibc. Their tests may test both that glibc resolving works and that re-querying expired records works, but not the combination of the two.
reply
wolttam
3 hours ago
[-]
My take on this is quite cynical... This post reads to me like an after-the-fact justification of some strange newly introduced behaviour.

Please order the answer in the order the resolutions were performed to arrive at the final answer (regardless of cache timings). Anything else makes little sense, especially not in the name of some micro-optimization (which could likely be approached in other ways that don’t alter behaviour).

reply
Gigachad
2 hours ago
[-]
The DNS specification should be updated to say CNAMEs _must_ be ordered at the top rather than "possibly". Cloudflare was complying with the specification. Cisco was relying on unspecified behavior that happened to be common.
reply
danepowell
4 hours ago
[-]
Doesn't the precipitating change optimize memory on the DNS server at the expense of additional memory usage across millions of clients that now need to parse an unordered response?
reply
Dylan16807
4 hours ago
[-]
The memory involved is a kilobyte. The optimization isn't important anywhere. The fragility is what's important.

Also no, the client doesn't need more memory to parse the out-of-order response, it can take multiple passes through the kilobyte.

reply
fweimer
1 hour ago
[-]
For most client interfaces, it's possible to just grab the addresses and ignore the CNAMEs altogether because the names do not matter, or only the name on the address record.

Of course, if the server sends unrelated address records in the answer section, that will result in incorrect data. (A simple counter can detect the end of the answer section, so it's not necessary to chase CNAMEs for section separation.)

reply
ShroudedNight
6 hours ago
[-]
I'm not an IETF process expert. Would this be worth filing errata against the original RFC in addition to their new proposed update?

Also, what's the right mental framework behind deciding when to release a patch RFC vs obsoleting the old standard for a comprehensive update?

reply
hdjrudni
5 hours ago
[-]
I don't know the official process, but as a human that sometimes reads and implements IETF RFCs, I'd appreciate updates to the original doc rather than replacing it with something brand new. Probably with some dated version history.

Otherwise I might go to consult my favorite RFC and not even know it's been superseded. And if it has been superseded with a brand new doc, now I have to start from scratch again instead of reading the diff or patch notes to figure out what needs updating.

And if we must supersede, I humbly request a warning be put at the top, linking the new standard.

reply
ShroudedNight
4 hours ago
[-]
At one point I could have sworn they were sticking obsoletion notices in the header, but now I can only find them in the right side-bar:

https://datatracker.ietf.org/doc/html/rfc5245

I agree, that it would be much more helpful if made obvious in the document itself.

It's not obvious that "updated by" notices are treated in any more helpful a manner than "obsoletes".

reply
fweimer
1 hour ago
[-]
There already is an I-D on this topic (based on previous work): https://datatracker.ietf.org/doc/draft-jabley-dnsop-ordered-...
reply
runningmike
4 hours ago
[-]
The end of this blog is: “To learn more about our mission to help build a better Internet, …”

Reminds me of https://news.ycombinator.com/item?id=37962674 or see https://tech.tiq.cc/2016/01/why-you-shouldnt-use-cloudflare/

reply
kayson
6 hours ago
[-]
> However, we did not have any tests asserting the behavior remains consistent due to the ambiguous language in the RFC.

Maybe I'm being overly-cynical but I have a hard time believing that they deliberately omitted a test specifically because they reviewed the RFC and found the ambiguous language. I would've expected to see some dialog with IETF beforehand if that were the case. Or some review of the behavior of common DNS clients.

It seems like an oversight, and that's totally fine.

reply
bombcar
6 hours ago
[-]
I took it as being "we wrote the tests to the standard" and then built the code, and whoever was writing the tests didn't read that line as a testable aspect.
reply
kayson
5 hours ago
[-]
Fair enough.
reply
supriyo-biswas
6 hours ago
[-]
My reading of that statement is their test, assuming they had one, looked something like this:

    rrs = resolver.resolve('www.example.test')
    assert Record("cname1.example.test", type="CNAME") in rrs
    assert Record("192.168.0.1", type="A") in rrs
Which wouldn't have caught the ordering problem.
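
In the same hypothetical style, a version that would have caught it only needs to assert on positions rather than membership:

    # Hypothetical, same pseudocode as above: assert wire order, not just presence.
    assert rrs[0] == Record("cname1.example.test", type="CNAME")
    assert rrs[-1] == Record("192.168.0.1", type="A")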
reply
hdjrudni
5 hours ago
[-]
It's implied that they intentionally tested it that way, without any assertions on the order. Not by oversight or incompetence, but because they didn't want to bake the requirement in due to uncertainty.
reply
paulddraper
5 hours ago
[-]
> RFC 1034, published in 1987, defines much of the behavior of the DNS protocol, and should give us an answer on whether the order of CNAME records matters. Section 4.3.1 contains the following text:

> If recursive service is requested and available, the recursive response to a query will be one of the following:

> - The answer to the query, possibly preface by one or more CNAME RRs that specify aliases encountered on the way to an answer.

> While "possibly preface" can be interpreted as a requirement for CNAME records to appear before everything else, it does not use normative key words, such as MUST and SHOULD that modern RFCs use to express requirements. This isn’t a flaw in RFC 1034, but simply a result of its age. RFC 2119, which standardized these key words, was published in 1997, 10 years after RFC 1034.

It's pretty clear that CNAME is at the beginning.

The "possibly" does not refer to the order but rather to the presence.

If they are present, they are first.

reply
kiwijamo
1 hour ago
[-]
Some people (myself included) read that as "would ideally come first, but it is not necessary that it comes first". The language is not clear IMHO and could be worded better.
reply
urbandw311er
2 hours ago
[-]
The whole world knows this except Cloudflare who actually did know it but are now trying to pretend that they didn’t.
reply
1vuio0pswjnm7
3 hours ago
[-]
"One such implementation that broke is the getaddrinfo function in glibc, which is commonly used on Linux for DNS resolution. When looking at its getanswer_r implementation, we can indeed see it expects to find the CNAME records before any answers:"

Wherever possible I compile with gethostbyname instead of getaddrinfo. I use musl instead of glibc

Nothing against IPv6 but I do not use it on the computers and networks I control

reply
1vuio0pswjnm7
26 minutes ago
[-]
NB. This is not code that belongs to me

When compiling software written by others, sometimes there are compile-time options that allow not using getaddrinfo or IPv6

For example,

links (--without-getaddrinfo)

haproxy (USE_GETADDRINFO="")

tnftp (--disable-ipv6)

elinks (--disable-ipv6)

wolfssl (ipv6 disabled by default)

stunnel (--disable-ipv6)

socat (--disable-ipv6)

and many more

Together with localhost TLS forward proxy I also use lots of older software that only used gethostbyname, e.g., original netcat, ucspi-tcp, libwww, original links, etc.

Generally I avoid mobile OS (corporate OS for data collection, surveillance and ad services)

Mobile data is disabled. I almost never use cellular networks for internet

Mobile sucks for internet IMHO; I have zero expectation re: speed and I cannot control what ISPs choose to do

For me, non-corporate UNIX-like OS are smaller, faster, easier to control, more interesting

reply
immibis
2 hours ago
[-]
Your code runs slower on mobile devices, since (as a rule of thumb) mobile networks are IPv6-only and IPv4 traffic has to pass through a few layers of tunneling.
reply
urbandw311er
2 hours ago
[-]
I feel like they fucked it up and then, when writing the post-mortem, went hunting for facts to retrospectively justify their previous decisions.
reply
renewiltord
6 hours ago
[-]
Nice analysis. Boy I can’t imagine having to work at Cloudflare on this stuff. A month to get your “small in code” change out only to find some bums somewhere have written code that will make it not work.
reply
stackskipton
6 hours ago
[-]
Or, when working on massive infrastructure like this, you write plenty of tests that would have saved you a month's worth of work.

They write the reordering, push it, the glibc test fires and fails, and you quickly discover "Crap, tests are failing and the dependency (glibc) doesn't work the way I thought it would."

reply
renewiltord
5 hours ago
[-]
I suspect that if you could save them this time, they'd gladly pay you for it. It'll be a bit of a sell, but they seem like a fairly sensible org.
reply
rjh29
4 hours ago
[-]
It was glibc's resolver that failed - not exactly obscure. It wasn't properly tested or rolled out, plain and simple.
reply
urbandw311er
2 hours ago
[-]
Or — hot take — to find out that you made some silly misinterpretation of the RFC that you then felt the need to retrospectively justify.
reply
frumplestlatz
6 hours ago
[-]
Given my years of experience with Cisco "quality", I'm not surprised by this:

> Another notable affected implementation was the DNSC process in three models of Cisco ethernet switches. In the case where switches had been configured to use 1.1.1.1 these switches experienced spontaneous reboot loops when they received a response containing the reordered CNAMEs.

... but I am surprised by this:

> One such implementation that broke is the getaddrinfo function in glibc, which is commonly used on Linux for DNS resolution.

Not that glibc did anything wrong -- I'm just surprised that anyone is implementing an internet-scale caching resolver without a comprehensive test suite that includes one of the most common client implementations on the planet.

reply
therein
5 hours ago
[-]
After the release got reverted, it took 1hr 28min for the deployment to propagate. You'd think that would be a very long time for CloudFlare infrastructure.
reply
rhplus
5 hours ago
[-]
We should probably all be glad that CloudFlare doesn't have the ability to update its entire global fleet any faster than 1h 28m, even if it’s a rollback operation.

Any change to a global service like that, even a rollback (or data deployment or config change), should be released to a subset of the fleet first, monitored, and then rolled out progressively.

reply
tuetuopay
5 hours ago
[-]
Given the seriousness of the outages they cause with instant worldwide deploys, I'm glad they took it calmly.
reply
steve1977
5 hours ago
[-]
They had to update all the down detectors first.
reply
charcircuit
6 hours ago
[-]
Random DNS servers and clients being broken in weird ways is such a common problem and will probably never go away unless DNS is abandoned altogether.

It's surprising how something so simple can be so broken.

reply
torstenvl
1 hour ago
[-]
> One such implementation that broke is the getaddrinfo function in glibc, which is commonly used on Linux for DNS resolution.

> Most DNS clients don’t have this issue.

The most widespread implementation on the most widespread server operating system has the issue. I'm skeptical of what the author means by "Most DNS clients."

Also, what is the point of deploying to test if you aren't going to test against extremely common scenarios (like getaddrinfo)?

> To prevent any future incidents or confusion, we have written a proposal in the form of an Internet-Draft to be discussed at the IETF. If consensus is reached...

Pretty sure both Hyrum's Law and Postel's Law have reached the point of consensus.

Being conservative in what you emit means following the spec's most conservative interpretation, even if you think the way it's worded gives you some wiggle room. And the fact that your previous implementation did it that way for a decade means people have come to rely on it.

reply