This is the kind of thing you read in a post-mortem and wonder how they designed something so fiendishly wonderful.
At 2:00am our MySQL master failed and failed over successfully to our secondary server. As part of post-failover ops, our Ansible playbook proceeded to log in to 1000 instances to update the hosts file for the new master. This caused traffic amplification, which caused our etcd nodes to believe they were down. As the etcd nodes failed over, our Ansible playbook then proceeded to log in to 1000 instances to update the hosts file...
Honestly, whatever system you built is just doing the exact same function as DNS, just with extra steps. If you squint really hard, /etc/hosts is your local DNS cache and Ansible is your resolver. I think this kind of "simplification fetishization" is dangerously attractive to people who have only managed relatively simple setups. I don't think anyone who has ever had to deal with high-availability failover would consider Ansible a good solution.
The problem that so many people hit with DNS isn't specific to DNS the protocol - it's the problem of service discovery. This architecture doesn't eliminate service discovery, it just moves it to a far more brittle configuration.
The application would log into every router in the network and run a massive, on-the-fly script to manually create a bunch of PPPoE services, shaping targets for those connections, update firewall rules etc.
It would also run manual MikroTik bandwidth tests across every logical link it was aware of.
The application developers were adamant that this was the best way of doing things, and any disagreement would have them point at their dozen or so customers and boast that they surely wouldn't have been able to hoodwink that many people if they were doing it wrong.
Anyway, we took a packet capture of the script updates that ran every 10 minutes and showed the customer that they consumed a whole-number percentage of the bandwidth to certain smaller sites. We were also able to show them that the "My internet goes out every 10 minutes" complaints stopped once we turned off the automatic MikroTik bandwidth tests running every 10 minutes.
To keep their customer, the application developers agreed to implement SNMP and RADIUS, but they never did. IIRC their fee was a flat 15% of all profits generated by the customer, which was just staggering. And the fee could rise if the customer asked for support.
That's still just as true for the intranets of the 2020s, with thousands of machines all downloading a HOSTS file several times a day (or even every hour or minute), as it was for the Internet of July 1983, when a file covering around 500 hosts was downloaded by everyone merely a couple of times per week. The fact that a file can be copied faster now is counterbalanced by the fact that tying this to real-time failover means that it needs to be updated and distributed several orders of magnitude more quickly than it was in 1983 too. And that's ignoring the linear nature of a HOSTS file lookup contrasted with even the stupidest DNS implementation.
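To make the "linear nature" point concrete, here is a rough Python sketch, not a description of how any particular libc resolver is written. `HOSTS_TEXT`, `parse_hosts`, `lookup_linear` and `build_index` are made-up names for illustration. It contrasts a top-to-bottom scan of a hosts file with a lookup against a prebuilt index, which is the least that any name server would do.

```python
# Illustrative only: the sample data and helper names are hypothetical.
HOSTS_TEXT = """\
# comment line
10.0.0.1  db-master db
10.0.0.2  etcd-1
10.0.0.3  etcd-2   # trailing comment
"""

def parse_hosts(text):
    """Return a list of (address, [names]) in file order."""
    entries = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        if names:
            entries.append((address, names))
    return entries

def lookup_linear(entries, name):
    """Scan top to bottom, first match wins. Cost grows with file length."""
    for address, names in entries:
        if name in names:
            return address
    return None

def build_index(entries):
    """One pass to build a name -> address map for average constant-time lookups."""
    index = {}
    for address, names in entries:
        for n in names:
            index.setdefault(n, address)  # keep the first match, like the scan
    return index

entries = parse_hosts(HOSTS_TEXT)
index = build_index(entries)
print(lookup_linear(entries, "etcd-2"), index.get("etcd-2"))
```

With three entries the difference is invisible. With tens of thousands of entries and a lookup on every connection, the scan is paid every time, while the index is paid once per update.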
Those who think that HOSTS is a fallback for any sort of dynamic operation (into and out of service) of even hundreds of machines have not learned the history of why the DNS even exists.
TFA proposed that /etc/hosts or the like should be used only for the benefit of administrators, to allow manual connections by name instead of by address, and presumably to make the activity logs easy to interpret. This is a desirable feature, but the network should work fine even when the name-to-address translation is temporarily unavailable because /etc/hosts files have not yet been updated.
Actually, I have for decades used a system similar to what TFA proposes, avoiding DNS queries for the internal networks while using my own caching DNS resolver for the Internet. But this was done only in relatively small networks, with a few hundred nodes at most, where the IP addresses changed infrequently. Thus I have no idea whether there would be scaling problems in a big network with frequently changing addresses.
Great piece of history. The RFC is a bit older than I am so I've never studied it. Looking at it that way, then OP has just re-invented DNS.
If you need to eliminate DNS and convince the internet it's largely unnecessary for the use-cases we have today...
...only to completely reinvent DNS for those use-cases with inferior technology that eventually becomes DNS...
...then you have achieved wisdom. I applaud the author for being on this journey.
Shell scripts wrapped in YAML
It is used to make entire protocols work (MX records for email, but SRV records are used for much more).
Now, if we do look at the most basic of basic DNS roles, mapping a human-readable name to an arbitrary set of numbers identifying a machine on the network, we should consider how we avoid some of the issues while keeping all of the benefits of DNS.
E.g. if we indeed "materialize" machine identifiers, we lose the ability to do virtual hosting (domains not passed in) or fix a problem with just a DNS update (e.g. treating load-balancing machines like cattle).
The author jumps immediately to arguably ill-advised materialization techniques like /etc/hosts, without considering all that DNS does for a complex, real-world system and what goes missing.
DNS is one mechanism of adding a layer of abstraction.
So let's not make a general argument when there are specifics to be discussed: do you have an argument for why mapping names to IDs is one abstraction too many here?
- DNS load balancing is not that important for internal services in most cases? Would only use it if alternatives won’t work.
- the virtual host issue is really addressed by /etc/hosts, I thought that was obvious, I now regret not explicitly addressing it.
In the other example (Amazon DynamoDB issue), the problem is with dynamically choosing from a large dynamic pool of IP addresses for a service — DNS is but one mechanism to do it. If it wasn't DNS, it could have been something else that did that job that was broken. Even /etc/hosts if it was updated with an empty record.
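Whichever mechanism ends up distributing the mapping, the "empty record" failure can be guarded against at the point of publication. A minimal Python sketch, assuming a hypothetical `validate_update` hook that runs before a DNS zone, a hosts file or a config template is written. This is not what AWS does, and the `min_fraction` threshold is an arbitrary illustration.

```python
import ipaddress

class RefusedUpdate(Exception):
    """Raised when a proposed record set looks too dangerous to publish."""

def validate_update(name, new_addresses, current_addresses, min_fraction=0.5):
    """Return the cleaned, de-duplicated address list or raise RefusedUpdate.

    Hypothetical policy: never publish an empty set, and never shrink the
    set to less than `min_fraction` of its current size in one step.
    """
    if not new_addresses:
        raise RefusedUpdate(f"{name}: refusing to publish an empty record set")
    # ip_address() raises ValueError on anything that is not a valid address.
    cleaned = sorted({str(ipaddress.ip_address(a)) for a in new_addresses})
    if current_addresses and len(cleaned) < min_fraction * len(current_addresses):
        raise RefusedUpdate(
            f"{name}: {len(current_addresses)} -> {len(cleaned)} addresses in one update"
        )
    return cleaned

current = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
print(validate_update("db.internal", ["10.0.0.2", "10.0.0.1", "10.0.0.5"], current))
try:
    validate_update("db.internal", [], current)
except RefusedUpdate as exc:
    print("blocked:", exc)
```

The check is independent of the transport, which is the point being made above: the risk sits in whatever manages the canonical mapping, not in DNS the protocol.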
What I am saying is that your analysis is not defining the problem you want solved exactly, your examples are not backing up your proposal or analysis, and you are ignoring all the things DNS does both for public and private infrastructure. You seem to have some intuition about this adding complexity and thus being a risk (which is true), but you need to do a better job of connecting and analysing real risks and proposed solutions (and their comparative performance).
Yet, I'm not arguing for Facebook or similar size companies to ditch DNS internally. I'm making the argument for much smaller organisations to pause and think where their own risks lie and if it would make sense to cut out DNS to reduce risk. Whatever process you used as an organisation to update DNS in a safe manner, you still use with the alternative solution, that doesn't change.
That said, even a broken update to /etc/hosts is probably easier and faster to recover from than a broken DNS service that everything is tied to and that, due to TTL caching, can take much longer to recover.
E.g. even if you are DNS based but have direct SSH access to the system which has a query cached and root access on it (you need to manage all this too!), you can temporarily edit /etc/hosts or /etc/resolv.conf to work around the cached value.
So my suggestion remains to keep working on a better argument and scenario by trying to understand exactly where your intuition applies, but be critical of yourself too, and think through whether your alternative has other cons as well.
By doing so, you will likely find why everybody defaults to DNS for a named service registry in a sense.
I fail to see how, especially if you were to accidentally break your ability to push those updates out.
A smaller organisation should have a much easier time implementing internal DNS and it should be pretty damn stable and reliable. Unfortunately a lot of people don't properly understand it (not that you need to be a complete expert - just competent) and hence we always have the mantra "It's always DNS" when something goes down.
Usually complicated bespoke systems engineered for internal use are better left for really large orgs that can hire the talent to maintain and properly implement them (and have the manpower to always have enough people on staff to maintain them when the first person gets sick or something).
We are talking about 300 sec (= 5 min); this is never an issue.
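For anyone unsure what the 300-second figure means in practice, here is a minimal Python sketch of TTL caching. `TtlCache` and the injected clock are hypothetical and do not describe how any particular resolver is implemented. The point is only that a stale answer can be served until the TTL runs out, and not after.

```python
class TtlCache:
    """Tiny name -> address cache with per-entry expiry (illustration only)."""

    def __init__(self, clock):
        self._clock = clock   # callable returning the current time in seconds
        self._entries = {}    # name -> (address, expires_at)

    def put(self, name, address, ttl):
        self._entries[name] = (address, self._clock() + ttl)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # expired: the caller has to re-resolve
            return None
        return address

now = [0]                          # fake clock so the example is deterministic
cache = TtlCache(lambda: now[0])
cache.put("db.internal", "10.0.0.1", ttl=300)

now[0] = 299
print(cache.get("db.internal"))    # still the old address
now[0] = 300
print(cache.get("db.internal"))    # None: forces a fresh lookup
```

So with a 300-second TTL the worst case for a wrong cached answer is bounded at five minutes per cache, assuming every cache in the path honors the TTL.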
This is the classic "easy vs. simple" folly: witness how someone too lazy to [learn how to] set up proper DNS for their infrastructure will do 10x the work hacking together something "easy".
If you remove DNS servers from the equation, you need to write down records for other domains, too. This means you have to chase every domain for changes in CDN configuration, hosting provider or ISP migrations, IPv4 to v6 migrations and so on.
You don't have PTR records, which means you can't find out a name from its IP address.
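As a small illustration of what a PTR lookup keys on, the Python standard library can derive the reverse-DNS owner name for an address. `ptr_name` is just a hypothetical wrapper. Answering the query still needs a server that holds the PTR record, which is exactly the part a hosts-file-only setup has to reinvent.

```python
import ipaddress

def ptr_name(address):
    """Return the owner name that a PTR query for this address would ask for."""
    return ipaddress.ip_address(address).reverse_pointer

print(ptr_name("192.168.0.4"))  # 4.0.168.192.in-addr.arpa
```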
You also miss other features of DNS, like SRV, MX and so on.
More subtly, you lose the ability to control DNS resolution over systems you can't control. If a DNS server says host.example.com is 192.168.0.4, a Windows desktop, a Linux server and your toaster will agree on that (especially if no local cache is enabled, but even then TTLs apply). If for some reason you cannot control a particular machine, you will never get it to consider that new DNS record. This can happen for a lot of reasons.
And I explicitly argue within the section about egress filtering that allowing systems access to public DNS is a security risk.
What happens if you need to scale up (either a lot or a little) and you need to hire new people?
People are often the weakest link in your infrastructure.
Because now you've replaced one single point of failure configuration system with caching and TTLs (DNS) with a higher maintenance and much less widely supported one.
For LB you'll need something in front of your service to bounce connections around, which is replacing one point of failure (DNS) with another (HAProxy, IPVS). Though I guess you can run the LB stack on your app service servers.
Thanks for the laugh...
Suggesting that a push-based, Ansible-based architecture will scale to hundreds of thousands of targets, with such pushes happening hundreds if not thousands of times a day, is a junior-level idea at best, dark comedy if I'm being charitable, and professional malpractice at worst.
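Some back-of-the-envelope arithmetic for the push model, as a Python sketch. The two seconds per host is an assumption for illustration, and `push_duration_seconds` and `connections_per_day` are made-up helpers. The one real-world number here is Ansible's default of 5 forks.

```python
import math

def push_duration_seconds(targets, forks, seconds_per_host):
    """Wall-clock time for one push if `forks` hosts are updated in parallel."""
    batches = math.ceil(targets / forks)
    return batches * seconds_per_host

def connections_per_day(targets, pushes_per_day):
    """Every push touches every target once."""
    return targets * pushes_per_day

TARGETS = 100_000        # "hundreds of thousands of targets", low end
PUSHES_PER_DAY = 100     # "hundreds if not thousands of times a day", low end
SECONDS_PER_HOST = 2.0   # assumption: connect, write the file, disconnect

budget = 86_400 / PUSHES_PER_DAY  # seconds available per push
for forks in (5, 500):
    duration = push_duration_seconds(TARGETS, forks, SECONDS_PER_HOST)
    print(f"forks={forks}: {duration:.0f}s per push, fits budget: {duration <= budget}")

print(f"{connections_per_day(TARGETS, PUSHES_PER_DAY):,} SSH sessions per day")
```

Under these assumptions the default fork count needs about 11 hours per push against a budget of under 15 minutes, and even at 500 forks the control node is opening ten million SSH sessions a day.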
> The Facebook / Meta outage was so significant
The author specifically called out the Meta outage, as if he was offering a prescription ("It's easy to configure systems with tools like Ansible or pyinfra at scale") that would have prevented Meta (at Meta's scale) from suffering an outage. The argument that Meta should not have used DNS except that Meta runs at a scale where DNS is necessary... who comes up with these arguments?
It is pretty insane to switch from DNS servers to pushing domain config to every single client every single update.
From TFA
>There are multiple(1) high-profile(2) incidents where DNS was involved. In these linked cases, the root-cause of the incident isn't the DNS system itself. Yet, because the root-cause affects the DNS service - which is in the critical path for virtually all services - the incident has such a huge impact.
From AWS incident report linked in TFA
>The root cause of this issue was a latent race condition in the DynamoDB DNS management system that resulted in an incorrect empty DNS record for the service’s regional endpoint
People, services, machines, etc need to "dial" canonical-somewhere. Whatever does the canonical management is the piece that when it breaks everything breaks.
Doesn't matter if it's DNS, EIP rotation, some HA proxy, whatever. It'll break.
It's actually that DNS is so well understood that it doesn't fail more often.
So no, DNS is for IT Infra.
That is absolutely true. I believe that a solution where you provision a text file with an updated ip address or /etc/hosts file is inherently simpler, less risky and easier to recover from, although I admit I don't explicitly state this in the article.
> I believe that a solution where you provision a text file with an updated ip address or /etc/hosts file is inherently simpler
So simple that it doesn't scale beyond a few machines nor outside your org.
You are wrong. It's possible that your confidence in being wrong is due to your inexperience. But you are still wrong.
One annoying reason is you don't own it/have access through the owner anymore.
> Sure would be nice to just update the DNS record to point to the new address.
EC2. Elastic IPs are easy enough, but, precisely, I would just like to make a Route53 alias for an EC2 instance and not even have to care.
If it's a list of IP addresses, having a list of IP addresses is a crude service discovery protocol.
Tasking developers (because let's be absolutely clear, the idea of removing DNS from production environments is something only a developer could come up with, no competent engineer would ever raise) with maintaining ordered lists of servers to keep updated is only going to overcomplicate things.
And yes your hosts file is another example of a list.
I take from some of the other comments that he uses /etc/hosts, pushed to hosts with Ansible, to provide name resolution. That sounds convoluted, given that /etc/resolv.conf and the libc resolvers work. Go for the lowest fallback and dump files with Ansible? That's a homelab with extra steps, as setting up a DNS server is easy. Consider CoreDNS or dnsmasq if BIND is too much.
And for very specific nit picks, and I can’t believe I’m entertaining this idea enough to ask, but tell me how the new device on the network bootstraps without DNS? And the guest device. And the printer without Ansible support. And the NDI receiver that needs to resolve its host. And how do you resolve split brain resolution for roaming devices? Are you going to publicly address all internal resources now so my laptop keeps working outside the office?
DNS was not created as a random solution looking for a problem…
How does a new device bootstrap on the network without DNS? It depends on the device, but a physical server doesn't need DNS, only PXE boot / TFTP / HTTP as usual, and maybe a proxy to access an update server if you don't run one yourself.
DHCP
But to address the article: in a simple environment DNS _just_works_. I’ve never once had an issue with BIND. It’s incredibly simple, stable, and easy to understand when working within a small environment without much churn, and it enables other technologies to operate in an expected way because it’s the standard. ACME, Kerberos, SSHFP, and many more are enabled by DNS. Sure, maybe you can kludge some of that back together with hosts, but I’d rather not, just to replace one of the most stable services that exist.
DNS does start to get more complicated in massive environments but that’s just a reflection of the environment. Using Ansible to manage /etc/hosts across hundreds or thousands of machines with churn will not be less complicated to manage than DNS.
The problem with DNS per the haiku isn’t that it’s difficult to understand, or even that running your own DNS server is particularly difficult. It’s that coordinating information and exchange at scale is a tricky problem with a lot of non-obvious edge cases and foot guns.
So trying to reduce complexity by sidestepping DNS really doesn’t do that, it just leaves you holding the bag on all the problems that DNS was quietly solving for you in the background.
For small groups of servers, with limited egress communication, it might nevertheless make sense. And then go for it, by all means. As a general replacement for DNS, not likely.
It is hard to see how Ansible should be simpler than DNS. Maybe if you have worked with Ansible and not DNS, you might think so.
In SOHO settings I might actually agree, but, this is where I think site administered and distributed multicast DNS was a missed opportunity.
> https://www.rfc-editor.org/info/rfc9364/#name-dnssec-core-do...
DNS is for Infrastructure, people use infrastructure.
>That got me thinking, why would we use DNS for infrastructure services? It isn't necessary for machine-to-machine communication. Instead of configuring domain names that may not resolve, we can just directly inject the appropriate IP address(ess) into configuration files. It's easy to configure systems with tools like Ansible or pyinfra at scale.
No no no no god no.
"What if we set up a convoluted higher level application solution"
This is going to go wrong more frequently and contain more errors than DNS.
>Fortunately, we still have /etc/hosts, which we can easily provision. Still no DNS service required! This way, we can configure domain names and pretend to use DNS. I also suspect that DNS queries against /etc/hosts are quite responsive.
No, that's a horrible idea. Userspace should never be updating your hosts file; users will fall behind on changes and be placed at extreme security risk. Fully half the benefit of UAC on Windows is preventing persistence by preventing malicious entities from updating hosts.
>As of today, most network traffic is encrypted by default, or tunneled through an encrypted channel. DNS is - by default - the exception.
DNS is mostly secure now, to the point where it's a problem. But that's a vendor issue, not a you issue; please don't attempt to solve it. If you go full encrypted DNS you generally also get dragged into HTTPS proxying and things of that nature. This does not get better by removing a dynamic protocol for querying names.
>Due to this risk, there is a case to be made, to - at least - not allow systems to query public DNS records. As servers may need to interact with services on the internet (update servers, APIs, and so on), such access can be facilitated by a proxy server using allow-listed domains.
Attackers use DNS because it's versatile and resistant to the very issues you keep confidently presenting. A protocol is not a risk just because hackers use it. Hackers also use HTTPS and other protocols but we aren't burning them at the stake.
>That said, I think it's reasonable to explore if DNS can be avoided altogether within the IT infrastructure to increase reliability and robustness.
It's reasonable for people with much better understanding of the infrastructure and protocol to examine these things. This reads like an end user suggesting "what if we deliver websites by hand printed on paper".
This isn't how any of this was really supposed to work. Back in the day the application identifier was the _port number_, according to a big list maintained by IANA. The idea was that you could go to a machine (identified by IP or more conveniently by DNS) and see if it was running an instance of the ‘Facebook’ application, i.e. you'd find Facebook not at facebook.com:https but at meta.com:facebook. The end goal was to eliminate the need for the former part at some point, and come up with a better way of looking up applications than distributing a list by email. Instead the application ID is now used for transport and the host name instead encodes application ID, which it was never meant for, and that's why we can't have nice things (like device mobility).
Tell me that you've never used Ansible at scale without telling me that you've never used Ansible at scale.
Ansible also does not have locks or any coordination between parallel users, so you’ll need a single user/VM/GHA workflow running the playbook or at some point concurrent users will start overwriting each other's changes.
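If you do end up with a single control host, the crude way to serialize runs is an advisory file lock around the playbook invocation. A minimal Python sketch, assuming a Unix control host. `single_runner` is a made-up helper, not an Ansible feature, and it does nothing for users running the playbook from different machines.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def single_runner(lock_path):
    """Hold an exclusive, non-blocking advisory lock for the duration of the block.

    Raises BlockingIOError immediately if another runner already holds it.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        yield
    finally:
        os.close(fd)  # closing the descriptor releases the lock

# A real setup would use one fixed, well-known path; a temp dir keeps the demo isolated.
lock_path = os.path.join(tempfile.mkdtemp(), "hosts-push.lock")

with single_runner(lock_path):
    # The playbook run would go here, e.g. via subprocess.run([...]).
    print("lock held, safe to push")
    try:
        with single_runner(lock_path):
            print("should not happen")
    except BlockingIOError:
        print("second runner refused")
```

That covers one machine. Coordinating several operators or CI jobs needs a shared lock service, at which point you are running more infrastructure just to avoid running DNS.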
Now it’s for machines