You want an upside-down pyramid, in which every checked subsystem contributes an OK or some failure, and failure of the checks themselves is the most serious failure, so the output from the bottom of your pyramid is in theory a single green OK. In practice, some systems have always failed or are operating in some degraded state.
In this design the alternatives are: 1. Monitor says the Geese are Transmogrified correctly, or 2. Monitoring detected a Goose Transmogrifier problem, or 3. Goose Transmogrifier Monitor failed. The absence of any overall result is a sign that the bottom of the pyramid failed: there is a major disaster, and we need to urgently get monitoring working.
What I tend to see instead is a pyramid where alternatives 1 and 2 work but 3 is silent, the summarisation layer can fail silently too, and so can every subsequent layer. In such a system you always have an unknown number of silently failed systems. You are flying blind.
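To make the three alternatives concrete, here is a minimal sketch of one summarisation layer in which a check that did not report is never counted as OK. The names `Status`, `summarise` and `overall` are made up for illustration and do not come from any particular monitoring tool.

```python
from enum import Enum


class Status(Enum):
    OK = "ok"                          # alternative 1: checked, all good
    PROBLEM = "problem"                # alternative 2: checked, fault found
    MONITOR_FAILED = "monitor_failed"  # alternative 3: the check itself did not report


def summarise(expected, reports):
    """Return one Status per expected check.

    expected: names of every check that must report.
    reports:  dict of name -> Status for the checks that actually reported.
    A check missing from `reports` becomes MONITOR_FAILED, never OK.
    """
    return {name: reports.get(name, Status.MONITOR_FAILED) for name in expected}


def overall(expected, reports):
    """Collapse all checks into the single result at the bottom of the pyramid."""
    results = summarise(expected, reports)
    # A failed monitor outranks a detected problem: we no longer know the state.
    if Status.MONITOR_FAILED in results.values():
        return Status.MONITOR_FAILED
    if Status.PROBLEM in results.values():
        return Status.PROBLEM
    return Status.OK
```

The layer above would apply the same rule to this layer's output, so a summariser that never reports shows up as a monitor failure one level up instead of disappearing.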
I have a simple Python script that runs every day and checks the certificates of multiple sites.
One time this script signaled that a cert was close to expiring even though I saw a newer cert in my browser. It turned out that I had accidentally launched another reverse proxy instance which was stuck on the old cert. Requests were randomly passed to either instance. The script helped me correct this mistake before it caused issues.
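For anyone who wants something similar, a rough sketch using only the Python standard library could look like the following. This is not the script described above; the function names, the 14-day default and the message format are my own assumptions.

```python
import socket
import ssl
import time


def fetch_cert(host, port=443, timeout=10.0):
    """Do a normal verified TLS handshake and return the peer certificate as a dict."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def days_left(cert, now=None):
    """Days until the certificate's notAfter timestamp."""
    now = time.time() if now is None else now
    expires = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires - now) / 86400


def check_sites(hosts, warn_days=14, fetch=fetch_cert, now=None):
    """Return one message per host that is close to expiry or could not be checked."""
    problems = []
    for host in hosts:
        try:
            remaining = days_left(fetch(host), now=now)
        except Exception as exc:
            # An already expired cert fails the handshake and ends up here,
            # so a failed check is reported rather than ignored.
            problems.append(f"{host}: check failed: {exc}")
            continue
        if remaining < warn_days:
            problems.append(f"{host}: {remaining:.1f} days left")
    return problems
```

Run from a daily job with your own host list, and send the returned messages wherever your alerts go. Because each run makes a fresh connection, it will sooner or later land on a stray instance like the one in the story.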
For example say you've got an internal test endpoint, two US endpoints and a rest-of-world endpoint, physically located in four places. Maybe your renewal process works with a month left - but the code to replace working certificates in a running instance is bugged. So, maybe Monday that renewal happens, your "CT log monitor" approach is green, but nobody gets new certs.
On Wednesday engineers ship a new test release to the test endpoint, restarting and thus grabbing the renewed cert, for them everything seems great. Then on Friday afternoon a weird glitch happens for some US customers, restarting both US servers seems to fix the glitch and now US customers also see a renewed cert. But a month later the Asian customers complain everything is broken - because their endpoint is still using the old certificate.
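One way to catch that scenario is to check every endpoint separately instead of whichever one DNS happens to hand you, and compare what each is serving. A sketch, assuming the endpoints are reachable from the monitoring host; all function names here are hypothetical.

```python
import socket
import ssl


def addresses_for(host, port=443):
    """Every address the local resolver returns for host."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def cert_from_address(host, address, port=443, timeout=10.0):
    """Connect to one specific address but present `host` for SNI and verification."""
    ctx = ssl.create_default_context()
    with socket.create_connection((address, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


def stale_endpoints(certs):
    """Given endpoint -> cert dict, return endpoints whose cert expires
    earlier than the newest cert seen on any endpoint."""
    expiry = {ep: ssl.cert_time_to_seconds(c["notAfter"]) for ep, c in certs.items()}
    newest = max(expiry.values(), default=0)
    return sorted(ep for ep, t in expiry.items() if t < newest)
```

Note that `addresses_for` only sees what the monitoring host's resolver returns. With geo-routed DNS you would need an explicit list of endpoint addresses, otherwise the rest-of-world endpoint in the example stays invisible.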
I manually approve the authenticity of the server on the first connection.
From then, the only time I'd be prompted again would be, if either the server changed or if there's a risk of MITM.
Why can't we have this for the web?
How do you propose to scale trust on first use? SSH basically says the trusting of a key is "out of scope" for them and makes it your problem. As in: You can put on a piece of paper, tell it over the phone, whatever, but SSH isn't going to solve it for you. How is some user landing on a HTTPS site going to determine the key used is actually trustworthy?
There have actually been attempts at solving this with something like DANE [1]. For a brief period Chrome had DANE support, but it was removed for being too complicated and for sitting in (security) critical components. Besides, since DNSSEC has some cracks in it (your local resolver probably doesn't check it), you can have a discussion about how secure DANE is.
[1] https://en.wikipedia.org/wiki/DNS-based_Authentication_of_Na...
The main lesson we took from this was: you absolutely need monitoring for cert expiration, with alert when (valid_to - now) becomes less than typical refresh window.
It's easy to forget this, especially when it's not strictly part of your app, but essential nonetheless.
You can update your cert to prepare for it by appending ---NEW CERT--- to the same file as ---OLD CERT---.
But you also need to know where all your certificates are located. We were using Venafi for the auto discovery and email notifications. Prometheus ssl_exporter with Grafana integration and email alerts works the same. The problem is knowing where all the hosts, containers and systems that have certs are located. A simple nmap-style scan of all endpoints can help. But you might also have containers with certs, or you might have certs baked into VM images. Sure, there are all sorts of things like storing the cert in a CICD global variable, bind mounting secrets, Vault Secret Injector, etc.
But it’s all rooted in maintaining a valid, up to date TLS inventory. And that’s hard. As the article states: “There’s no natural signal back to the operators that the SSL certificate is getting close to expiry. To make things worse, there’s no staging of the change that triggers the expiration, because the change is time, and time marches on for everyone. You can’t set the SSL certificate expiration so it kicks in at different times for different cohorts of users.”
Every time this happens you whack-a-mole a change. You get better at it, but not before you lose some credibility.
I’ve always used the calendar event before expiry and then manual renew option but I wonder why I didn’t do this. It’s trivial to roll out. With Route53 just make one canary LB and balance 1% traffic to it. Can be entirely automated.
A certificate renewal process has several points at which failure can be detected and action taken, and it sounds like this team was relying only on a “failed to renew” alert/monitor.
A broken alerting system is mentioned: it “didn’t alert for whatever reason”.
If this certificate is so critical, they should also have something that alerts if you’re still serving a certificate with less than 2 weeks validity - by that time you should have already obtained and rotated in a new certificate. This gives plenty of time for someone to manually inspect and fix.
Sounds like a case of “nothing in this automated process can fail, so we only need this one trivial monitor which also can’t fail so meh” attitude.
This is also why you want a mix of alerts from the service user's point of view, as well as internal troubleshooting alerts. The user's point-of-view alerts usually give more value and can be surprisingly simple at times.
"Remaining validity of the certificates offered by the service" is a classic check from the user's point of view. It may not tell you why this is going wrong, but it tells you something is going wrong. This captures a multitude of different possible errors: certs not reloading, the wrong certs being loaded, certs not being issued, DNS going to the wrong instance, new, shorter cert lifecycles, outages at the CA, and so on.
And then you can add further checks into the machinery to speed up the process of finding out why: Checks if the cert creation jobs run properly, checks if the certs on disk / in secret store are loaded or not, ...
Good alerting solutions might also allow relationships between these alerts to simplify troubleshooting as well: Don't alert for the cert expiry, if there is a failed cert renew cron job, alert for that instead.
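As a toy illustration of that relationship (not how any specific alerting product implements it), the suppression rule can be written as a small function. The alert names and the `caused_by` mapping are invented for the example.

```python
def alerts_to_send(firing, caused_by):
    """Drop symptom alerts whose known cause is also firing.

    firing:    iterable of alert names that are currently firing.
    caused_by: dict of symptom alert -> list of alerts that explain it.
    """
    firing = set(firing)
    return sorted(
        alert
        for alert in firing
        # Keep the alert unless at least one of its known causes is firing too.
        if not firing.intersection(caused_by.get(alert, ()))
    )


# Hypothetical dependency: an expiring cert is explained by a failed renew job.
DEPENDENCIES = {"cert_expiring": ["renew_job_failed"]}
```

With this mapping, a failed renew job pages once under its own name, while an expiring cert with a healthy renew job still pages, which is the case where the user-facing check earns its keep.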
Don't worry. With 2 or 3 industry players dictating how all TLS certs work, now your certs will expire in weeks rather than years, so you will all be subject to these failures more frequently. But as a back-stop to process failures like this, use automated immutable runbooks in CI/CD. It works like this:
1) Does it need a runbook? Ask yourself, if everything was deleted tomorrow, do you (and all the other people) remember every step needed to get everything running again? If not, it needs a runbook.
2) What's a runbook? It's a document that gives step by step instructions to do a thing. The steps can be text, video recordings, code/shell snippets, etc as long as it does not assume anything and gives all necessary instructions (or links to them) so a random braindead engineer at 3am can just do what it says and it'll result in a working thing.
3) Automate the runbook over time. Put more and more of the steps into some kind of script the user can just run. Put the script into a Docker container so that everyone's laptop environment doesn't have to be identical for the steps to work.
4) Run the containerized script from CI/CD. This ensures all credentials, environment vars, networking, etc are the same when it runs which better ensures success, and that leads to:
5) Running it frequently/on a schedule. Most CI/CD systems support scheduled jobs. Run your runbooks frequently to identify unexpected failures and fix bugs. Most of you get notifications for failed builds, so you'll see failed runbooks. If you use a cron job on a random server, the server could go down, the job could get deleted, or the reports of failure could go to /dev/null; but nobody's missing their CI/CD build failures.
Running runbooks from CI/CD is a game changer. Most devs will never update a document. Some will update code they run on their laptop. But if it runs from CI/CD, now anyone can run it, and anyone can update it, so people actually do keep it up to date.
Having only one SSL certificate is a single point of failure, even though we have eliminated single points of failure almost everywhere else.
Edit: but to be clear, I don’t understand why you’d want this. If you’re worried about your CA going offline, you should shorten your renewal period instead.
Update: looks like the answer is yes. So then the issue is people not taking advantage of this technique.
The main reason to have multiple certs is so if your host (and cert prov key) is compromised, you can quickly switch to a backup, without first having to sort out getting a new cert issued.
Both Apache (SSLCertificateFile) and nginx (ssl_certificate) allow for multiple files, though they cannot be of the same algorithm: you can have one RSA, one ECC, etc, but not (say) an ECC and another ECC. (This may be a limitation of OpenSSL.)
So if the RSA expires on Feb 1, you can have the ECC expire on Feb 14 or Mar 1.
It is not about encryption (for that, a self-signed certificate lasting till 2035 would suffice) but about verification: who am I talking with? Reaching the right server can be messed up by DNS or routing, among other things. Yes, that adds complexity, but we are talking more about trust than technology.
And once you recognize that it is essential to have a trusted service, give it the proper instrumentation to ensure that it works properly, including monitoring and expiration alerts, and documentation about it; don't just say "it works" and dismiss it.
May we retitle the post as "The dangers of not understanding SSL Certificates"?
And as I said above, SSL is about more than encryption; it is also about knowing that you are connecting to the right party. Maybe for a repository with multiple mirrors and DNS aliases, a layer of "knowing whom this comes from" is not that essential, but for most of the rest, even if the information is public, knowing that it comes from the authoritative source, or really from who you think it comes from, is important.
In technology, there are known problems and unknown problems. Expiring TLS certificates is a known problem which has an established solution.
Imagine if only some of the requests failed because a certificate is about to expire. That would be a debugging nightmare.
You'll have one of these things anyway, and I haven't seen one yet that doesn't let you monitor your cert and send you expiration notices in advance.
But for human persons and personal websites HTTP+HTTPS fixes this easily and completely. You get the best of both worlds. Fragile short lifetime pseudo-privacy if you want it (HTTPS) and long term stable access no matter what via HTTP. HTTPS-only does more harm than good. HTTP+HTTPS is far better than either alone.
> There’s no natural signal back to the operators that the SSL certificate is getting close to expiry.
There is. The not after is right there in the certificate itself. Just look at it with openssl x509 -text and set yourself up some alerts… it’s so frustrating having to refute such random bs every time when talking to clients because some guy on the internet has no idea but blogs about their own inefficiencies.
Furthermore, their autorenew should have been failing loud and clear, everyone should know from metrics or logs… but nobody noticed anything.
OpenSSL is still called OpenSSL. Despite "SSL" not being the proper name anymore, people are still going to use it.
By the way, TLS 1.3 is actually SSL v3.4 :)
The TLS 1.3 version in the record header is 3.1 (that used by TLS 1.0), and later, in the client version, it is 3.3 (that used by TLS 1.2). Neither is correct; they should be 3.4, or 4.0, or something incrementally larger than 3.1 and 3.3.
This number basically corresponds to the SSL 3.x branch from which TLS descended. There's a good website which visually explains this:
https://tls13.xargs.org/#client-hello/annotated
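If you want to see those two numbers yourself, here is a small sketch that pulls them out of the first bytes of a captured ClientHello record. It assumes the ClientHello starts at the beginning of the record, and the function name is just for illustration.

```python
import struct


def legacy_versions(record):
    """Return ((record_major, record_minor), (hello_major, hello_minor))
    from the start of a TLS record that carries a ClientHello."""
    # Record header: content type (1 byte), version (2 bytes), length (2 bytes).
    content_type, rec_major, rec_minor, _length = struct.unpack("!BBBH", record[:5])
    if content_type != 0x16:
        raise ValueError("not a handshake record")
    # Handshake header: message type (1 byte), length (3 bytes).
    if record[5] != 0x01:
        raise ValueError("not a ClientHello")
    # The ClientHello body starts with its 2-byte version field.
    return (rec_major, rec_minor), (record[9], record[10])


# Made-up leading bytes of a TLS 1.3 ClientHello, only long enough to parse.
sample = bytes.fromhex("16030100f8" "010000f4" "0303")
print(legacy_versions(sample))  # ((3, 1), (3, 3))
```

In TLS 1.3 the version actually being offered travels in the supported_versions extension as 0x0304, which is where the "3.4" ends up on the wire.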
As for whether someone is correct for calling out TLS 1.x as SSL 3.(x+1), IDK how much it really matters. Maybe they're correct in some nerdy way, like I could have called Solaris 3 SunOS6, and maybe there were some artifacts in the OS to justify my feelings about that. It's certainly more proper to call things by their marketing name, but it's also interesting to note how they behave on the wire.
How long did it take for us to get to a "letsencrypt" setup? And exactly 100ms before that existed, you (meaning 90% of you) mocked and derided that very idea.
This has given me some interesting food for thought. I wonder how feasible it would be to create a toy webserver that did exactly this (failing an increasing percentage of requests as the deadline approaches)? My thought would be to start failing some requests as the deadline approaches a point where most would consider it "far too late" (e.g. 4 hours before `notAfter`). At this point, start responding to some percentage of requests with a custom HTTP status code (599 for the sake of example).
Probably a lot less useful than just monitoring each webserver endpoint's TLS cert using synthetics, but it's given me an idea for a fun project if nothing else.
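The core of such a toy server would be a function mapping time left to a failure probability. Here is one possible sketch, assuming a linear ramp over the last 4 hours; the ramp shape, the constant and the function names are arbitrary choices, not a recommendation.

```python
import random

RAMP_SECONDS = 4 * 3600  # start failing requests 4 hours before notAfter


def failure_probability(seconds_left, ramp=RAMP_SECONDS):
    """0.0 while more than `ramp` seconds remain, rising linearly to 1.0 at expiry."""
    if seconds_left >= ramp:
        return 0.0
    if seconds_left <= 0:
        return 1.0
    return 1.0 - seconds_left / ramp


def should_fail(seconds_left, rng=random.random):
    """Decide for a single request whether to answer with the custom error status."""
    return rng() < failure_probability(seconds_left)
```

A request handler would call `should_fail` with the seconds remaining until `notAfter` and return the custom status (599 in the example above) when it is true.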
Just check expiration of the active certificate; if it’s under a threshold (say 1 week, assuming you auto-renew it when it’s 3 weeks to expiry; still serving a cert when it’s 1 week to expiration is enough signal that something went wrong) then you alert.
Then you just need to test that your alerting system is reliable. No need to use your users as canaries.
In real life, I guess there are people who don't monitor at all. For them failing requests would go unnoticed ... for the others monitoring must be easy.
But I think the core thing might be to make monitoring SSL lifetime the "obvious" default: All the grafana dashboards etc should have such an entry.
Then as soon as I set up a monitoring stack I get that reminder as well.
The canary narrows the blast radius and time-to-detection.