Since the servers were mine, I could see what was happening, and I was very impressed. Within, I want to say, two minutes, the instances had been fully provisioned and were actively archiving videos as fast as possible, fully saturating the connection, with each instance knowing to grab only videos the other instances had not already gotten. Basically, they have always struck me as not only having a solid mission but also being ultra-efficient in how they carry it out.
Edit: Like they kinda seem like an unnecessary middle-man between the archive and archivee, but maybe I'm missing something.
This is in contrast to the Wayback Machine's built-in crawler, which is just a broad-spectrum internet crawler without any specific rules, prioritizations, or supplementary link lists.
For example, one ArchiveTeam project had the goal of saving as many obscure wikis as possible, using the MediaWiki export feature rather than just grabbing page contents directly. This came in handy for the thousands of wikis that were affected by Miraheze's disk failure and happened to have backups created by this project. Thanks to the domain-specific technique, the backups were high-fidelity enough that many users could immediately restart their wiki on another provider as if nothing had happened.
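As a rough illustration of the approach (this is not ArchiveTeam's actual tooling, and the wiki URL is made up), you ask MediaWiki's own export machinery for structured XML instead of scraping rendered HTML:

  import requests

  API = "https://wiki.example.org/w/api.php"  # hypothetical MediaWiki API endpoint

  def export_page(title: str) -> str:
      # Ask MediaWiki itself for the XML export of a page; with
      # exportnowrap the response body is the raw <mediawiki> XML.
      resp = requests.get(API, params={
          "action": "query",
          "titles": title,
          "export": 1,
          "exportnowrap": 1,
          "format": "json",
      }, timeout=60)
      resp.raise_for_status()
      return resp.text

  with open("Main_Page.xml", "w", encoding="utf-8") as f:
      f.write(export_page("Main Page"))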
They also try to "graze the rate limit" when a website announces a shutdown date and there isn't enough time to capture everything. They actively monitor for error responses and adjust the archiving rate accordingly, to get as much as possible as fast as possible, hopefully without crashing the backend or inadvertently archiving a bunch of useless error messages.
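A toy sketch of that idea (not their actual code; the delay bounds are assumptions and save() is a hypothetical archive step): speed up while responses look healthy, back off on errors, and never save the error pages themselves.

  import time
  import requests

  MIN_DELAY, MAX_DELAY = 0.05, 60.0  # assumed bounds, in seconds

  def crawl(urls, save):
      delay = 1.0
      for url in urls:
          resp = requests.get(url, timeout=30)
          if resp.status_code in (429, 500, 502, 503):
              delay = min(delay * 2, MAX_DELAY)    # back off hard on errors
          else:
              save(resp)                           # archive the good response
              delay = max(delay * 0.9, MIN_DELAY)  # creep back toward the limit
          time.sleep(delay)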
They are the middlemen who collect the data to be archived.
In this example the archivee (goo.gl/Alphabet) is simply shutting the service down and has no interest in archiving it. Archive.org is willing to host the data, but only if somebody brings it to them. ArchiveTeam writes and organises crawlers to collect the data and send it to Archive.org.
(Source: ran a Warrior)
Is that the story, or are you saying that the machine was secured correctly but that running Warrior somehow introduced risk to your network?
If the Internet Archive is a library, ArchiveTeam is the people who run around collecting stuff and give it to the library for safekeeping. Stuff that is estimated/announced to be disappearing or removed soon tends to be the focus, too.
That already exists; it's called Common Crawl:
It’s smaller than Google’s index and Google does not represent the entirety of the web either.
For LLM training purposes this may or may not matter, since it does have a large amount of the web. It’s hard to prove scientifically whether the additional data would train a better model, because no one (afaik), not Google, not Common Crawl, not Facebook, not the Internet Archive, has a copy that holds the entirety of the currently accessible web (let alone dead links). I’m often surprised, using Google-fu, at how many pages I know exist, even from famous authors, that just don’t appear in Google's index, Common Crawl, or IA.
Hopefully it's not people intentionally allowing the Google crawler and intentionally excluding Common Crawl with robots.txt?
For digital preservation? We may discuss. For an LLM? Haha, no.
No, thank you.
Enlisting in the Fight Against Link Rot - https://news.ycombinator.com/item?id=44877021 - Aug 2025 (107 comments)
Google shifts goo.gl policy: Inactive links deactivated, active links preserved - https://news.ycombinator.com/item?id=44759918 - Aug 2025 (190 comments)
Google's shortened goo.gl links will stop working next month - https://news.ycombinator.com/item?id=44683481 - July 2025 (222 comments)
Google URL Shortener links will no longer be available - https://news.ycombinator.com/item?id=40998549 - July 2024 (49 comments)
Ask HN: Google is sunsetting goo.gl on 3/30. What will be your URL shortener? - https://news.ycombinator.com/item?id=19385433 - March 2019 (14 comments)
Tell HN: Goo.gl (Google link Shortener) is shutting down - https://news.ycombinator.com/item?id=16902752 - April 2018 (45 comments)
Google is shutting down its goo.gl URL shortening service - https://news.ycombinator.com/item?id=16722817 - March 2018 (56 comments)
Transitioning Google URL Shortener to Firebase Dynamic Links - https://news.ycombinator.com/item?id=16719272 - March 2018 (53 comments)
Per Google, shortened links “won't work after August 25 and we recommend transitioning to another URL shortener if you haven’t already.”
Am I missing something, or doesn’t this basically obviate the entire gesture of keeping some links active? If your shortened link is embedded in a document somewhere and can’t be updated, Google is about to break it, no?
But as I said in a sibling comment to yours, I don't see the point of the distinction. Why not just keep them all going? Surely the mostly unused ones are even cheaper to serve.
(In addition to the higher-activity ones the parent link says they'll now continue to redirect.)
They already have plenty of unused compute, older hardware, CDN POPs, a performant distributed data store, and everything else possibly needed.
It would be cheaper than the free credits they give away to just one startup to be on GCP.
I don't think infra costs are a factor in a decision like this.
Unless I'm just super smart (I'm not), it's pretty easy to write a URL shortener as a key-value system, and pure key-value stuff is pretty easy to scale. I cannot imagine they aren't doing something as efficient as, or more efficient than, what I did.
Obviously raw server costs aren't the only costs associated with something like this; you'd still need to pay software people to keep it on life support. But considering how simple URL shorteners are to implement, I still don't think it would be that expensive.
ETA:
I should point out that even something kind of half-assed could be built with Cloud Functions and Bigtable really easily. It wouldn't win any contests for low latency, but it would be exceedingly simple code, have sufficient uptime guarantees, and be much less likely to piss off the community.
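Something like this minimal sketch, assuming an HTTP Cloud Function in front of a Bigtable table keyed by short code (the project, instance, table, and column names are all hypothetical; this is obviously not how goo.gl is actually built):

  import functions_framework
  from google.cloud import bigtable

  client = bigtable.Client(project="my-project")          # hypothetical project
  table = client.instance("links").table("short_links")   # hypothetical instance/table

  @functions_framework.http
  def redirect(request):
      # Look up the short code in Bigtable and answer with a 302.
      code = request.path.lstrip("/")
      row = table.read_row(code.encode("utf-8"))
      if row is None:
          return ("Not found", 404)
      target = row.cells["cf"][b"url"][0].value.decode("utf-8")
      return ("", 302, {"Location": target})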
If I had any idea how to reach out to higher-ups at Google I would offer to contract and build it myself, but that's certainly not necessary; they have thousands of developers, most of whom could write this themselves in an afternoon.
Either way, we're talking about a dataset that fits easily in a 1U server with at most half of its SSD slots filled.
The list of short links and their target URLs can't be 91 TiB in size, can it? Does anyone know how this works?
705 bytes is an extremely long URL. Even if we assume that URLs that get shortened tend to be longer than URLs overall, that’s still an unrealistic average.
https://web.archive.org/web/20250125064617/http://www.superm...
There used to be one such project (Pushshift), before the Reddit API change. You can download all the data and see all the info on the-eye, another datahoarder/preservationist group:
Not that I know of, and you haven't even been able to archive tweets on the Wayback Machine for YEARS.
Given that Firebase (which powers the API link at the bottom of this page) is a Google property, I cannot possibly imagine why they'd differ.
Even though all I did was set up the Docker container one day and forget about it.
I'm sure pastebin is filled with people's AWS credentials, too, but you don't see them randomly denying access to listings
The sibling link above that queries Wayback's WARC index shows at least the first several are only 6 alphanumeric characters wide, so it's no wonder ArchiveTeam got them in reasonable time.
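Back-of-envelope on why that's tractable (the aggregate request rate here is a made-up assumption):

  keyspace = 62 ** 6        # a-z, A-Z, 0-9 -> 56,800,235,584 possible codes
  req_per_sec = 50_000      # hypothetical aggregate rate across all Warriors
  days = keyspace / req_per_sec / 86_400
  print(f"{keyspace:,} codes, ~{days:.0f} days at {req_per_sec:,} req/s")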
Picking one at random, it seems the super sekrit deets you're safeguarding include buyrussia21.co.kr, which, yes, is for sure very, very secret.
But, OK, let's continue in good faith:
Scenario 1: they don't want to uncork the .warc files because it would potentially leak the means and methods of the Archive Warrior or its usage.
Scenario 2: they don't want to expose the targets of the redirects because it would feed the boundaries of the ravenous AI slurp machines.
If it's scenario 1, then CSV exists and allows mapping from the 00aa11 codes to the "Location:" header, no means and methods necessary (a rough sketch of that mapping is below).
If it's scenario 2, then what the hell were they expecting to happen? Embargo the .warc files until the AI hype blows over so their great-grandchildren can read about how the Internet was back in the day? I guess the real question is "archive for whom?", because right now, unless they have a back-channel way to feed the Wayback Machine's boundary from the .warc files, so that the Wayback gets populated without wholesale feeding the AI boundary, this whole thing is just mysterious.
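For what it's worth, the scenario-1 mapping really is that mechanical. A rough sketch (not ArchiveTeam's tooling) using warcio, assuming the WARC response records are the 301/302 answers for each short link:

  import csv
  import sys
  from warcio.archiveiterator import ArchiveIterator  # pip install warcio

  with open(sys.argv[1], "rb") as warc, open("mapping.csv", "w", newline="") as out:
      writer = csv.writer(out)
      writer.writerow(["short_url", "location"])
      for record in ArchiveIterator(warc):
          # Only HTTP responses carry the redirect we care about.
          if record.rec_type != "response" or record.http_headers is None:
              continue
          location = record.http_headers.get_header("Location")
          if location:
              writer.writerow([record.rec_headers.get_header("WARC-Target-URI"), location])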
Why can't I download the WARCs for some projects?
For many projects, WARCs are restricted and aren't available for download. In these cases, the archived data is still available for viewing in the Wayback Machine. (See this IRC log[2] for the general reasoning behind this.) If you'd like to discuss access to the data, please contact the Internet Archive.
[1] https://wiki.archiveteam.org/index.php/Frequently_Asked_Ques...
[2] https://irclogs.archivete.am/archiveteam-bs/2025-03-24#ld813...
Sure, maybe the WARCs will be unlocked at some point in the future. This is a fairly small volunteer effort; I doubt there is some "unlock in 100 years" feature on IA.
How would that even function? I mean, did they loop through every single permutation and see the result, or how exactly would that work?
In short, yes. Since no one can make new links, it's a pre-defined space to search. They just requested every possible key, recorded the answer, and uploaded it to a shared database.
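Conceptually it's something like this sketch (not the actual Warrior code, which also handles rate limiting, work partitioning, and WARC output); it just walks the keyspace and records each redirect target:

  import itertools
  import string
  import requests

  ALPHABET = string.ascii_letters + string.digits

  def probe(code: str):
      # Don't follow the redirect; the Location header is the answer we want.
      resp = requests.get(f"https://goo.gl/{code}", allow_redirects=False, timeout=10)
      if resp.status_code in (301, 302):
          return resp.headers.get("Location")
      return None  # 404, rate-limited, etc.

  # Demo: only a tiny slice of the 6-character space.
  for combo in itertools.islice(itertools.product(ALPHABET, repeat=6), 100):
      code = "".join(combo)
      target = probe(code)
      if target:
          print(code, target)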
It's like when the GPT links that contained sensitive information were archived and made publicly available.
Especially with short links, there's always the possibility of entering ~6 characters and getting a hit. So I believe expecting any secrecy from URLs is silly.
That's like posting your passwords on Twitter because "Why would anyone find my account"
After all, these are just short links. They link to other things on the Internet. Which is inherently public anyways.
You cannot expect privacy via a simple URL. These short URLs are short, hence it's feasible to programmatically scrape all of them.
The GPT links situation is nothing like this, imo. Both, however, do come down to the stupid human aspect.