Ask HN: How to Resurrect a Site from Archive.org?
91 points
26 days ago
| 20 comments
I recently bought the expired domain of a niche interest site because the previous owner was determined to let it die and did not want to put any more effort into it.

Is there a way I can "revive" it from archive.org in a more or less automated fashion? Have you ever encountered anything like it? I am familiar with web scraping, but archive.org has its peculiarities.

I really, really love the content on it.

It's a very niche site, but I would love for it to live on.

duskwuff
25 days ago
[-]
> I recently bought the expired domain of a niche interest site because the previous owner was determined to let it die and did not want to put any more effort into it. Is there a way I can "revive" it from archive.org in a more or less automated fashion?

Buying a domain name does not award you ownership of the content it previously hosted. If you have not come to some agreement with the previous owner, you should not proceed.

reply
aspenmayer
25 days ago
[-]
Well, we can't really assume either way, as OP was vague about how the site was left abandoned. They may have some arrangement that would make this not copyright infringement. In the absence of any affirmative assent in writing reviewed by legal counsel, I'd be inclined to agree with you. Still, I sought to provide the best answer to the question as asked, since the legal issues were outside its scope. The legal issues you raised seem obvious to you and me, and ought also to be so to OP, but we can't make assumptions about the license of the content in question and/or the relevant jurisdiction(s), which may make these points all moot.
reply
moralestapia
24 days ago
[-]
How's that different from the site being hosted at archive.org?
reply
karel-3d
19 days ago
[-]
archive.org's approach to copyright is "look, squirrel".
reply
fastily
19 days ago
[-]
“Fair use”
reply
lhamil64
24 days ago
[-]
What if OP just had the domain redirect to the archive.org page? Then they wouldn't be hosting the content themselves
reply
ksec
19 days ago
[-]
Which is the part that really annoys me. The owner would much rather shut it off than sell it or let others run it, even in archive mode.

I recently learned CGTalk was completely shut down, and ALL the information shared over the past 20 years is gone. It never received attention the way DPReview did. There are plenty of other examples where a forum owner no longer wants the burden of owning it.

It really is a sad state of things.

Is there a site or exchange somewhere where owners could sell their sites, or at least put up a whole archive as an asset?

reply
ulrischa
19 days ago
[-]
I’ve seen a lot of people do this when resurrecting old niche sites. The high-level approach usually involves grabbing all the snapshots from archive.org, stripping out their timestamped URLs, and consolidating everything into a local mirror. In practice, you want to:

1. Collect a list of archived URLs (via archive.org’s CDX endpoints).
2. Download each page and all related assets.
3. Rewrite all links that currently point to `web.archive.org` so they point to your domain or your local file paths.

The tricky part is the Wayback Machine’s directory structure—every file is wrapped in these time-stamped URLs. You’ll need to remove those prefixes, leaving just the original directory layout. There’s no perfect, purely automated solution, because sometimes assets are missing or broken. Be prepared for some manual cleanup.

Beyond that, the process is basically: gather everything, clean up links, restore the original hierarchy, and then host it on your server. Tools exist that partially automate this (for example, some people have written scripts to do the CDX fetching and rewriting), but if you’re comfortable with web scraping logic, you can handle it with a few careful passes. In the end, you’ll have a mostly faithful static snapshot of the old site running under your revived domain.
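
To make that concrete, here is a minimal sketch of the three steps, assuming Python with the `requests` library (untested; `example.com` is a placeholder, and the undocumented `id_` suffix is covered further down the thread):

  import re
  import requests

  CDX = "http://web.archive.org/cdx/search/cdx"

  # Step 1: list archived URLs, keeping only the latest capture per URL.
  rows = requests.get(CDX, params={
      "url": "example.com/*",
      "output": "json",
      "fl": "timestamp,original",
      "filter": "statuscode:200",
      "collapse": "urlkey",
  }).json()[1:]  # the first row is a header

  for timestamp, original in rows:
      # Step 2: the "id_" suffix asks Wayback for the unmodified original
      # document, without the toolbar or rewritten links.
      snapshot = f"http://web.archive.org/web/{timestamp}id_/{original}"
      html = requests.get(snapshot).text
      # Step 3: defensively strip any Wayback prefixes that slipped through.
      html = re.sub(r"https?://web\.archive\.org/web/\d+(?:[a-z]+_)?/", "", html)
      # ...then write `html` to a local path derived from `original`.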

reply
Gualdrapo
19 days ago
[-]
I was commissioned to recover ideawave.ca from archive.org, as its owner had lost the database, so pretty much all that was left of it was on archive.org. I think it was under WordPress, but he asked me to port it to Jekyll.

I scraped its contents (blog posts, pages, etcetera) with Python's BeautifulSoup and redid its styling "by hand", which was not something otherworldly (the site was from around 2010 or so), and I had the chance to make some improvements.

The thing with the scraping was that the connection was lost after a while and it was reaaaaaaaaaally sloooooooooow, so I had to keep a record of the last successfully scraped post/page/whatever and, if something happened, restart from it as a starting point.
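
Conceptually, the resume logic looks something like this (a rough sketch in Python with requests and BeautifulSoup; `post_urls` and the checkpoint filename are illustrative, and the checkpoint lives on disk so it survives a crash):

  import requests
  from bs4 import BeautifulSoup

  CHECKPOINT = "last_done.txt"  # illustrative filename
  post_urls = []                # fill with the archived post URLs

  try:
      last_done = open(CHECKPOINT).read().strip()
  except FileNotFoundError:
      last_done = None

  started = last_done is None
  for url in post_urls:
      if not started:
          started = (url == last_done)  # skip until the checkpoint
          continue
      soup = BeautifulSoup(requests.get(url, timeout=60).text, "html.parser")
      # ...extract the post content and write it out as a Jekyll file...
      with open(CHECKPOINT, "w") as f:
          f.write(url)  # record progress after each success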

Got pennies for it, mostly because I lowballed myself, but got to learn a thing or two.

reply
janesvilleseo
24 days ago
[-]
This is something that used to be done quite a bit in the SEO world. Not sure if it still holds any SEO value. Probably some, but maybe not at the same level.

Anyway, there are tools out there. I haven’t used them.

But a tool like https://www6.waybackmachinedownloader.com/website-downloader...

Or

https://websitedownloader.com/

Should do the trick. Depending on the size of the site, a small cost is involved.

They can even package them into unusable files.

reply
cdr420
19 days ago
[-]
I'm hoping you meant "usable" and not "unusable". Or maybe you did. Funny either way!
reply
moxvallix
24 days ago
[-]
You can use wayback_machine_downloader to automate downloading the archived pages https://github.com/hartator/wayback-machine-downloader/
reply
d3VwsX
24 days ago
[-]
That used to work great for me, but recently it started to fail. It downloads a few pages, but then it gets errors, as if the scraper is being detected and blocked by the server.
reply
mediumsmart
18 days ago
[-]
There is a fix for that. This might point the way: https://github.com/hartator/wayback-machine-downloader/issue...
reply
toomuchtodo
19 days ago
[-]
> as if the scraper is being detected and blocked by the server.

Yes.

reply
latexr
26 days ago
[-]
Have you tried searching for your question online? I found plenty of results.

https://superuser.com/questions/828907/how-to-download-a-web...

reply
aspenmayer
26 days ago
[-]
Specifically:

https://wiki.archiveteam.org/index.php?title=Restoring

which mentions

https://github.com/hartator/wayback-machine-downloader

and also this tip:

> This is undocumented, but if you retrieve a page with id_ after the datecode, you will get the unmodified original document without all the Wayback scripts, header stuff, and link rewriting. This is useful when restoring one page at a time or when writing a tool to retrieve a site:

> http://web.archive.org/web/20051001001126id_/http://www.arch...
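
For example, a one-off fetch in Python with `requests` (a quick sketch; `example.com` stands in for the target site):

  import requests

  # "id_" after the datecode returns the unmodified original document.
  url = "http://web.archive.org/web/20051001001126id_/http://example.com/"
  raw = requests.get(url).text  # no toolbar, no rewritten links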

From the downloader's issues, you may or may not need to use this forked version if you encounter some errors:

https://github.com/hartator/wayback-machine-downloader/issue...

https://github.com/ShiftaDeband/wayback-machine-downloader

reply
captn3m0
19 days ago
[-]
The underscore trick is what I used to revive HgInit (now lives at hginit.github.io). But it wasn’t a lot of pages, so it wasn’t scripted.
reply
01jonny01
18 days ago
[-]
Gosh. No one answers the question directly.

1) Download HTTrack if it's a large website with a lot of pages.
2) Download a search-and-replace program; there are many of them.
3) The search-and-replace program lets you remove the appended web archive URL from the pages in bulk (see the sketch below).
4) Upload to your host.
5) Run the site through a bulk link checker that tests for broken links. There are plenty of them online.
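
If you'd rather script step 3 than use a GUI search-and-replace tool, here's a quick sketch in Python (the `site/` directory and the regex are illustrative):

  import re
  from pathlib import Path

  # Matches prefixes like https://web.archive.org/web/20240101000000id_/
  PREFIX = re.compile(r"https?://web\.archive\.org/web/\d+(?:[a-z]+_)?/")

  for page in Path("site").rglob("*.html"):
      html = page.read_text(encoding="utf-8", errors="ignore")
      page.write_text(PREFIX.sub("", html), encoding="utf-8")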

reply
bagpuss
25 days ago
[-]
Archivarix is the most fully formed, easiest way to do this, and it's free: https://archivarix.com/
reply
aspenmayer
25 days ago
[-]
This site is misrepresenting itself as open source and free, while simultaneously having an affiliate program and pricing page, which, as I've said, isn't free. It's unverifiable whether or not it's open source, as you don't even download/run the software yourself: it's a web app. That's beside the point, as web apps could also be open source, but there's no way to self-host this, let alone download and/or run it, for free or otherwise. I think it's safe to avoid this scammy site.

None of my ire is directed at you, as I don't assume you knew any of this. I just wanted to let you know, in case you were misled by its ad copy as to what the site does.

https://archivarix.com/en/affiliate/

https://archivarix.com/en/#show-prices-wbm

reply
KomoD
25 days ago
[-]
> This site is misrepresenting itself as open source and free

It's not.

They don't say that the site and all the services they offer are free and open-source, they say that the Archivarix CMS is free and open-source (GNU GPLv3), which it is...

> as you don't even download/run the software yourself

You can download the CMS.

> but as there's no way to self-host this, let alone download it and/or run it, for free or otherwise

Again, yes, you can both download and self-host the CMS.

> I think it's safe to avoid this scammy site.

It's scammy because they're not offering everything for free and open-source even though they never said they would?

https://archivarix.com/en/cms/

reply
aspenmayer
25 days ago
[-]
Unless the CMS lets you back up/restore Internet Archive sites, that is literally off-topic, beside the point, and doesn’t make sense given the context of ‘bagpuss’s comment. That ‘bagpuss was vague about what they said was free doesn’t change the context of the discussion.

I stand by what I said, as the Internet Archive feature, which is the entire point of OP’s post, is not free on their platform. The CMS is not relevant to this discussion.

It’s scammy because the kind of people who would use this wouldn’t know how many files are in the backup; they are likely no/low-context users who aren’t familiar with concepts like “average or expected number of files on a website.” The pricing is usurious and exploitative because the pricing model is per file rather than, for example, by file size.

reply
KomoD
25 days ago
[-]
> then that is literally off-topic

It's not, you're claiming they said something they didn't.

> and doesn’t make sense given the context of ‘bagpuss’s comment. That ‘bagpuss was vague about what they said was free doesn’t change the context of the discussion.

I don't care about bagpuss's comment, they don't represent Archivarix as far as I can tell.

You said the site is misrepresenting itself, it's not.

You said the site is claiming things they haven't.

You called the site "scammy" based on something that they never even claimed.

> It’s scammy because the kind of people who would use this wouldn’t know how many files are in the backup

Archive.org tells you how many URLs are saved.

Example: https://web.archive.org/details/https://sweetcode.io

2109+4716+595+732+562+90+28+1+9+1 = 8843 unique URLs.

First file is free. First 1000 files are $0.01 each. Additional thousands are $1 per thousand.

So the price would be $17.84

reply
aspenmayer
25 days ago
[-]
I was responding to ‘bagpuss because they made a claim that the linked site was responsive to OP, and was free. You and I can quibble about what they meant, but a plain reading of their comment implies that the functionality OP asked for was free, because ‘bagpuss never mentioned the CMS; in fact, the CMS seems like a red herring in this discussion entirely.

I do feel that the site misrepresents the value proposition of its Internet Archive backup/restore service, because that value proposition is convenience for users who don’t know that there are actually free, actually open source ways to back up and restore content from the Internet Archive, and that site isn’t it. They’re banking on users not knowing any better, which isn’t unethical per se, buyer beware and all that, but it’s shady.

That, combined with the pricing model, makes it scammy, because you have to spend a minimum of $10 in crypto or another non-reversible payment for something that should not cost the user anything, as the Internet Archive is bearing the lion’s share of the costs. And if it doesn’t do what you needed, you’ve already paid in worthless credits.

https://archivarix.com/en/tutorial/#list-3

> Second example: the big site contains 25,520 files. From this quantity you can deduct 1 because they will be free of charge. So we have 25,519 paid files. First thousand will cost $10, and the rest 24,519 costs only $1 per thousand, therefore $24.519 . Full price for the big site recovery is $34.52!!!

$34.52 is not a reasonable price for this by any means.

That said, I make no claims about whether the site is responsive to OP’s request, as I’m not OP. I simply rejected the claims brought by ‘bagpuss.

reply
KomoD
24 days ago
[-]
> but a plain reading of their comment implies that the functionality that OP asked for was free, because ‘bagpuss never mentioned the CMS, and in fact the CMS seems like a red herring in this discussion entirely.

What I'm saying is YOU said THE SITE was misrepresenting itself when THE SITE isn't. It would've been BAGPUSS that was misrepresenting THE SITE if anyone.

> for something that should not cost the user anything, as the Internet Archive is bearing the lion’s share of the costs.

It's still costing Archivarix money to run the service, yes you are paying for convenience, I see nothing wrong with that whatsoever.

Ideally the Internet Archive should provide an easy way to download sites but they don't.

> $34.52 is not a reasonable price for this by any means.

Why is it not reasonable? They spent time developing this service and it costs money to run, if you want to save money then yeah you can recover it yourself with some open-source software like wayback-machine-downloader, but some people just want to recover sites without having to bother with any of that.

reply
aspenmayer
24 days ago
[-]
> What I'm saying is YOU said THE SITE was misrepresenting itself when THE SITE isn't. It would be BAGPUSS that was misrepresenting THE SITE.

Both of these things can be true: that ‘bagpuss was misrepresenting the site, and that the site is intentionally vague as to what is free and what isn’t, so as to muddy the waters and paint themselves as saviors and good people for being open source, while overcharging for a product to the degree that the site misrepresents itself. I believe both are true.

> Ideally the Internet Archive should provide an easy way to download sites but they don't.

I agree, but that’s not really relevant to our discussion or to ‘bagpuss’s claims.

And if IA did provide an easy way to do that, the site linked would be an even worse deal.

The site is misrepresenting itself as being worth paying for at any price.

Furthermore, you can download an entire site using your web browser’s ‘Save page as’ -> ‘Web page, complete’ dialog in conjunction with the undocumented trick:

> This is undocumented, but if you retrieve a page with id_ after the datecode, you will get the unmodified original document without all the Wayback scripts, header stuff, and link rewriting.

Seems pretty easy to me, but only if you know how. Which is the only reason anyone would use that site - they simply don’t know how bad a deal the site is, or they have more dollars than sense.

reply
KomoD
24 days ago
[-]
> and the site is intentionally vague as to what is free

It's not? It says the CMS is free and open-source and they have prices listed for the paid services they provide.

> and paint themselves as saviors and good people for being open source while overcharging for a product to the degree that the site misrepresents itself, and I believe that they both are true.

Simply saying that something is open-source is you painting yourself as a "savior"?

> And if IA did provide an easy way to do that, the site linked would be an even worse deal.

Obviously, if they did provide it then there would be no reason at all to pay.

> Furthermore, you can download an entire site using your web browser ‘Save page as’ -> ‘web page, complete’ dialog in conjunction with the undocumented trick:

No, not an entire site, just the current HTML document and the accompanying files for it (e.g. scripts, images, etc.). If you want to sit for hours manually doing that for thousands of pages, then feel free.

reply
aspenmayer
24 days ago
[-]
Do you work for the site or something?

> It's not? It says the CMS is free and open-source and they have prices listed for the paid services they provide.

You brought up the CMS. I didn’t. I don’t have any point to defend regarding it. ‘bagpuss was wrong about what they said about the site, and I replied to that.

> Simply saying that something is open-source is you painting yourself as a "savior"?

It’s called marketing.

Are you unfamiliar with what scammy means? The site feels scammy to me. So I said so. I don’t think you can demonstrate that I don’t believe it’s scammy, and you haven’t convinced me either.

> Obviously, if they did provide it then there would be no reason at all to pay.

I don’t have any reason to pay either. ‘bagpuss can defend the scammy site, but I won’t so I agree there’s no reason to pay, for different reasons.

> No, not an entire site, just the current HTML document and the accompanying files for it (e.g. scripts, images, etc.) If you want to sit for hours manually doing that for thousands of pages then feel free.

I have no reason to believe a scammy site will do any better than that either. You haven’t demonstrated that the site even works, and their marketing doesn’t inspire confidence.

As I didn’t introduce the site, I’m not beholden to supporting it or not. Take ‘bagpuss to task if anyone.

I don’t think you know what you’re even arguing about or for, because none of your arguments or claims go anywhere; they all revolve around this scammy site that you didn’t even bring up. Nothing about your argument makes sense.

That you haven’t made any effort to correct ‘bagpuss by replying to them directly is curious.

reply
KomoD
23 days ago
[-]
> Do you work for the site or something?

Nope.

> You brought up the CMS. I didn’t.

Yes, again, it was to explain to you that they only say that the CMS is free, not the services because you said:

> This site is misrepresenting itself as open source and free, while simultaneously having an affiliate program and pricing page, which, as I've said, isn't free

They only said their CMS was open-source and free, not any of their other services.

> I don’t think you know what you’re even arguing about or for because none of your arguments or claims even go anywhere

I was correcting you because you said things that just aren't true:

> This site is misrepresenting itself as open source and free, while simultaneously having an affiliate program and pricing page, which, as I've said, isn't free. It's unverifiable whether or not it's open source, as you don't even download/run the software yourself: it's a web app. That's beside the point, as web apps could also be open source, but there's no way to self-host this, let alone download and/or run it, for free or otherwise. I think it's safe to avoid this scammy site.

Which is just not accurate at all, as I've already explained several times. You can dislike the site all you want but you don't need to slander them.

reply
aspenmayer
23 days ago
[-]
>> to clarify, i have nothing to do with this site, i used it once, years back and there was a free tier or at least a free/crippled version at that time

https://news.ycombinator.com/item?id=42291616

Per ‘bagpuss, the backup was free when they used it, and they were referring to the backup, not the CMS.

So, I would argue you were mistaken.

reply
bagpuss
24 days ago
[-]
> to clarify, i have nothing to do with this site, i used it once, years back and there was a free tier or at least a free/crippled version at that time

posters, enhance your calm

- bagpuss, fat furry cat puss

reply
aspenmayer
24 days ago
[-]
As it’s not free anymore, do you still recommend using it, or do you have a different alternative recommendation in light of it no longer being free?

I appreciate your feedback. Not sure why ‘KomoD is defending the site, but at least you understand that it’s relevant whether it’s free or not.

reply
toast0
24 days ago
[-]
I did this for a niche site, but it was only 20 pages.

I pulled each page off the Internet Archive and saved it as an archive; then I did some minor tidying up: setting viewports for mobile, updating the linkback HTML snippet to point at my URL instead of the old dead one, changing the snippet to not suggest hotloading the link image, cropping the dead URL out of the link image, running pngcrush on the images, and putting it all on cheap hosting for static pages.

I did a bit of poking around trying to find a way to contact the owner, but had no luck. If they come back and want it down, I'll take it down. Copyright notices are intact. I'm clearly violating the author's copyrights, and I accept that.

reply
gopher_space
19 days ago
[-]
> I'm clearly violating the author's copyrights, and I accept that.

I'm looking at combining several old message boards into something useful, and I'd like to be proactive regarding copyright. My approach so far:

- I'm assuming that everyone owns their own post/comment.

- I'm assuming that submitting content meant they intended to grant rights to community members.

- I'm assuming that work done in support of the original community would be welcomed by members.

- And I'm assuming this all changes if I want money.

So I'm preserving attributions when I can, but treating content like it's CC or similar as long as I'm operating within the original author's area of concern. Anything that actually gets released will be as open as possible... and probably start with telling you how to download the files. Entirely walling off my code makes sense, but then it's no longer a fun little project; it's a framework.

reply
Sysreq2
19 days ago
[-]
You could also consider using the Common Crawl dataset, which is hosted on AWS. Archive.org is more or less a wrapper around it anyway.

https://registry.opendata.aws/commoncrawl/
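
Common Crawl's index can be queried much like the Wayback CDX API. A sketch assuming Python with `requests` (the collection name `CC-MAIN-2024-51` is only an example; current collections are listed at index.commoncrawl.org):

  import requests

  resp = requests.get(
      "https://index.commoncrawl.org/CC-MAIN-2024-51-index",
      params={"url": "example.com/*", "output": "json"},
  )
  # One JSON record per capture, pointing at an offset inside a WARC
  # file in the public S3 bucket.
  for line in resp.text.splitlines():
      print(line)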

reply
paxys
24 days ago
[-]
Have you spoken to the previous owner about any of this? Otherwise it's pretty crazy to just take ownership of the site and all its content without a written agreement in place. You are opening yourself up to a massive amount of liability for no reason.
reply
aspenmayer
24 days ago
[-]
I agree with your points, but since the original host is no longer running the site, I doubt they would be much interested in what others do with it, though a lawsuit with a potential payday might motivate them. I broadly agree with you, though.
reply
aoipoa
19 days ago
[-]
This was posted 6 days ago, but it reappeared 4 hours ago. What happened?

https://hn.algolia.com/?q=ask+hn+resurrect+site+archive

Very odd.

Even the times of the comments have changed; this is what the post looked like yesterday:

https://web.archive.org/web/20241205054108/https://news.ycom...

reply
denotational
19 days ago
[-]
HN has a “resubmit” mechanism whereby the mods can resubmit interesting posts if they think they might stimulate more interest by being posted at a different time (or just by having better luck).

To avoid a dupe, this mechanism post-dates the original post.

reply
alsetmusic
19 days ago
[-]
I’ve been thinking about buying a sibling domain (.net instead of .com) to re-host a fantastic essay that disappeared from the web some years back. I would make it clear that I didn’t write it, and I would remove it if the original author contacted me and requested that (it did not include attribution in its original form). But the issue has been enough of a grey area that I haven’t pulled the trigger.

For anyone who may be curious, wayback machine has an archive: fuckthesouth.com

reply
Alifatisk
25 days ago
[-]
HTTrack? You should not do it without the owner's consent, though.
reply
aspenmayer
24 days ago
[-]
Seems legit.

> HTTrack is a free (GPL, libre/free software) and easy-to-use offline browser utility.

Available on Windows, Mac, Linux, and Android.

reply
pabs3
18 days ago
[-]
Unless you are going to continue running the site and have it change, etc., there is no point in doing this, since archive.org already hosts static snapshots of sites.

Depending on the site, you would use different tools; e.g., for MediaWiki/DokuWiki sites you would import the latest database dump from archive.org.

I have used wayback-machine-downloader for completely static sites before:

https://github.com/hartator/wayback-machine-downloader/

reply
donalhunt
25 days ago
[-]
Did this 10+ years ago for a circa-2000 band website (it was a few HTML pages). It was fairly straightforward to achieve. Some content (embedded from third-party websites) was not recoverable.
reply
joshdavham
19 days ago
[-]
Can I ask what site it was? Reading this made me think of a very specific site that I'd also like to see revived and I'm wondering if we're thinking of the same site.
reply
davidjhall
18 days ago
[-]
I have a similar one -- the site just went down a few days ago. A special library....
reply
canU4
19 days ago
[-]
Isn't a simple wget -r enough?
reply
ddgflorida
18 days ago
[-]
Web scraping, but be careful about using copyrighted images.
reply
comboy
19 days ago
[-]

  wget --mirror --convert-links --page-requisites --no-parent URL
But yeah it's also not clear to me regarding copyrights and such.
reply