ArchiveBox is evolving: the future of self-hosted internet archives
636 points
1 day ago
| 27 comments
| docs.sweeting.me
| HN
We've been pushing really hard over the last 6mo to develop this release. I'd love to hear feedback from people who've worked on big plugin systems in the past, or anyone who's tried our betas!
bravura
1 day ago
[-]
@nikisweeting ArchiveBox is awesome and we'd really love it to be more awesome. And sustainable!

I've posted issues and PRs for showstopper issues that took months to get merged in: https://github.com/ArchiveBox/ArchiveBox/issues/991 https://github.com/ArchiveBox/ArchiveBox/pull/1026

You have an opportunity to let the community lean in on ArchiveBox. I understand it's hard to do everything as a solo dev; we've seen many cases in the community where solo devs get burned out or have personal challenges that take priority, etc.

It's hard for us users to lean in on ArchiveBox when, after a happy month of archiving, things start to break and you're left maintaining a branch of your own fixes that aren't in main. Meanwhile, your solution of soliciting one-time donations just makes the whole project feel more rickety and fly-by-night. How about thinking bigger?

We NEED ArchiveBox to be a real thing. Decentralized tooling for archiving is SO IMPORTANT. I care about it and I suspect many people do. I'm posting this so other people who care about it can also comment and chime in and suggest how it can become something we can rely on. Because archiving isn't just about the past, it's about the future.

Maybe it needs to be a dev org of three committed part-time maintainers, funded by grants from a small foundation that people support recurringly? IDK. I'm not an expert at how to make open source resilient. There have been discussions about this in the past, but I think it's worth a serious look because ArchiveBox is IMPORTANT and I want it to work any month I decide to re-activate my interest in it. I invite people to discuss ways to make this valuable project more sustainable and resilient.

reply
nikisweeting
1 day ago
[-]
Let's chat more. I'm almost ready to raise some seed money, hire a second staff dev or find a cofounder, and I'm looking for people who care deeply about the space.

It's only been during the last few months that I decided to go all in on the project, so this is still just the first few pages of a new chapter in the project's history.

(I should also mention that if you're a commercial entity relying on ArchiveBox, you can hire us for dedicated support and uptime guarantees. We have a closed source fork that has a much better test suite and lots of other goodies)

reply
nyx
1 day ago
[-]
It looks like you're doing great work here, thanks a bunch; looking forward to seeing this project develop.

Selling custom integrations, managed instances, white-glove support with an SLA, and so on seems like a reasonable funding model for a project based on an open-source, self-hostable platform. But I'm a little disheartened to read that you're maintaining a closed fork with "goodies" in it.

How do you decide which features (better test suite?) end up in the non-libre, payware fork of your software? If someone contributed a feature to the open-source version that already exists in the payware version, would you allow it to be merged or would you refuse the pull request?

reply
nikisweeting
1 day ago
[-]
The idea with the plugin system is that plugins are just git repos containing <pluginname>/__init__.py, and you can add any set of git repo plugins you want to your instance.

The marketplace will work by showing all git repos tagged with the "archivebox" tag on github.
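
For anyone curious, listing those is basically just a GitHub topic search. A rough sketch in Python of the idea (illustrative only, not the actual marketplace code):

  # Rough illustration only: list GitHub repos tagged with the "archivebox" topic.
  # This is just the general idea, not the real marketplace implementation.
  import json
  import urllib.request

  url = "https://api.github.com/search/repositories?q=topic:archivebox&sort=stars"
  req = urllib.request.Request(url, headers={"Accept": "application/vnd.github+json"})
  with urllib.request.urlopen(req) as resp:
      results = json.load(resp)

  for repo in results["items"]:
      print(repo["full_name"], repo["html_url"], repo["stargazers_count"])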

My approval is only needed for PRs to the archivebox core engine.

More info on free vs paid + reasoning why it's not all open source: https://news.ycombinator.com/item?id=41863539

reply
bigiain
1 day ago
[-]
"I too would like commit access to your promising looking project's git repo and CI/CD pipeline. Thanks, Jia Tan"
reply
giancarlostoro
1 day ago
[-]
Do you guys have a Discord by chance? I have a close friend who is insanely passionate about archiving, he has a personal instance of ArchiveBox, and he is working on a video downloading project as well. He has used it almost every day and archived thousands of news articles over the years. He's aware of a lot of the nuances.
reply
nikisweeting
1 day ago
[-]
We have a Zulip, which is similar to Discord (but self-hosted, and it has better threading): https://zulip.archivebox.io
reply
manofmanysmiles
1 day ago
[-]
I love this project. I "independently" "invented" it in my head the other day, and happy to see it already exists!

I'd love to see blockchain proof/notary support. The ability to say "content matching this hash existed at this time."

I'm exceptionally busy now but that being said, I may choose to contribute nonetheless.

I'd love to connect directly, and will connect to the Zulip instance later.

If we align on values, I may be able to connect you with some cash. People often call me an "anarchist" or "libertarian", though I'm just me, no labels necessary.

reply
nophunphil
1 day ago
[-]
Can you please explain what you mean by “blockchain proof/notary support”?
reply
manofmanysmiles
1 day ago
[-]
Motivation: Have evidence that some content existed at a particular time. For example, let's say a major website publishes an article, and later they remove it, and there is no record of it ever existing. If I host an ArchiveBox, I can look at it and see "Oh, here is that article. Looks like it was published after all." However, why should you believe that I didn't just make it up?

If, when I initially archived it, I computed a cryptographic hash of the content and posted that on a blockchain, then at a future date I can at least claim "As of block N, approximately corresponding to this time UTC, content that hashes to this hash existed."
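
The hashing side of that is trivial, something like this sketch (the "post it to a chain" step is whatever notary/blockchain you pick, and the file path here is made up):

  # Sketch: compute a hash of archived content that could later be anchored
  # on a blockchain / timestamping service as an existence proof.
  # The publish step is omitted; it depends on which chain/notary you use.
  import hashlib
  from pathlib import Path

  def content_digest(path: str) -> str:
      return hashlib.sha256(Path(path).read_bytes()).hexdigest()

  digest = content_digest("archive/example-article/singlefile.html")  # hypothetical path
  print(digest)  # this hex string is what you'd publish along with a timestamp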

If multiple unrelated parties also make the same claim, it is stronger evidence.

Is this sufficient explanation? I can expand on this more later.

reply
jazzyjackson
1 day ago
[-]
There's no reason to believe that the hashed and timestamped content was hosted at a particular domain, however (unless the content was signed by the author of course, then there's no blockchain necessary). Sure, multiple peers could make some attestation that they saw it at that URL, but then you're back at square one of the reputation problem.

The Internet Archive as an institution, with a reputation that holds up in front of a judge, is actually more valuable than a cryptographic proof that x bytes existed at y time.

reply
manofmanysmiles
1 day ago
[-]
No, definitely not. I have no inherent reason to trust the people working at the Internet Archive over, say, a close friend. For me trust is always a human-to-human concept, and no amount of tech or institutions will change that.

The more people I hear making a claim, the more I'm likely to deem the claim(s) as true. This is even true regarding the claims that cryptographic algorithms have the properties that make them useful in these contexts. I say this as someone who has even taken graduate level classes with Ron Rivest.

I'm not sure what will happen in a court. I imagine the more people that start making claims using cryptography as part of the supporting evidence, the more likely people will start to trust cryptography as a useful tool for resolving disputes about the veracity of claims.

So you would not get any value from multiple people making such claims?

reply
jazzyjackson
19 hours ago
[-]
Wow, thanks for sharing your perspective, it's quite different from mine. For me reality is not democratic; the number of people making a claim doesn't influence the truthiness of it.

I bring up judges because Internet archive captures have been used as evidence in court cases, the first one I pulled up [0] makes an interesting distinction on whether the archive's snapshots are merely hearsay:

  The hearsay rule does not apply to the document (so far as it contains the representation) if the representation was made:

  (a)    by a person who had or might reasonably be supposed to have had personal knowledge of the asserted fact; or ...
The archive's office manager submitted an affidavit to the court as someone who would have personal knowledge of the fact that the date and claimed availability of the content are accurate. There's no cryptography involved, just an individual and an institution's reputation - this carries much more weight than any number of anonymous individuals attesting to a cryptographic proof.

[0] https://www.judgments.fedcourt.gov.au/judgments/Judgments/fc...

reply
nikisweeting
1 day ago
[-]
I think the best solution is to have multiple people with reputation attest to the encrypted TLS content without being able to see the cleartext of it; that way they can't easily tamper with it.

See my comments on TLSNotary stuff below...

reply
manofmanysmiles
1 day ago
[-]
Woah, cool, yes, exactly this!

I think I read a paper or blog post about this concept a while ago, but never saw it implemented!

reply
toomuchtodo
1 day ago
[-]
https://github.com/ArchiveTeam/grab-site might be helpful. I'm a fan of the ability to create WARC archives from a target, upload the WARC files to object storage (whether that is IA, S3, Backblaze B2, etc.), and then keep them in cold storage or serve them up via HTTPS or a torrent (mutable, preferred). The Internet Archive serves a torrent file for every item they host; one can do the same with WARC archives to enable a distributed archive. CDX indexes can be used for rapidly querying the underlying WARC archives.

You might support cryptographically signing WARC archives; Wayback is particular about archive provenance and integrity, for example.

https://www.loc.gov/preservation/digital/formats/fdd/fdd0005... ("CDX Internet Archive Index File")

https://www.loc.gov/preservation/digital/formats/fdd/fdd0002... ("WARC, Web ARChive file format")

https://github.com/internetarchive/wayback/tree/master/wayba... ("Wayback CDX Server API - BETA")

reply
nikisweeting
1 day ago
[-]
I recommend Browsertrix for WARC creation, I think they are the best currently available for WARC/WACZ.

ArchiveBox is also gearing up to support real cryptographic signing of archives using https://tlsnotary.org/ in an upcoming plugin (in a way that actually solves the TLS non-repudiation issue, which traditional "signing a WARC" does not, more info: https://www.ndss-symposium.org/wp-content/uploads/2018/02/nd...)

reply
toomuchtodo
1 day ago
[-]
Keep in mind, what signing methodology you use is a function of who accepts it. If I can confirm "ArchiveTeam ripped this", that is superior to whatever tlsnotary is doing with MPC, blockchain, distributed ledger, whatever (in my use case). Have to trust someone at the end of the day. ArchiveTeam's Warrior doesn't use tlsnotary, for example, and rips entire sites just fine.
reply
nikisweeting
1 day ago
[-]
The idea with TLSNotary is that you can have several universities or central agencies running signing servers, but you don't have to share the cleartext content of your archives with them to get it signed.

This dramatically changes what is possible with signing, because previously, to get ArchiveTeam's signature of approval, they would have to see the content themselves to archive it. With TLSNotary they can sign without needing to see the content/access the cookies/etc.

reply
viraptor
1 day ago
[-]
Isn't that already possible with any kind of notary by giving them a sha256 of the content only? Or am I missing some distinction?
reply
nikisweeting
1 day ago
[-]
You can do that but it proves nothing because TLS session keys are symmetric, so the archiver can forge server responses and falsely attest that the server sent them.

Look up "TLS non repudiation"

A real solution like TLSNotary involves a neutral, reputable third party that can't see the cleartext attesting to the cyphertext using a ZK proof.

The neutral third party doing attestation can't see the content so they can't easily tamper with it, and attempts to tamper indiscriminately would be easily detected and ding their reputation.

reply
digitaldragon
1 day ago
[-]
Unfortunately, Browsertrix relies on the Chrome Devtools Protocol, which strips transfer encoding (and possibly transforms the data in other ways). This results in Browsertrix writing noncompliant WARC files, because the spec requires that the original transfer encoding be preserved.
reply
ikreymer
1 day ago
[-]
Unfortunately, there is not much we can do about transfer-encoding, but the data is otherwise exactly as is returned from the browser. Browsertrix uses the browser to create web archives, so users get an accurate representation of what they see in their browser, which is generally what people want from archives.

We do the best we can with a limited standard that is difficult to modify. Archiving is always lossy, we try to reduce that as much as possible, but there are limits. People create web archives because they care about not losing their stuff online, not because they need an accurate record of transfer-encoding property in an HTTP connection. If storing the transfer-encoding is the most important thing, then yes, there are better tools for that.

reply
CorentinB
1 day ago
[-]
You could use a proxy.

"Archiving is always lossy" No.

reply
nikisweeting
1 day ago
[-]
You're talking to the guy who built the best proxy recorder in the archiving industry ;) ikreymer created https://pywb.readthedocs.io/en/latest/

I think he has more context than any of us on the limits of proxy archiving vs browser based archiving.

But also if you really need perfect packet-level replication, just wireshark it as he said. Why bother with WARCs at all?

reply
pabs3
1 day ago
[-]
pywb has WARC issues too, due to use of warcio:

https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem

reply
ikreymer
1 day ago
[-]
Every archiving tool out there makes trade-offs about what is archived and how. No one preserves the raw TLS encrypted H3 traffic because that's not useful. When you browse through an archiving MITM proxy, there are different trade-offs: there's an extra HTTP connection involved (that's not stored), a fake MITM cert, and a downgrade of H2/H3 connection to HTTP/1 (some sites serve different content via H2 vs HTTP/1.1, can detect differences, etc...)

The web is best-effort, and so is archiving the web.

reply
pzmarzly
1 day ago
[-]
Can you recommend some tools to manage mutable torrents? I.e. create them, edit them, download them, and keep the downloads up to date.

BTW I recently tried using IPFS for a mutable public storage bucket and that didn't go well - downloads were very slow compared to torrents, and IPNS update propagation took ages. Perhaps torrents will do the job.

reply
nikisweeting
1 day ago
[-]
My plan is to use a separate control plane for the discovery/announcements of changes, and torrents just for the data transfer. The specifics are still being actively discussed, and it's a few releases away anyway.
reply
Apocryphon
1 day ago
[-]
Man, looks like the first posts about IPFS cropped up on HN a decade ago. I remember seeing Neocities' announcement of support for it. I wonder if that protocol has gotten anywhere since then.
reply
jazzyjackson
1 day ago
[-]
There has been a large effort expended by the Internet Archive to adopt IPFS through their partnership with Filecoin, but IME the basic problems of the protocol remain - slow egress, slow discovery, someone still has to serve the file over a gateway to normie HTTP users...
reply
0cf8612b2e1e
1 day ago
[-]

  The Internet Archive serves a torrent file for every item they host
I had no idea. I have found the IA serving speed to be pretty terrible. Are the torrents any better? Presumably the only ones seeding the files are IA themselves.
reply
toomuchtodo
1 day ago
[-]
The benefit is not in seeding speed directly from IA, but the potential for distributed access and seeding of the item. Think of it as the filename of a zip file in a flat distributed filesystem, with the ability to cherry-pick the files that make up the item via traditional BitTorrent mechanisms. Anyone can consume each item via torrent, continue to seed, and then also access the underlying data. IA acts as the storage system of last resort (and the metadata index).
reply
pabs3
1 day ago
[-]
The torrents have better speeds because they have WebSeeds for multiple IA servers, so you can download from multiple servers at once.
reply
bityard
1 day ago
[-]
So, after reading through the comments and website, I just realized I used ArchiveBox a month or two ago for a very specific purpose.

You see, I inherited a boat.

This boat belonged to my father. He was not materialistic but he took very good care of the things he cared about, and he cared about this boat. It's an old 18' aluminum fishing/cruising boat built in the early 1960s. It's not particularly valuable as a collectible but it is fairly rare and has some unique modifications. I spent a lot of time trying to dig up all of the info that I could on it, but this is one of those situations where most of the companies involved have been gone for decades and most everyone who was around when these were made is either dead or not really on the Internet.

It's a shame that I waited so long to start my research because 10 or 20 years ago, there were quite a few active web forums containing informational/tutorial threads from the proud owners of these old boats. I know because I have seen references to them. Some of the URLs are in archive.org, some are not. But the forums are gone, so a large chunk of knowledge on these boats is too, probably forever.

I did manage to dig up some interesting articles, pictures, and forum threads and needed a way to save them so that they didn't disappear from the web as well. There is probably an easier way to go about it, but in the end I ran ArchiveBox via Docker and set it to fetching what I could find and then downloaded the resulting pages as self-contained HTML pages.

reply
shiroiushi
1 day ago
[-]
>because 10 or 20 years ago, there were quite a few active web forums containing informational/tutorial threads from the proud owners of these old boats. ... But the forums are gone, so a large chunk of knowledge on these boats is too, probably forever.

These days, that kind of info would be locked up in a closed Discord chat somewhere, so you can forget about people 20 years from now ever seeing it.

reply
stavros
1 day ago
[-]
Or people today ever discovering it.
reply
Magnets
1 day ago
[-]
Lots of private groups on facebook too
reply
nfriedly
1 day ago
[-]
I've been using an instance of https://readeck.org/ for personal archives of web pages and I really like it, but I might try out ArchiveBox at some point too.

I also run an instance of ArchiveTeam Warrior which is constantly uploading things to archive.org, and I like the direction ArchiveBox is heading with the distributed/federated archiving on the roadmap, so I may end up setting up an instance like that even if I don't use it for personal content.

reply
venusenvy47
1 day ago
[-]
I've been using the Single File extension to save self-contained html files of pages I want to keep for posterity. I like it because any browser can open the files it creates. Is it easy to view the archive files from readeck? I haven't looked at fancier alternatives to my existing solution.

https://addons.mozilla.org/en-US/firefox/addon/single-file/

reply
nikisweeting
1 day ago
[-]
Singlefile is excellent, Gildas is a great developer. ArchiveBox has had singlefile as one of its extractors built in for years :)
reply
gildas
1 day ago
[-]
Thank you so much Niki :). The P2P sharing is a great idea. I really hope this feature will get things moving in the archiving field.
reply
ninalanyon
1 day ago
[-]
Readeck saves a page as a zip file. It's not hard to open from the command line or file manager, just unzip and launch the index.html in the web browser.

But it strips out a lot of detail. Zipping it also means that it's hard to deduplicate. I use WebScrapBook and run rdfind to hardlink all the identical files.

reply
nfriedly
1 day ago
[-]
I haven't looked at the on-disk format, I just use the browser interface. (It's fairly common for me to save something from my phone that I'll want to review on a computer later.)

Here's an example of an Amazon "review" I recently archived that has instructions for using a USB tester I have: https://readeck.home.nfriedly.com/@b/tCngVjkSFOrCbwb9DnY2yw

And, for comparison, here's the original: https://www.amazon.com/gp/customer-reviews/R3EF0QW6MAJ0VP

It'd be nice if I could edit out the extra junk near the top, but the important bits are all there.

reply
ashildr
1 day ago
[-]
I was about to post a link to the same URL but archived using SingleFile, which looks like the original at Amazon. I didn't, because I realized that I have absolutely no idea what additional information would be hidden in the file. In the worst case, any component sent by Amazon and archived into the file may contain PII, even if I am "logged out".

I'm not saying that SingleFile is bad in any way, I'm using it a lot on multiple devices, but I'm not sure whether sharing archives is a good idea™.

reply
nikisweeting
1 day ago
[-]
100%, this is the challenge of archiving logged in content.

It becomes un-shareable unless we use fake burner accounts for capture, or have really good sanitizing methods.

reply
ashildr
1 day ago
[-]
Even when I'm logged out, I expect at least information on my geographical location to seep into the archive via URLs addressing specific CDN endpoints or similar mechanisms.
reply
nikisweeting
1 day ago
[-]
Yup, this is why the ArchiveBox browser extension sends URLs to a separate server for archiving with an isolated burner profile.

I should write a full article on the security implications at some point, there aren't many good top-down explanations of why this is a hard problem.

reply
ashildr
1 day ago
[-]
I know it’s a lot of work but this would be great and it may give readers a deeper understanding into security in general.
reply
ninalanyon
1 day ago
[-]
How does it save pages that are only available when you are logged in such as social networking pages?
reply
nikisweeting
21 hours ago
[-]
You set up a chrome profile for archiving that's logged into all the sites you want to save. I recommend using burner accounts dedicated to archiving, so you'd have to add them to any private pages/groups you want to archive.

It is possible to use your main account for archiving but there are security risks (you can't share the snapshots without leaking session headers).

reply
pbronez
2 hours ago
[-]
That's a very cool solution: it gives the user explicit control.
reply
nikisweeting
1 day ago
[-]
I love ArchiveTeam warrior, it's such a good idea! We run several instances ourselves, and it's part of our Good Karma Kit for computers with spare capacity: https://github.com/ArchiveBox/good-karma-kit

There are a bunch of other alternatives like Readeck listed on our wiki too, we encourage people to check it out!

https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...

reply
ninalanyon
1 day ago
[-]
I've just tried Readeck and it doesn't save a good quality copy of the pages using the Firefox extension. SingleFile and WebScrapBook do a much better job.

I prefer WebScrapBook because it saves all the assets as files under the original names in a directory rather than a zip file. This means that I can use other tools such as find, grep, and file managers like Nemo to search the archive without needing to rely on the application that saved the page.

reply
404mm
1 day ago
[-]
Somewhat similar topic: does anyone have recommendations for a self-hosted website change monitoring system? I've been running Huginn for many years and it works well; however, I have a feeling the project is on its last legs. Also, it's based on text scraping (XPath/CSS/HTML and RSS), but it struggles with newer JS-based sites.
reply
pabs3
1 day ago
[-]
I recommend urlwatch, you run it from a terminal and send the output wherever you want, such as email via cron.

https://thp.io/2008/urlwatch/

reply
nikisweeting
1 day ago
[-]
Changedetection.io
reply
404mm
1 day ago
[-]
Thank you! That looks great!
reply
arminiusreturns
1 day ago
[-]
Why do you feel like Huginn is on its last legs? It's been on my list of things to play with for years now, but I never got around to it...
reply
404mm
1 day ago
[-]
It looks like it’s being maintained by a single remaining developer. No new features are being added, just some basic maintenance. The product as a whole still works well, so unless you find something better, I do recommend it. I run it in k3s and the image is probably the easiest way of maintaining it.
reply
favorited
1 day ago
[-]
As someone who was archiving a doomed website earlier today using wget, I was reminded that I really need to get ArchiveBox working...

I used to rely on my Pinboard subscription, but apparently archive exports haven't worked for years, so those days are over.

reply
VTimofeenko
1 day ago
[-]
I recently found omnivore.app through HN comments -- works great for sharing a reading list across machines. I am exporting articles through Obsidian, but there is an API option. I don't think it supports outbound RSS, but they have inbound RSS (i.e. Omnivore as an RSS reader) in beta.
reply
nikisweeting
1 day ago
[-]
Pocket also doesn't offer archived page exports (or even RSS export). I feel like both are really dropping the ball in this area!
reply
pronoiac
1 day ago
[-]
Oh, writing my own Pinboard archive exporter is somewhere on my too-long to-do list. I should find out what would be good for importing into Archivebox. (WARC?)
reply
rcarmo
1 day ago
[-]
This is nice. I'm actually much more excited about the REST API (which will let me do searches and pull information out, I hope) than the plugin ecosystem, since the last thing I need is for another tool to have a half-baked LLM integration -- I prefer to do that myself and have full control.

Being able to do RAG on my ArchiveBox is something that I have very much wanted to do for over a year now, and it might finally be within reach without my going and hacking at the archived content tree...

Edit: Just looked at the API schema at https://demo.archivebox.io/api/v1/docs.

No dedicated search endpoint? This looks like a HUGE missed opportunity. I was hoping to be able to query an FTS index on the SQLite database... Have I missed something?

reply
nikisweeting
1 day ago
[-]
The /cli/list endpoint is the search endpoint you're looking for. It provides FTS but I can make it clearer in the docs, thanks for the tip.
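
Roughly something like this (the parameter and auth header names in this sketch are from memory and may be off, check /api/v1/docs on your instance for the exact schema):

  # Sketch of querying the /api/v1/cli/list endpoint for full-text search.
  # Parameter and auth header names are assumptions; see /api/v1/docs.
  import requests

  ARCHIVEBOX_URL = "http://localhost:8000"  # your instance
  resp = requests.get(
      f"{ARCHIVEBOX_URL}/api/v1/cli/list",
      params={"filter_patterns": "kubernetes", "filter_type": "search"},  # names assumed
      headers={"Authorization": "Bearer <api-key>"},                      # auth scheme assumed
  )
  resp.raise_for_status()
  print(resp.json())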

As for the AI stuff don't worry, none of it is touching core, it's all in an optional community plugin only for those who want it.

I'm not personally a huge AI person but I have clients who are already using it and getting massive value from it, so it's worth mentioning. (They're doing some automated QA on thousands of collected captures and feeding the results into spreadsheets.)

reply
rcarmo
1 day ago
[-]
Thanks, I'll have a look.

My use for this is very different--I want to be able to use a specific subset of my archived pages (which is mostly reference documentation) to "chat" with, providing different LLM prompts depending on subset and fetching plaintext chunks as reference info for the LLM to summarize (and point me back to the archived pages if I need more info).

reply
nikisweeting
1 day ago
[-]
Ok that makes sense, I think archivebox works as the first step in a pipeline there, with some other tool doing the LLM analysis and query stuff.
reply
rcarmo
1 day ago
[-]
Yep. That's what I've built for myself, I just can't really get at the data inside ArchiveBox until I upgrade.
reply
pbronez
1 hour ago
[-]
How did you build it?

I can imagine an architecture where I throw everything into ArchiveBox, then run VectorDB as a plugin with Gradio or some such as the client.

https://vectordb.com/

reply
sunshine-o
1 day ago
[-]
I have been using ArchiveBox recently and love it.

About search, one thing I haven't yet figured out how to do easily is to plug it to my SearXNG instance as they only seem to support Elasticsearch, Meilisearch or Solr [0]

So this new plugin architecture will allow for a meilisearch plugin I guess (with relevancy ranking).

- [0] https://docs.searxng.org/dev/engines/offline/search-indexer-...

reply
nikisweeting
1 day ago
[-]
Definitely doable! Search plugins are one of the first that I implemented.

We already provide Sonic, ripgrep, and SQLiteFTS as plugins, so adding something like Solr should be straightforward.

Check out the existing plugins to see how it's done: https://github.com/ArchiveBox/ArchiveBox/pull/1534/files?fil...

archivebox/plugins_search/sonic/*

reply
orblivion
1 day ago
[-]
Have you (and I wonder the same about archive.org) considered making a Merkle tree of the data that gets archived? Since data (including photos and videos) are getting easier to fake, it may be nice to have a provable record that at least a certain version of the data existed at a certain time. It would be most useful in case of some sort of oppressive regime down the line that wants to edit history. You'd want to publish the tip somewhere that records the time, and a blockchain seems to make the most sense to me but maybe you don't like blockchains.
reply
nikisweeting
1 day ago
[-]
Yup, already doing that in the betas. That's what I'm referring to as the beginnings of a "content addressable store" in the article.

In the closed source fork we currently store a merkle tree summary of each dir in a dotfile containing the sha256 and blake3 hash of all entries / subdirs. When a result is "sealed" the summary is generated, and the final salted hash can be submitted to Solana or ETH or some other network to attest to the time of capture and the content. (That part is coming via a plugin later)
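
The general shape of that summary is roughly this (illustrative sketch only, not the actual implementation; sha256 only, no blake3 or salting shown):

  # Illustrative sketch of a merkle-style directory summary (not the real
  # ArchiveBox code): hash every file, then hash the sorted list of child
  # names + hashes to get a single digest per directory, recursively.
  import hashlib
  from pathlib import Path

  def dir_digest(path: Path) -> str:
      entries = []
      for child in sorted(path.iterdir()):
          if child.is_dir():
              entries.append(f"{child.name}:{dir_digest(child)}")
          else:
              entries.append(f"{child.name}:{hashlib.sha256(child.read_bytes()).hexdigest()}")
      return hashlib.sha256("\n".join(entries).encode()).hexdigest()

  print(dir_digest(Path("archive/1698765432.0")))  # hypothetical snapshot dir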

reply
zvr
1 day ago
[-]
You might be interested in taking a look at SWHID (Software Hash IDentifiers), which defines a way (on its way to become an ISO standard) to reference files and directories with content-based identifiers, like swh:1:dir:d198bc9d7a6bcf6db04f476d29314f157507d505. Yes, it uses Merkle trees for filesystem hierarchy. https://www.swhid.org/specification/v1.1/5.Core_identifiers/
reply
orblivion
1 day ago
[-]
Wow that's great!
reply
beefnugs
1 day ago
[-]
Not just all that nonsense, but it also makes a lot of sense to share just the parts of a website that matter, like a single video, without having to download an entire archive or the rest of the site.
reply
nikisweeting
1 day ago
[-]
  $ archivebox add --extractor=media,readability https://...

We try to make that easy by allowing ppl to select one or more specific ArchiveBox extractors when adding, so you don't have to archive everything every time.

Makes it more useful for scraping in a pipeline with some other tools.

reply
pabs3
1 day ago
[-]
Unfortunately ArchiveBox uses wget, so it produces non-standard WARC files. Sadly there are lots of things like this in the WARC ecosystem.

https://wiki.archiveteam.org/index.php/The_WARC_Ecosystem

reply
nikisweeting
1 day ago
[-]
Yes, this is true currently. If you need nice WARCs I recommend Browsertrix by our friends at Webrecorder instead.

It's on my roadmap to improve this eventually, but currently I'm focused on saving raw files to a filesystem, because it's more accessible to most users and easier to pipe into other tools.

I encourage people to use ZFS to do deduping and compression at the filesystem layer.

reply
TheTechRobo
1 day ago
[-]
Browsertrix (and Webrecorder tools in general) also violate the standard by modifying response data. It's supposed to be the raw bytes as they are sent over the network (minus TLS).

The entire WARC ecosystem is kind of a mess.

reply
ikreymer
1 day ago
[-]
This isn't really true, our tools do not just modify response data for no reason!

Our tools do the best that we can with an old format that is in use by many institutions. The WARC format does not account for H2/H3 data, which is used by most sites nowadays.

The goal of our (Webrecorder) tools is to preserve interactive web content with as much fidelity as possible and make it accessible/viewable in the browser. That means stripping TLS, H2/H3, sometimes forcing a certain video resolution, etc., while preserving the authenticity and interactivity of the site. It can be a tricky balance.

If the goal is to preserve 'raw bytes sent over the network' you can use Wireshark / packet capture, but your archive won't necessarily be useful to a human.

reply
CorentinB
1 day ago
[-]
He didn't say you modify the data for no reason, he said you violate the standard. Which is true. You could respect it, but you don't.
reply
nikisweeting
1 day ago
[-]
imo the Webrecorder stuff is truly state of the art, if they're pushing the limits of WARC standards it's for good reason, and I trust their judgement. They pioneered the newer WACZ standard and are really pushing the whole field forward.
reply
grinch5751
1 day ago
[-]
This looks like a really wonderful set of developments. Already making plans to use an old laptop of mine as an ArchiveBox machine.
reply
wongarsu
1 day ago
[-]
Does this mean it's now possible to write plugins that dismiss cookie popups, solve captchas, scroll web pages etc.?
reply
nikisweeting
1 day ago
[-]
I have a private plugin with puppeteer support for stuff like this, currently charging clients money to use it to fund the open source development. The clients are people who are already legally allowed to evade CAPTCHAs (e.g. governments, NGOs doing research, lawyers collecting evidence, etc.)

Unfortunately I can't open source the CAPTCHA solving stuff myself, because it opens me up to liability, but if someone wants to contribute a plugin to the ecosystem I can't stop them ;).

reply
0x1ch
1 day ago
[-]
Legally allowed to evade CAPTCHAs? LOL.

What world do we live in where evading a captcha is an illegal offense?

reply
nikisweeting
1 day ago
[-]
It doesn't matter whether or not it's actually legal, what matters is that the big platforms will sue you for trying, so you need a big bankroll to stand your ground.

At the very least they can bar you from accessing their sites as you're violating ToS that you accept upon signup.

reply
xiconfjs
1 day ago
[-]
You mean ArchiveBox still doesn't deal with cookie popups? If so, it's practically useless for EU-based websites.
reply
nikisweeting
1 day ago
[-]
It does, you just have to set up a chrome profile that has an extension to hide cookie popups, or use a profile where you've already accepted/closed them and have a session.

You can archive with any chrome profile with arbitrary extensions enabled, so you can use uBlock, I still Don't care about cookies, Ghostery, etc.

reply
ajvs
1 day ago
[-]
How do you set this up? I found this relevant issue[1] but it doesn't explain how to get it working.

[1] https://github.com/ArchiveBox/ArchiveBox/issues/211

reply
sagz
1 day ago
[-]
Do y'all support archiving pages that are behind logins? Like using browser cookies?
reply
markerz
1 day ago
[-]
Yes, but there's security concerns where you might accidentally leak your credentials / cookies if you publish your archive to the public.

https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overv...

PS. I'm an archivebox user, not a dev or maintainer.

reply
nikisweeting
1 day ago
[-]
Yes, this is correct, with plans to make this easier in the near future via a setup wizard that guides you through creating dedicated credentials for archiving.
reply
millvalleydev
1 day ago
[-]
For devs like us, ArchiveBox or browsertrix-crawler? For scraping entire sites for our own uses, maybe to keep content that's behind paywalls while we have subscriptions, or maybe to feed it to local LLMs to query?
reply
nikisweeting
1 day ago
[-]
For scraping entire sites, Browsertrix is currently more suited until we add full-depth recursive crawling in v0.9. For feeding to LLMs, ArchiveBox MIGHT BE better (imho) because we extract the raw content and you likely don't need the whole WARC.
reply
dewey
1 day ago
[-]
I've tried to get started with ArchiveBox many times, but it was always quite buggy (Not working in Safari, a bit clunky to run,...) but I've noticed a lot of updates in the past months so I'm excited about it moving forward and giving it another shot.
reply
joeross
1 day ago
[-]
I have no programming skill at all and I don’t know a ton about ArchiveBox except I set it up and ran it for myself for a while, so I’m asking as an innocent, ignorant and curious geek, but is this something that could be adapted to peer to peer distribution or some other means of making it simultaneously as private and local as you want it and as distributed and bulletproof, uptime wise, as possible?
reply
agnishom
1 day ago
[-]
This is interesting. I personally use Omnivore + backup to Obsidian for this purpose.
reply
petertodd
1 day ago
[-]
You really should add timestamping to ArchiveBox. The easiest way to do that would be via my OpenTimestamps protocol, https://opentimestamps.org It's open source and free to use, and uses Bitcoin for the actual timestamps. Users of it do not need to make Bitcoin transactions themselves as a set of community calendar servers do that for you. You also don't need a Bitcoin node to create an OTS timestamp, and you can validate an OTS timestamp without a Bitcoin node as well by trusting someone else to do that for you.

The big thing that ArchiveBox can't do, and the Internet Archive can, is attest to the accuracy of the archive. Being at least able to prove that the archive was created in the past, prior to there being a reason to tamper with it, is the best we can realistically do with current cryptography. So it'd be really good if support for timestamping was added.

IIUC ArchiveBox is written in Python; OTS has a Python library that should work fine for you: https://github.com/opentimestamps/python-opentimestamps

reply
nikisweeting
1 day ago
[-]
We're going to add TLSNotary support for real cryptographic signing, see my comments below :)

Timestamping is also on my roadmap, definitely as a plugin (and likely paid) as it's more corporate users that really need it. We need to keep some of the really advanced attestation features paid to be able to support the rest of the business.

reply
petertodd
1 day ago
[-]
> We're going to add TLSNotary support for real cryptographic signing, see my comments below :)

Last I checked TLSNotary requires a trusted third party. I would strongly suggest timestamping TLSNotary evidence, to be able to prove that evidence was created prior to any of these trusted third parties being compromised.

reply
nikisweeting
1 day ago
[-]
Of course, TLSNotary stuff would necessarily come with a whole ecosystem, including some sort of transparency log like certificate transparency logs, DNS record keeping, timestamping, etc.

But we'll start with the basics and work our way up to completeness.

reply
mikae1
1 day ago
[-]
Thanks for the box!

Any examples of other possible really advanced features that might go for-pay?

Is there any chance you will make current free features for-pay? That'd be rather off-putting for me as a home user.

reply
nikisweeting
1 day ago
[-]
No, everything currently free will stay free.

The paid stuff currently is:

- per-user permissions & groups

- audit logging

- auto CAPTCHA solving

- burner credential management for FB/Insta/Twitter/etc. w/ auto phone based account verification ability

- custom JS scripts for expanding comments, hiding pop ups, etc.

- managed hosting + support

Some of this stuff ^ is going to become free in upcoming releases, some will stay paid. What I decide to make free is mostly based on abuse potential and legal ramifications; I'd rather have a say in how the risky stuff is used so that it doesn't become a tool weaponized for botting.

reply
mikae1
1 day ago
[-]
Thanks for the clarification and thanks again for the great work!
reply
jasonfarnon
1 day ago
[-]
I always wonder about this when someone gets in hot water based on something on the wayback machine and the person says the archive was tampered with. Can you elaborate on "prove that the archive was created in the past, prior to there being a reason to tamper it"? What exactly does opentimestamps certify?
reply
nikisweeting
1 day ago
[-]
OpenTimestamps alone can not currently prove anything because TLS session keys are symmetric. The client can forge anything and attest to it falsely. Unless you 100% trust the archiver (in which case you can trust their timestamps), you need TLSNotary or another reputable third party in the loop as a bare minimum.

But more critically: currently the legal standard for evidence is... screenshots. We have a lot of educating work to do before the public understands the value of attestation and signing.

reply
petertodd
1 day ago
[-]
> OpenTimestamps alone can not currently prove anything because TLS session keys are symmetric.

Timestamps can prove that the data existed prior to there being a known reason to modify it. While that's not as good as direct signing, that's often still enough to be very useful. The statement that OTS "can not currently prove anything" is incorrect.

A really good example of this is the Hunter Biden email verification. I used OpenTimestamps to prove that the DKIM key that signed the email was in fact used by Google at the time, by providing a Google-signed email that had been timestamped years ago: https://github.com/robertdavidgraham/hunter-dkim/tree/main/o...

That's convincing evidence, because it's highly implausible that I would have been working to fake Hunter's emails years before they even came up as an election issue.

reply
nikisweeting
1 day ago
[-]
Ok, fair point, they prove that content existed at some point in time, which is useful sometimes. But I don't want people to over-rely on that as "good enough", we can do much better, it's too low a bar for a whole ecosystem of archiving to rely on when we now have a viable solution to fix it (TLSNotary or others).
reply
treyd
1 day ago
[-]
Is this a project that could be developed to support a distributed mirror of archive.org similar to how Anna's Archive works?
reply
nikisweeting
1 day ago
[-]
Yeah, that's what we're aiming for eventually, but with the addition of fine-grained permissions controls so you don't have to share everything 100% publicly, you can choose a subset.

https://github.com/ArchiveBox/ArchiveBox/wiki/Roadmap

reply
chillfox
1 day ago
[-]
Awesome, I am really looking forward to the new api and plugins.

I have been running an instance for almost 2 years now that I use for archiving articles that I reference in my notes.

reply
newman314
1 day ago
[-]
@nikisweeting Is abx-dl already available or is it coming? I took a quick dive and didn't see a repo under the org.

I'm happy to help package this up once it is available.

reply
nikisweeting
1 day ago
[-]
Not currently available, it should be out soon after v0.9 is released.

Currently `mkdir tmp_data && cd tmp_data; archivebox install; archivebox add ...` is effectively equivalent to what `abx-dl` will do.

reply
rodolphoarruda
1 day ago
[-]
> "In an era where fear of public scrutiny is very tangible, people are afraid of archiving things for eternity. As a result, people choose not to archive at all, effectively erasing that history forever."

Really? I don't get that feeling at all. I use Evernote to archive anything I consider worth keeping. I wonder where such "fear of archiving" comes from.

reply
nikisweeting
1 day ago
[-]
A lot of people are retreating off public free-for-all platforms like Twitter to more siloed spaces like Discord, for many reasons, not just fear of archiving.

It all has the same effect of making it harder to archive though.

reply
hooverd
21 hours ago
[-]
Does anyone have recommendations for the hardware side of self-hosting something like that? How do I avoid bit rot?

Also I see django-ninja! Very cool.

reply
A4ET8a8uTh0
1 day ago
[-]
Those additions are welcome, but if I could request one feature (and one that is very consistently requested):

- backing up an entire page

Yes, it is hard. Yes, for non-pure-HTML pages it is extra painful, but that would honestly take ArchiveBox from nice-to-have to: yes, I have an actual archive I can use when stuff goes down.

reply
nikisweeting
1 day ago
[-]
Do you mean backing up an entire domain? Like example.com/*

If so that's starting to roll out in v0.8.5rc50, check out the archivebox/crawls/ folder.

If you mean archiving a single page more thoroughly, what do you find is missing in Archivebox? Are you able to get singlefile/chrome/wget html when archiving?

reply
A4ET8a8uTh0
1 day ago
[-]
Edit: The first option. ( previous stuff removed )

Lemme check my current version ( edit: 0.7.2 -- ty, I will update and test soon :D)

reply
nikisweeting
1 day ago
[-]
Ah ok. One caveat: it's only available via the 'archivebox shell' / Python API currently, the CLI & web UIs for full depth crawling will come later.

You can play around with the models and tasks, but I would wait a few weeks for it to stabilize and check again, it's still under heavy active development

Check archivebox/archivebox:dev periodically

reply
A4ET8a8uTh0
1 day ago
[-]
No worries. I can do that.

You guys probably hear it all the time, but you are doing the lord's work. If I thought I could be of use in that project, I would be trying to contribute myself (in fact, let me see if there is a way I can participate in a useful manner).

reply
nikisweeting
1 day ago
[-]
Thanks! I love working on archiving so far, and it's been very motivating to see more and more people getting into archiving lately.
reply
dark-star
1 day ago
[-]
Some time ago I installed ArchiveBox on a RaspberryPi 4 running k3s (a lightweight Kubernetes distro).

I have documented that here: https://darkstar.github.io/2022/02/07/k3s-on-raspberrypi-at-...

Note that this was a rather old version and some things have probably changed compared to now, so YMMV, but it might still provide a good reference for those who want to try it.

reply
nikisweeting
1 day ago
[-]
Thanks for making that tutorial!

Happy to report that most of the quirks you cover have been improved:

- uid 999 is no longer enforced, you can pass any PUID:PGID now (like Linuxserver.io containers)

- it now accepts ADMIN_USERNAME + ADMIN_PASSWORD env vars to create an initial admin user on first start without having to exec

- archivebox/archivebox:latest is 0.7.2 (yearly stable release) and :dev is the 0.8.x pre-release, updated daily. All images are amd64 & arm64 compatible.

- singlefile and sonic are now included in all images & available on all platforms amd64/arm64

reply
dark-star
1 day ago
[-]
yeah I really need to update that guide. Since I published it I have updated ArchiveBox locally to a newer version but never bothered to update the guide :)
reply
Acrobatic_Road
1 day ago
[-]
The subline mentions "Auto-login", but the article never elaborates on this. Does this mean we will be able to more easily archive non-public websites?

Also, how do you plan to ensure data authenticity across a distributed archive? For example, if I archive someone's blog, what is stopping me from inserting inflammatory posts that they never wrote, and passing them off as the real deal? Slight update: I see you're using TLS Notary! That's exactly what I would have suggested!

reply
nikisweeting
1 day ago
[-]
Auto-login is currently a service I provide for paying clients, and you can do it in the open source version manually with some extra config.

Working hard on making it more accessible in the future, and plugins should help!

reply
FiniteField
1 day ago
[-]
Disappointing that a project that should ostensibly care about preserving the open, non-centralised internet takes the time to namedrop and talk about making "compromises" against preserving a well-known, medium-sized clearnet forum legally operated by a US-based LLC. Still-living independent forum sites in this day and age have an unrivalled SNR of actual human-to-human communication; there should be no better candidate for archival. It's sad that a self-hosted archival tool has to apologise for any "evil" content it might be used for in the first place. Tape recorders do not require a disclaimer about people saying "hate speech" into them.
reply
nikisweeting
1 day ago
[-]
Sorry which medium sized forum are you referring to?

I love forums and want them to continue, I'm not sure where you got the idea that I dislike them as a medium. I was just pointing out that public sites in general have started to see some attrition lately for a variety of reasons, and the tooling needs to keep up with new mediums as they appear.

I also make no apology for the content, in fact ArchiveBox is explicitly designed to archive the most vile stuff for lawyers and governments to use for long term storage or evidence collection. One of our first prospective clients was the UN wanting to use it to document Syrian war crimes. The point there was that we can save stuff without amplifying it, and that's sometimes useful in niche scenarios.

Lawyers/LE especially don't want to broadcast to the world (or tip off their suspect) that they are investigating or endorsing a particular person, so the ability to capture without publicly announcing/mirroring every capture is vital.

reply
dark-star
1 day ago
[-]
I guess he's talking about K_wi F_rms which was mentioned in one of the screenshots...
reply
71bw
5 hours ago
[-]
It's just a forum like any other and yet you're acting like it's, at least, the Devil 2.0.
reply
nikisweeting
1 day ago
[-]
Ahh that makes sense. Well all I can say to that is that it's not up to me what's evil. The point I was trying to make is: sometimes you want to archive something that you don't endorse / don't want to be publicly linked.

You might not want to amplify and broadcast the fact that you're archiving it to the world.

reply
the_gorilla
1 day ago
[-]
I don't know how anyone manages to use archivebox. I've tried it twice in the last 3 years and its site compatibility is bad, it quietly leaks everything you archive to archive.org by default, and whenever it fails on a download it stops archiving anything even after deleting and resubmitting all the jobs.

I'm sure it works for some people, but not me.

reply
nikisweeting
1 day ago
[-]
These are legitimate gripes that have plagued specific past releases, I hear your frustration. Please keep in mind this was a solo effort of a single developer, only worked on in my spare time over the last 7 years (up until very recently).

The new v0.8 adds a BG queue specifically to deal with the issue of stalling when some sites fail. There was a system to do this in the past, but it was imperfect and mostly optimized for the docker setup where a scheduler is running `archivebox update` every few hours to retry failed URLs.

Site compatibility is much improved with the new BETA, but it's a perpetual cat-and-mouse game to fix specific sites, which is why we think the new plugin system is the way forward. It's just not sustainable for a single company (really just me right now) to maintain hundreds of workarounds for each individual site. I'm also discussing with the Webrecorder and Archive.org teams how we can share these site-specific workarounds as cross-compatible plugins (aka "behaviors") between our various software.

> it quietly leaks everything you archive to archive.org by default

It's prominently mentioned many times (at least 4) on our homepage that this is the default, and archiving public-only sites (which are already fair game for Archive.org) is a default for good reason. Archiving private content requires several important changes and security considerations. More context: https://news.ycombinator.com/item?id=26866689

reply
freedomben
1 day ago
[-]
Yeah, I'm not sure whether archive.org should be defaulted to on or off (I see both sides of that one), but its existence is definitely surfaced.

I love Archive Box btw, thank you for your effort! It's filling a very important need.

reply
the_gorilla
1 day ago
[-]
I can accept the other issues, but archivebox needs be private and secure by default.

Sending everything to archive.org is a bad default and it erodes a certain level of trust in the project. Requiring "several important changes and security considerations" just makes it a non-starter. The default settings should be "safe" for the default user, because as you mentioned in that post, 90% of users are never going to change them. Users should be able to run it locally and archive data without worrying about security issues, unless you only want experts to be able to use your software.

There's also a contradiction between your statement and your blog post: someone saving their photos isn't going to want to worry about whether they configured your tool correctly, or about leaking all the group logs or grandma's photos.

>It's prominently mentioned many times (at least 4) on our homepage that this is the default, and archiving public-only sites (which are already fair game for Archive.org) is a default for good reason. Archiving private content requires several important changes and security considerations. More context

> Who cares about saving stuff?

> All of us have content that we care about, that we want to see preserved, but privately:

> families might want to preserve their photo albums off Facebook, Flickr, Instagram

> individuals might want to save their bookmarks, social feeds, or chats from Signal/Discord

> companies might want to save their internal documents, old sites, competitor analyses, etc.

I want the project to do well but it really needs to be secure by default.

reply
nikisweeting
1 day ago
[-]
> The default settings should be "safe" for the default user,

I 100% agree, but because private archiving is doable but NOT 100% safe yet, I can't make that mode the default. The difficult reality currently is that archiving anything non-public is not simple to make safe.

Every capture will contain reflected session cookies, usernames, and PII, and other sensitive content. People don't understand that this means if they share a snapshot of one page they're potentially leaking their login credentials for an entire site.

It is possible to do safely, and we provide ways to achieve that which I'm constantly working on improving, but until it's easy and straightforward and doesn't require any user education on security implications, I can't make it the default.

The goal is to get it to the point where it CAN be the default, but I'm still at least 6mo away from that point. Check out the archivebox/sessions dir in the source code for a look at the development happening here.

Until then, it requires some user education and setting up a dedicated chrome profile + cookies + tweaking config to do. (as an intentional barrier to entry for private archiving)

reply
arboles
19 hours ago
[-]
I don't think it's possible to remove information about yourself from a webpage before you share it. It's always possible to have crafted a website that sneaks reflected session information or the ArchiveBox instance's IP address into the main content. This can be a real response:

> And that was this week's newsletter! Congratulation for reading to the bottom, dear 198.51.100.1.

Even if the archivebox instance noted its own IP to do a search-and-replace like s|198\.51\.100\.1|XXX.XXX.XXX.XXX| on the snapshot it is about to create, it's possible to craft a response that obscures the presence of the information, such as by encoding the IP like this: MTk4LjUxLjEwMC4xCg==. I.e. steganography (https://en.wikipedia.org/wiki/Steganography).
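
To make that concrete, a naive scrub catches the literal IP but sails right past the encoded copy (toy example):

  # Toy example: a literal search-and-replace scrub removes the plain IP but
  # misses the same IP hidden in base64 elsewhere in the page.
  import base64
  import re

  ip = "198.51.100.1"
  page = (
      "Congratulations for reading to the bottom, dear 198.51.100.1.\n"
      "<!-- tracking token: MTk4LjUxLjEwMC4xCg== -->\n"
  )

  scrubbed = re.sub(re.escape(ip), "XXX.XXX.XXX.XXX", page)
  print(ip in scrubbed)                                               # False: literal copy gone
  print(base64.b64encode((ip + "\n").encode()).decode() in scrubbed)  # True: encoded copy survives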

Being able to anonymize archives before sharing them is something I would find interesting, but I don't think you can beat steganography, so I'm wondering what exactly you plan to do.

reply
bigiain
1 day ago
[-]
That's a really good response, thanks.

I've been very impressed by all of your responses in here, but that one in particular shows empathy, compassion, and a deep deep subject matter expertise.

reply
nikisweeting
1 day ago
[-]
Thank you. And thank you for taking the time to read all of it, there's a lot of great questions being asked.
reply
Apocryphon
1 day ago
[-]
Perhaps this data is "private" as in "personal property" and not "private" as in "confidential."
reply
nikisweeting
1 day ago
[-]
It's intended for both but it currently requires extra setup to do "confidential" because there are security risks.
reply
hobs
1 day ago
[-]
As a custom tool built to archive stuff for archive.org, why would you expect that it can also do a completely opposite task, saving information privately?

I can see why you would want such a tool, but it seems like a direct divergence from the core goal of the existing codebase.

reply
arboles
18 hours ago
[-]
> As a custom tool built to archive stuff for archive.org

Archivebox has no association with archive.org. Sending URLs to archive.org is just one of its features, which can also be turned off.

reply
the_gorilla
1 day ago
[-]
[flagged]
reply
dang
1 day ago
[-]
We've banned this account for breaking the site guidelines. Please don't create accounts to break HN's rules with.

https://news.ycombinator.com/newsguidelines.html

reply