I've posted issues and PRs for showstopper issues that took months to get merged in: https://github.com/ArchiveBox/ArchiveBox/issues/991 https://github.com/ArchiveBox/ArchiveBox/pull/1026
You have the opportunity to let the community lean in on ArchiveBox. I understand it's hard to do everything as a solo dev; we've seen many cases in the community where solo devs burn out or have personal challenges that take priority, etc.
It's hard for us users to lean in on ArchiveBox when, after a happy month of archiving, things start to break and you're left maintaining a branch of your own fixes that aren't in main. Meanwhile, your solution of soliciting one-time donations just makes the whole project feel more rickety and fly-by-night. How about thinking bigger?
We NEED ArchiveBox to be a real thing. Decentralized tooling for archiving is SO IMPORTANT. I care about it and I suspect many people do. I'm posting this so other people who care about it can also comment and chime in and suggest how it can become something we can rely on. Because archiving isn't just about the past, it's about the future.
Maybe it needs to be a dev org of three committed part-time maintainers, funded through a small foundation that people support with recurring donations? IDK, I'm not an expert at how to make open source resilient. There have been discussions about this in the past, but I think it's worth a serious look, because ArchiveBox is IMPORTANT and I want it to work any month I decide to re-activate my interest in it. I invite people to discuss ways to make this valuable project more sustainable and resilient.
It's only been during the last few months that I decided to go all in on the project, so this is still just the first few pages of a new chapter in the project's history.
(I should also mention that if you're a commercial entity relying on ArchiveBox, you can hire us for dedicated support and uptime guarantees. We have a closed-source fork that has a much better test suite and lots of other goodies.)
Selling custom integrations, managed instances, white-glove support with an SLA, and so on seems like a reasonable funding model for a project based on an open-source, self-hostable platform. But I'm a little disheartened to read that you're maintaining a closed fork with "goodies" in it.
How do you decide which features (better test suite?) end up in the non-libre, payware fork of your software? If someone contributed a feature to the open-source version that already exists in the payware version, would you allow it to be merged or would you refuse the pull request?
The marketplace will work by showing all git repos tagged with the "archivebox" tag on GitHub.
My approval is only needed for PRs to the archivebox core engine.
More info on free vs paid + reasoning why it's not all open source: https://news.ycombinator.com/item?id=41863539
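To make the marketplace idea above concrete: listing repos tagged "archivebox" is a single call to GitHub's public search API. A minimal sketch of that discovery step only (not the actual marketplace code):

```python
import requests

# List repos tagged with the "archivebox" topic on GitHub -- roughly what a
# plugin marketplace listing could surface. Unauthenticated requests to the
# search API are rate-limited, so this is fine for a demo but not production.
resp = requests.get(
    "https://api.github.com/search/repositories",
    params={"q": "topic:archivebox", "sort": "stars", "order": "desc"},
    headers={"Accept": "application/vnd.github+json"},
    timeout=30,
)
resp.raise_for_status()

for repo in resp.json()["items"][:20]:
    print(f'{repo["full_name"]:40}  stars={repo["stargazers_count"]}  {repo["description"]}')
```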
I'd love to see blockchain proof/notary support: the ability to say "content matching this hash existed at this time."
I'm exceptionally busy right now, but that being said, I may choose to contribute nonetheless.
I'd love to connect directly, and will connect to the Zulip instance later.
If we align on values, I may be able to connect you with some cash. People often call me an "anarchist" or "libertarian", though I'm just me, no labels necessary.
If, when I initially archived it, I computed a cryptographic hash of the content and posted that on a blockchain, then at a future date I can at least claim: "As of block N, approximately corresponding to this UTC time, content that hashes to this hash existed."
If multiple unrelated parties also make the same claim, it is stronger evidence.
Is this sufficient explanation? I can expand on this more later.
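The local half of this is trivial; a minimal sketch, with the on-chain posting left out since any public timestamping network would do (the snapshot path is just illustrative):

```python
import hashlib

# Hash the archived content at capture time; the hex digest is the value you
# would post on-chain (or to any public timestamping service) as the claim
# "content with this SHA-256 existed as of block N".
with open("archive/1700000000.0/singlefile.html", "rb") as f:   # illustrative path
    digest = hashlib.sha256(f.read()).hexdigest()

print(digest)

# Later, anyone can re-hash the same bytes and compare against the digest
# recorded at block N; multiple independent parties doing this is what makes
# the claim stronger.
```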
The Internet Archive, as an institution with a reputation that holds up in front of a judge, is actually more valuable than a cryptographic proof that X bytes existed at Y time.
The more people I hear making a claim, the more I'm likely to deem the claim(s) as true. This is even true regarding the claims that cryptographic algorithms have the properties that make them useful in these contexts. I say this as someone who has even taken graduate level classes with Ron Rivest.
I'm not sure what will happen in a court. I imagine the more people that start making claims using cryptography as part of the supporting evidence, the more likely people will start to trust cryptography as a useful tool for resolving disputes about the veracity of claims.
So you would not get any value from multiple people making such claims?
I bring up judges because Internet archive captures have been used as evidence in court cases, the first one I pulled up [0] makes an interesting distinction on whether the archive's snapshots are merely hearsay:
The hearsay rule does not apply to the document (so far as it contains the representation) if the representation was made:
(a) by a person who had or might reasonably be supposed to have had personal knowledge of the asserted fact; or ...
The archive's office manager submitted an affidavit to the court as someone who would have personal knowledge that the date and claimed availability of the content are accurate. There's no cryptography involved, just an individual's and an institution's reputation; this carries much more weight than any number of anonymous individuals attesting to a cryptographic proof.

[0] https://www.judgments.fedcourt.gov.au/judgments/Judgments/fc...
See my comments on TLSNotary stuff below...
I think I read a paper or blog post about this concept a while ago, but never saw it implemented!
You might support cryptographically signing WARC archives; Wayback is particular about archive provenance and integrity, for example.
https://www.loc.gov/preservation/digital/formats/fdd/fdd0005... ("CDX Internet Archive Index File")
https://www.loc.gov/preservation/digital/formats/fdd/fdd0002... ("WARC, Web ARChive file format")
https://github.com/internetarchive/wayback/tree/master/wayba... ("Wayback CDX Server API - BETA")
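Signing a WARC in the plain sense can be as simple as a detached signature over the archive file; a rough sketch using the `cryptography` package, with key handling omitted and the filename illustrative:

```python
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

# Generate (or load) the archive operator's signing key.
key = Ed25519PrivateKey.generate()

# Sign the raw WARC bytes and write a detached signature next to the file.
warc_path = "example.warc.gz"            # illustrative filename
with open(warc_path, "rb") as f:
    signature = key.sign(f.read())
with open(warc_path + ".sig", "wb") as f:
    f.write(signature)

# Publish the raw public key so anyone can verify the signature later.
public_key = key.public_key().public_bytes(
    encoding=serialization.Encoding.Raw,
    format=serialization.PublicFormat.Raw,
)
print(public_key.hex())
```

Note that this only attests that the archive operator vouches for the bytes; it doesn't prove the origin server actually sent them, which is exactly the non-repudiation gap the reply below addresses.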
ArchiveBox is also gearing up to support real cryptographic signing of archives using https://tlsnotary.org/ in an upcoming plugin (in a way that actually solves the TLS non-repudiation issue, which traditional "signing a WARC" does not; more info: https://www.ndss-symposium.org/wp-content/uploads/2018/02/nd...)
This dramatically changes what is possible with signing, because previously, to get ArchiveTeam's signature of approval, they would have had to see the content themselves to archive it. With TLSNotary they can sign without needing to see the content, access the cookies, etc.
Look up "TLS non repudiation"
A real solution like TLSNotary involves a neutral, reputable third party that can't see the cleartext attesting to the ciphertext using a ZK proof.
The neutral third party doing attestation can't see the content so they can't easily tamper with it, and attempts to tamper indiscriminately would be easily detected and ding their reputation.
We do the best we can with a limited standard that is difficult to modify. Archiving is always lossy, we try to reduce that as much as possible, but there are limits. People create web archives because they care about not losing their stuff online, not because they need an accurate record of transfer-encoding property in an HTTP connection. If storing the transfer-encoding is the most important thing, then yes, there are better tools for that.
"Archiving is always lossy" No.
I think he has more context than any of us on the limits of proxy archiving vs browser based archiving.
But also if you really need perfect packet-level replication, just wireshark it as he said. Why bother with WARCs at all?
The web is best-effort, and so is archiving the web.
BTW I recently tried using IPFS for a mutable public storage bucket and that didn't go well - downloads were very slow compared to torrents, and IPNS update propagation took ages. Perhaps torrents will do the job.
The Internet Archive serves a torrent file for every item they host.
I had no idea. I have found the IA serving speed to be pretty terrible. Are the torrents any better? Presumably the only ones seeding the files are IA themselves.

You see, I inherited a boat.
This boat belonged to my father. He was not materialistic, but he took very good care of the things he cared about, and he cared about this boat. It's an old 18' aluminum fishing/cruising boat built in the early 1960s. It's not particularly valuable as a collectible, but it is fairly rare and has some unique modifications. I spent a lot of time trying to dig up all of the info that I could on it, but this is one of those situations where most of the companies involved have been gone for decades and most everyone who was around when these were made is either dead or not really on the Internet.
It's a shame that I waited so long to start my research because 10 or 20 years ago, there were quite a few active web forums containing informational/tutorial threads from the proud owners of these old boats. I know because I have seen references to them. Some of the URLs are in archive.org, some are not. But the forums are gone, so a large chunk of knowledge on these boats is too, probably forever.
I did manage to dig up some interesting articles, pictures, and forum threads and needed a way to save them so that they didn't disappear from the web as well. There is probably an easier way to go about it, but in the end I ran ArchiveBox via Docker and set it to fetching what I could find and then downloaded the resulting pages as self-contained HTML pages.
These days, that kind of info would be locked up in a closed Discord chat somewhere, so you can forget about people 20 years from now ever seeing it.
I also run an instance of ArchiveTeam Warrior which is constantly uploading things to archive.org, and I like the direction ArchiveBox is heading with the distributed/federated archiving on the roadmap, so I may end up setting up an instance like that even if I don't use it for personal content.
But it strips out a lot of detail. Zipping it also means that it's hard to deduplicate. I use WebScrapBook and run rdfind to hardlink all the identical files.
Here's an example of an Amazon "review" I recently archived that has instructions for using a USB tester I have: https://readeck.home.nfriedly.com/@b/tCngVjkSFOrCbwb9DnY2yw
And, for comparison, here's the original: https://www.amazon.com/gp/customer-reviews/R3EF0QW6MAJ0VP
It'd be nice if I could edit out the extra junk near the top, but the important bits are all there.
I'm not saying that SingleFile is bad in any way, I'm using it a lot on multiple devices, but I'm not sure whether sharing archives is a good idea™.
It becomes un-shareable unless we use fake burner accounts for capture, or have really good sanitizing methods.
I should write a full article on the security implications at some point, there aren't many good top-down explanations of why this is a hard problem.
It is possible to use your main account for archiving, but there are security risks (you can't share the snapshots without leaking session headers).
There are a bunch of other alternatives like Readeck listed on our wiki too; we encourage people to check it out!
https://github.com/ArchiveBox/ArchiveBox/wiki/Web-Archiving-...
I prefer WebScrapBook because it saves all the assets as files under the original names in a directory rather than a zip file. This means that I can use other tools such as find, grep, and file managers like Nemo to search the archive without needing to rely on the application that saved the page.
I used to rely on my Pinboard subscription, but apparently archive exports haven't worked for years, so those days are over.
Being able to do RAG on my ArchiveBox is something that I have very much wanted to do for over a year now, and it might finally be within reach without my going and hacking at the archived content tree...
Edit: Just looked at the API schema at https://demo.archivebox.io/api/v1/docs.
No dedicated search endpoint? This looks like a HUGE missed opportunity. I was hoping to be able to query an FTS index on the SQLite database... Have I missed something?
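In the meantime I could presumably go straight at the database; a rough sketch with stdlib sqlite3 (the index filename is ArchiveBox's, but the FTS table and column names here are hypothetical — you'd need to check what the SQLiteFTS backend actually creates):

```python
import sqlite3

# Query ArchiveBox's SQLite index directly instead of going through the API.
# NOTE: "snapshot_fts" and its columns are assumptions for illustration only;
# inspect the real schema first, e.g.: SELECT name FROM sqlite_master;
con = sqlite3.connect("data/index.sqlite3")

for url, title in con.execute(
    "SELECT url, title FROM snapshot_fts WHERE snapshot_fts MATCH ? LIMIT 20",
    ("usb tester",),
):
    print(title, "-", url)
```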
As for the AI stuff, don't worry, none of it is touching core; it's all in an optional community plugin, only for those who want it.
I'm not personally a huge AI person, but I have clients who are already using it and getting massive value from it, so it's worth mentioning. (They're doing some automated QA on thousands of collected captures and feeding the results into spreadsheets.)
My use for this is very different--I want to be able to use a specific subset of my archived pages (which is mostly reference documentation) to "chat" with, providing different LLM prompts depending on subset and fetching plaintext chunks as reference info for the LLM to summarize (and point me back to the archived pages if I need more info).
I can imagine an architecture where I throw everything into ArchiveBox, then run VectorDB as a plugin with Gradio or some such as the client.
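The retrieval half of that doesn't need much glue; a rough sketch of what I have in mind, assuming the plain-text extractor outputs end up as *.txt files under the data dir (exact paths vary by extractor) and using sentence-transformers for the embeddings:

```python
from pathlib import Path

import numpy as np
from sentence_transformers import SentenceTransformer

# 1. Collect plain-text extractor output from the ArchiveBox data dir.
#    (Which .txt files exist depends on the extractors that ran; *.txt is a loose net.)
docs = [(p, p.read_text(errors="ignore")[:2000])
        for p in Path("data/archive").rglob("*.txt")]

# 2. Embed every chunk once; keep the matrix in memory or push it to a vector DB.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode([text for _, text in docs], normalize_embeddings=True)

# 3. At question time: embed the query, take the top-k nearest chunks, and paste
#    them into the LLM prompt as reference material (with links back to the snapshots).
def retrieve(query: str, k: int = 5):
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q
    return [(docs[i][0], float(scores[i])) for i in np.argsort(-scores)[:k]]

for path, score in retrieve("how do I calibrate the usb tester"):
    print(f"{score:.2f}  {path}")
```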
About search: one thing I haven't yet figured out how to do easily is to plug it into my SearXNG instance, as they only seem to support Elasticsearch, Meilisearch, or Solr [0].
So this new plugin architecture will allow for a Meilisearch plugin, I guess (with relevancy ranking).
- [0] https://docs.searxng.org/dev/engines/offline/search-indexer-...
We already provide Sonic, ripgrep, and SQLiteFTS as plugins, so adding something like Solr should be straightforward.
Check out the existing plugins to see how it's done: https://github.com/ArchiveBox/ArchiveBox/pull/1534/files?fil...
archivebox/plugins_search/sonic/*
In the closed-source fork we currently store a merkle tree summary of each dir in a dotfile containing the sha256 and blake3 hashes of all entries/subdirs. When a result is "sealed" the summary is generated, and the final salted hash can be submitted to Solana or ETH or some other network to attest to the time of capture and the content. (That part is coming via a plugin later.)
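For anyone who wants to reproduce the general shape of that in the open, a minimal per-directory merkle summary looks something like this (sha256 only; blake3 via the `blake3` package works the same way, and this is not the fork's actual format):

```python
import hashlib
import json
from pathlib import Path

def dir_summary(d: Path) -> str:
    """Merkle-style hash of a directory: hash each file, recurse into subdirs,
    then hash the sorted {name: hash} mapping to get the directory's root hash."""
    entries = {}
    for child in sorted(d.iterdir()):
        if child.name == ".summary.json":
            continue                      # don't hash our own output
        if child.is_file():
            entries[child.name] = hashlib.sha256(child.read_bytes()).hexdigest()
        elif child.is_dir():
            entries[child.name] = dir_summary(child)
    root = hashlib.sha256(json.dumps(entries, sort_keys=True).encode()).hexdigest()
    # Persist the summary as a dotfile; the root hash is what you would salt and
    # submit to a timestamping network when the snapshot is "sealed".
    (d / ".summary.json").write_text(json.dumps({"root": root, "entries": entries}, indent=2))
    return root

print(dir_summary(Path("archive/1700000000.0")))   # illustrative snapshot dir
```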
We try to make that easy by allowing people to select one or more specific ArchiveBox extractors when adding, so you don't have to archive everything every time.
Makes it more useful for scraping in a pipeline with some other tools.
It's on my roadmap to improve this eventually, but currently I'm focused on saving raw files to a filesystem, because it's more accessible to most users and easier to pipe into other tools.
I encourage people to use ZFS to do deduping and compression at the filesystem layer.
The entire WARC ecosystem is kind of a mess.
Our tools do the best that we can with an old format that is in use by many institutions. The WARC format does not account for H2/H3 data, which is used by most sites nowadays.
The goal of our (Webrecorder) tools is to preserve interactive web content with as much fidelity as possible and make it accessible/viewable in the browser. That means stripping TLS, H2/H3, sometimes forcing a certain video resolution, etc., while preserving the authenticity and interactivity of the site. It can be a tricky balance.
If the goal is to preserve 'raw bytes sent over the network' you can use Wireshark / packet capture, but your archive won't necessarily be useful to a human.
Unfortunately I can't open source the CAPTCHA solving stuff myself, because it opens me up to liability, but if someone wants to contribute a plugin to the ecosystem I can't stop them ;).
What world do we live in where evading a CAPTCHA is illegal?
At the very least they can bar you from accessing their sites as you're violating ToS that you accept upon signup.
You can archive with any Chrome profile with arbitrary extensions enabled, so you can use uBlock, "I still don't care about cookies", Ghostery, etc.
https://github.com/ArchiveBox/ArchiveBox/wiki/Security-Overv...
PS. I'm an archivebox user, not a dev or maintainer.
The big thing that ArchiveBox can't do, and the Internet Archive can, is attest to the accuracy of the archive. Being at least able to prove that the archive was created in the past, prior to there being a reason to tamper with it, is the best we can realistically do with current cryptography. So it'd be really good if support for timestamping was added.
IIUC ArchiveBox is written in Python; OTS has a Python library that should work fine for you: https://github.com/opentimestamps/python-opentimestamps
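If you just want the behavior rather than the low-level primitives, the `ots` CLI from opentimestamps-client (a different package than the library linked above) is enough to wrap; a minimal sketch, with the snapshot path illustrative:

```python
import subprocess

def timestamp(path: str) -> str:
    """Create <path>.ots, an OpenTimestamps proof that the file's hash existed now.

    Requires `pip install opentimestamps-client`, which provides the `ots` command.
    Later, `ots upgrade <path>.ots` fetches the Bitcoin attestation and
    `ots verify <path>.ots` checks it.
    """
    subprocess.run(["ots", "stamp", path], check=True)
    return path + ".ots"

# e.g. timestamp an ArchiveBox snapshot's index file (illustrative path):
print(timestamp("archive/1700000000.0/index.json"))
```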
Timestamping is also on my roadmap, definitely as a plugin (and likely paid), as it's mostly corporate users who really need it. We need to keep some of the really advanced attestation features paid to be able to support the rest of the business.
Last I checked TLSNotary requires a trusted third party. I would strongly suggest timestamping TLSNotary evidence, to be able to prove that evidence was created prior to any of these trusted third parties being compromised.
But we'll start with the basics and work our way up to completeness.
Any examples of other possible really advanced features that might go for-pay?
Is there any chance you will make current free features for-pay? That'd be rather off-putting for me as a home user.
The paid stuff currently is:
- per-user permissions & groups
- audit logging
- auto CAPTCHA solving
- burner credential management for FB/Insta/Twitter/etc. w/ auto phone-based account verification ability
- custom JS scripts for expanding comments, hiding pop ups, etc.
- managed hosting + support
Some of this stuff ^ is going to become free in upcoming releases, some will stay paid. What I decide to make free is mostly based on abuse potential and legal ramifications, I'd rather have a say in how the risky stuff is used so that it doesn't become a tool weaponized for botting.
But more critically: currently the legal standard for evidence is... screenshots. We have a lot of educating work to do before the public understands the value of attestation and signing.
Timestamps can prove that the data existed prior to there being a known reason to modify it. While that's not as good as direct signing, that's often still enough to be very useful. The statement that OTS "can not currently prove anything" is incorrect.
A really good example of this is the Hunter Biden email verification. I used OpenTimestamps to prove that the DKIM key that signed the email was in fact used by Google at the time, by providing a Google-signed email that had been timestamped years ago: https://github.com/robertdavidgraham/hunter-dkim/tree/main/o...
That's convincing evidence, because it's highly implausible that I would have been working to fake Hunter's emails years before they even came up as an election issue.
I have been running an instance for almost 2 years now that I use for archiving articles that I reference in my notes.
I'm happy to help package this up once it is available.
Currently `mkdir tmp_data && cd tmp_data; archivebox install; archivebox add ...` is effectively equivalent to what `abx-dl` will do.
Really? I don't get that feeling at all. I use Evernote to archive anything I consider worth keeping. I wonder where such "fear of archiving" comes from.
It all has the same effect of making it harder to archive though.
Also I see django-ninja! Very cool.
- backing up an entire page
Yes, it is hard. Yes, for non-pure-HTML pages it is extra painful, but that would honestly take ArchiveBox from nice-to-have to: yes, I have an actual archive I can use when stuff goes down.
If so, that's starting to roll out in v0.8.5rc50; check out the archivebox/crawls/ folder.
If you mean archiving a single page more thoroughly, what do you find is missing in ArchiveBox? Are you able to get singlefile/chrome/wget HTML when archiving?
Lemme check my current version ( edit: 0.7.2 -- ty, I will update and test soon :D)
You can play around with the models and tasks, but I would wait a few weeks for it to stabilize and check again; it's still under heavy active development.
Check archivebox/archivebox:dev periodically
You guys probably hear it all the time, but you are doing the Lord's work. If I thought I could be of use in that project, I would be trying to contribute myself (in fact, let me see if there is a way I can participate in a useful manner).
I have documented that here: https://darkstar.github.io/2022/02/07/k3s-on-raspberrypi-at-...
Note that this was a rather old version and some things have probably changed since, so YMMV, but it might still provide a good reference for those who want to try.
Happy to report that most of the quirks you cover have been improved:
- uid 999 is no longer enforced; you can pass any PUID:PGID now (like Linuxserver.io containers)
- it now accepts ADMIN_USERNAME + ADMIN_PASSWORD env vars to create an initial admin user on first start without having to exec
- archivebox/archivebox:latest is 0.7.2 (yearly stable release) and :dev is the 0.8.x pre-release, updated daily. All images are amd64 & arm64 compatible.
- SingleFile and Sonic are now included in all images & available on both amd64 and arm64
Also, how do you plan to ensure data authenticity across a distributed archive? For example, if I archive someone's blog, what is stopping me from inserting inflammatory posts that they never wrote and passing them off as the real deal? Slight update: I see you're using TLSNotary! That's exactly what I would have suggested!
Working hard on making it more accessible in the future, and plugins should help!
I love forums and want them to continue; I'm not sure where you got the idea that I dislike them as a medium. I was just pointing out that public sites in general have started to see some attrition lately for a variety of reasons, and the tooling needs to keep up with new mediums as they appear.
I also make no apology for the content, in fact ArchiveBox is explicitly designed to archive the most vile stuff for lawyers and governments to use for long term storage or evidence collection. One of our first prospective clients was the UN wanting to use it to document Syrian war crimes. The point there was that we can save stuff without amplifying it, and that's sometimes useful in niche scenarios.
Lawyers/LE especially don't want to broadcast to the world (or tip off their suspect) that they are investigating or endorsing a particular person, so the ability to capture without publicly announcing/mirroring every capture is vital.
You might not want to amplify and broadcast the fact that you're archiving it to the world.
I'm sure it works for some people, but not me.
The new v0.8 adds a BG queue specifically to deal with the issue of stalling when some sites fail. There was a system to do this in the past, but it was imperfect and mostly optimized for the docker setup where a scheduler is running `archivebox update` every few hours to retry failed URLs.
Site compatibility is much improved with the new BETA, but it's a perpetual cat-and-mouse game to fix specific sites, which is why we think the new plugin system is the way forward. It's just not sustainable for a single company (really just me right now) to maintain hundreds of workarounds for each individual site. I'm also discussing with the Webrecorder and Archive.org teams how we can share these site-specific workarounds as cross-compatible plugins (aka "behaviors") between our various software.
> it quietly leaks everything you archive to archive.org by default
It's prominently mentioned many times (at least 4) on our homepage that this is the default, and archiving public-only sites (which are already fair game for Archive.org) is a default for good reason. Archiving private content requires several important changes and security considerations. More context: https://news.ycombinator.com/item?id=26866689
I love ArchiveBox btw, thank you for your effort! It's filling a very important need.
Sending everything to archive.org is a bad default and it erodes a certain level of trust in the project. Requiring "several important changes and security considerations" just makes it a non-starter. The default settings should be "safe" for the default user, because, as you mentioned in that post, 90% of users are never going to change them. Users should be able to run it locally and archive data without worrying about security issues, unless you only want experts to be able to use your software.
There's also a contradiction between your statement and your blog post: someone saving their photos isn't going to want to worry about whether they configured your tool correctly, or about leaking all the group chat logs or grandma's photos.
>It's prominently mentioned many times (at least 4) on our homepage that this is the default, and archiving public-only sites (which are already fair game for Archive.org) is a default for good reason. Archiving private content requires several important changes and security considerations. More context
> Who cares about saving stuff?
> All of us have content that we care about, that we want to see preserved, but privately:
> families might want to preserve their photo albums off Facebook, Flickr, Instagram
> individuals might want to save their bookmarks, social feeds, or chats from Signal/Discord
> companies might want to save their internal documents, old sites, competitor analyses, etc.
I want the project to do well but it really needs to be secure by default.
I 100% agree, but because private archiving is doable but NOT 100% safe yet, I can't make that mode the default. The difficult reality currently is that archiving anything non-public is not simple to make safe.
Every capture will contain reflected session cookies, usernames, PII, and other sensitive content. People don't understand that this means if they share a snapshot of one page, they're potentially leaking their login credentials for an entire site.
It is possible to do safely, and we provide ways to achieve that which I'm constantly working on improving, but until it's easy and straightforward and doesn't require any user education on security implications, I can't make it the default.
The goal is to get it to the point where it CAN be the default, but I'm still at least 6mo away from that point. Check out the archivebox/sessions dir in the source code for a look at the development happening here.
Until then, it requires some user education and setting up a dedicated Chrome profile + cookies + tweaking the config (as an intentional barrier to entry for private archiving).
> And that was this week's newsletter! Congratulations for reading to the bottom, dear 198.51.100.1.
Even if the archivebox instance noted its own IP to do a search-and-replace like s|198\.51\.100\.1|XXX.XXX.XXX.XXX| on the snapshot it is about to create, it's possible to craft a response that obscures the presence of the information, such as by encoding the IP like this: MTk4LjUxLjEwMC4xCg==. I.e. steganography (https://en.wikipedia.org/wiki/Steganography).
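To make that concrete, here's a quick sketch of why literal search-and-replace can't be trusted as a sanitizer (the page construction is just illustrative):

```python
import base64
import re

ip = "198.51.100.1"
encoded = base64.b64encode((ip + "\n").encode()).decode()   # 'MTk4LjUxLjEwMC4xCg=='

# A hostile page can reflect the visitor's IP both in cleartext and obfuscated:
page = f"Congratulations for reading to the bottom, dear {ip}. <!-- beacon:{encoded} -->"

# Naive anonymization only scrubs the literal form...
scrubbed = re.sub(re.escape(ip), "XXX.XXX.XXX.XXX", page)

print(ip in scrubbed)        # False: the visible copy is gone
print(encoded in scrubbed)   # True: the base64 copy (and anything cleverer) survives
```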
Being able to anonymize archives before sharing them is something I would find interesting, but I don't think you can beat steganography, so I'm wondering what exactly you mean you plan to do.
I've been very impressed by all of your responses in here, but that one in particular shows empathy, compassion, and a deep deep subject matter expertise.
I can see why you would want such a tool, but it seems like a direct divergence from the core goal of the existing codebase.
ArchiveBox has no association with archive.org. Sending URLs to archive.org is just one of its features, which can also be turned off.