Gwtar: A static efficient single-file HTML format
308 points
1 month ago
| 21 comments
| gwern.net
| HN
simonw
1 month ago
[-]
TIL about window.stop() - the key to this entire thing working: it causes the browser to stop loading any more assets: https://developer.mozilla.org/en-US/docs/Web/API/Window/stop

Apparently every important browser has supported it for well over a decade: https://caniuse.com/mdn-api_window_stop

Here's a screenshot illustrating how window.stop() is used - https://gist.github.com/simonw/7bf5912f3520a1a9ad294cd747b85... - everything after <!-- GWTAR END is tar compressed data.

Posted some more notes on my blog: https://simonwillison.net/2026/Feb/15/gwtar/
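For anyone curious about the byte arithmetic behind the trick, here is a toy sketch (not the actual Gwtar code; the layout, marker handling, and manifest shape are invented for illustration) of how a single file can carry HTML up front and addressable binary data after a marker:

```javascript
// Illustrative sketch of the single-file layout: an HTML header ending in a
// marker comment, with binary data appended after it. In the browser, a
// script in the header would call window.stop() so the parser never reaches
// the appended bytes; here we just show the byte-offset arithmetic.

const MARKER = "<!-- GWTAR END"; // marker name taken from the screenshot above

// Build a toy archive: header + marker + two "assets" back to back.
const header = "<html><script>window.stop()</script>";
const assetA = Buffer.from("fake-jpeg-bytes");
const assetB = Buffer.from("fake-css-bytes");
const file = Buffer.concat([Buffer.from(header + MARKER), assetA, assetB]);

// A manifest (normally embedded in the header) records absolute byte offsets.
const dataStart = header.length + MARKER.length;
const manifest = {
  "a.jpg": { offset: dataStart, length: assetA.length },
  "b.css": { offset: dataStart + assetA.length, length: assetB.length },
};

// Reading an asset is then just a byte slice -- over HTTP this becomes a
// Range request for exactly those bytes.
function readAsset(buf, entry) {
  return buf.subarray(entry.offset, entry.offset + entry.length);
}

console.log(readAsset(file, manifest["a.jpg"]).toString()); // fake-jpeg-bytes
```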

reply
moritzwarhier
1 month ago
[-]
Not the inverse, but for any SPA (not framework or library) developers seeing this, it's probably worth noting that this is no better than using document.write, window.open, and similar APIs.

But could be very interesting for use cases where the main logic lives on the server and people try to manually implement some download- and/or lazy-loading logic.

Still probably bad unless you're explicitly working on init and redirect scripts.

reply
Lerc
1 month ago
[-]
I wonder if this is compatible with Claude Artifacts.

I made my own bundler skill that lets me publish artifacts https://claude.ai/public/artifacts/a49d53b6-93ee-4891-b5f1-9... that can be decomposed back into the files, but it is just a compressed base64 chunk at the end.

I guess the next question will be if it does work in environments that let you share a single file, will they disable this ability once they find out people are using it.

reply
8n4vidtmkvmk
1 month ago
[-]
Neat! I didn't know about this either.

Php has a similar feature called __halt_compiler() which I've used for a similar purpose. Or sometimes just to put documentation at the end of a file without needing a comment block.

reply
BobbyTables2
1 month ago
[-]
Sounds delicious for poisoning search engine crawlers and other bots…
reply
tym0
1 month ago
[-]
I was on board until I saw that these can't easily be opened from a local file. Local access seems like one of the main use cases for archival formats.
reply
NoMoreNicksLeft
1 month ago
[-]
HTML is already a good single-file format. Images can be inlined with data URIs. CSS and JavaScript have been inlineable since the very beginning. What more is needed? Fonts? Data URIs, once more.

Hell, HTML is probably what word processor apps should be saving everything as. You can get pixel-level placement of any element if you want that.

reply
Quarrel
1 month ago
[-]
They explicitly contrast it with single-file HTML, giving an example that is much more performant than waiting for a single 280MB HTML file to load.

Yes, they're both approximately the same in terms of size on disk, and even network traffic for a fully loaded page, but one is a much better browser experience.

> You can get pixel-level placement of any element if you want that.

You may well be able to, but it is largely anathema to the goals of html.

reply
avaer
1 month ago
[-]
Agreed, I was thinking it's like asm.js where it can "backdoor pilot" [1] an interesting use case into the browser by making it already supported by default.

But not being able to "just" load the file into a browser locally seems to defeat a lot of the point.

[1] https://en.wikipedia.org/wiki/Television_pilot#Backdoor_pilo...

reply
deevus
1 month ago
[-]
Could it be solved with a viewer program? Any static HTML server?
reply
WorldMaker
1 month ago
[-]
Any static HTML server. Also if you try to load the page directly it suggests just untarring the contents back into a folder structure and provides a perl command line as a suggestion for how to do that.
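The untarring step can be sketched like this (a toy example, not the exact command the real file embeds; the real Gwtar provides its own one-liner with the correct offset, while here we build a stand-in file so the header size is known):

```shell
# Sketch of recovering the files from a gwtar-style archive by hand:
# strip the HTML header, then untar the bytes that follow it.
set -e
cd "$(mktemp -d)"

# Build a toy archive: a fake HTML header with a tar appended after it.
mkdir site && echo "hello" > site/index.txt
tar -cf payload.tar site
printf '<html><!-- HEADER -->' > page.gwtar.html   # stand-in HTML header
HEADER_BYTES=$(wc -c < page.gwtar.html)
cat payload.tar >> page.gwtar.html

# tail -c +N outputs starting at byte N (1-indexed), so the tar stream
# begins at HEADER_BYTES + 1.
mkdir restored
tail -c +$((HEADER_BYTES + 1)) page.gwtar.html | tar -xf - -C restored
cat restored/site/index.txt   # hello
```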
reply
qingcharles
1 month ago
[-]
It sounds like it would be pretty easy to write a super simple app with a browser in it that you could associate with the file type to spin these up. IMO.
reply
vessenes
1 month ago
[-]
I mean `claude -p "spin up a python webserver in this directory please"` or alternately `python -m http.server 8080 --bind 127.0.0.1 --directory .` is not hard
reply
nunobrito
1 month ago
[-]
Sure, but opening ports tends to be a headache when all you want to do is view the contents.

In this case I wonder if the format can be further optimized. For example, .js files are supported for loading locally, and albeit a very inefficient way to load assets, that could work around this local-disk limitation; nobody reads the HTML source code anyway, so it won't need to win any code beauty contests. I'll look into this theory later and ping the author in case it works.

reply
p410n3
1 month ago
[-]
> For example, .js files are supported for loading locally

Technically yes, but last time I tried that it gives you CORS errors now. You can start your browser with CORS disabled, but that makes sharing your .html file impossible. So back we go to inlining stuff :)

reply
nunobrito
1 month ago
[-]
Yep, inlining it seems. :-(
reply
WorldMaker
1 month ago
[-]
`npx http-server` anywhere node is installed
reply
chungy
1 month ago
[-]
althttpd is even easier. :)
reply
bmn__
1 month ago
[-]
Wouldn't recommend, only packaged on Alpine and nix.
reply
chungy
1 month ago
[-]
It's a single-binary program that's easy to compile. You don't need to depend on packaging...
reply
nunobrito
1 month ago
[-]
In case the author is reading: Please consider adding official fields for an optional screenshot of the page in Base64 encoding, and permit adding an (optional) description. It would also help to have official fields for the ISO timestamp of when the archival took place.

As a final wish-list item, it would be great to have multiple versions/crawls of the same URL with deduplication of static assets (images, fonts), but this is likely stretching the format too much.

reply
gwern
1 month ago
[-]
Allowing more metadata might be useful. You can add anything to the manifest at build time, as assets are not required to be loaded or ever used (because this is impossible to statically check). I suppose we'd have to define an official prefix like 'gwtar-metadata-*', with like a 'gwtar-metadata-screenshot' and 'gwtar-metadata-description'... Not obvious what the best way forward is there; you don't want to add a whole bunch of ad hoc metadata fields, since everyone will have a different one they want. Exif...?

Multiple versions or multiple pages (maybe they can be the same thing?) would be nice but also unclear how to make that. An iframe wrapper?

I considered and rejected deduplication and compression. Those can be done by the filesystem/server transparent to the format. (If there's an image file duplicated across multiple pages, then it should be trivial for any filesystem or server to detect or compress those away.)

reply
nunobrito
1 month ago
[-]
If possible, I'd ask for a shorter tag name to keep it more readable. For example, "gwtar-screenshot" and "gwtar-description" would work. I've just asked to make it official because otherwise it's difficult to get different parsers to agree in the future.

> An iframe wrapper?

The way Archive.org does this navigation between multiple versions is quite pleasant to use. Don't know for sure but might be an iframe added on top.

reply
calebm
1 month ago
[-]
Very cool idea. I think single-file HTML web apps are the most durable form of computer software. A few examples of Single-File Web Apps that I wrote are: https://fuzzygraph.com and https://hypervault.github.io/.
reply
zetanor
1 month ago
[-]
The author dismisses WARC, but I don't see why. To me, Gwtar seems more complicated than a WARC, while being less flexible and while also being yet another new format thrown onto the pile.
reply
simonw
1 month ago
[-]
I don't think you can provide a URL to a WARC that can be clicked to view its content directly in your browser.
reply
zetanor
1 month ago
[-]
At the very least, WARC could have been used as the container ("tar") format after the preamble of Gwtar. But even there, given that this format doesn't work without a web server (unlike SingleFile, mentioned in the article), I feel like there's a lot to gain by separating the "viewer" (Gwtar's javascript) from the content, such that the viewer can be updated over time without changing the archives.

I certainly could be missing something (I've thought about this problem for all of a few minutes here), but surely you could host "warcviewer.html" and "warcviewer.js" next to "mycoolwarc.warc" "mycoolwrc.cdx" with little to no loss of convenience, and call it a day?

reply
gwern
1 month ago
[-]
You could potentially use WARC instead of Tar as the appended container, sure, but that's a lot of complexity, WARC doesn't serialize the rendered page (so what is the greater 'fidelity' actually getting you?) and SingleFile doesn't support WARC, and I don't see a specific advantage that a Gwtar using WARC would have. The page rendered what it rendered.

And if you choose to require separate files and break single-file, then you have many options.

> surely you could host "warcviewer.html" and "warcviewer.js" next to "mycoolwarc.warc" "mycoolwrc.cdx"

I'm not familiar with warcviewer.js and Googling isn't showing it. Are you thinking of https://github.com/webrecorder/wabac.js ?

reply
zetanor
1 month ago
[-]
I should have been a bit more verbose as I didn't mean to send anyone on a wild goose chase. The "warcviewer.{html,js}" part was just a hypothetical viewer to illustrate having a static client-side "web app" that functions much like Gwtar, but separately from payloads.

To expand what I have in mind, it'd be a script like Gwtar, except it loads WARCs through URLs to CDX files. Alternatively, it might also load WARC files fully to memory, where an index could be constructed on the fly. In the latter case, that would allow the same viewer to be used with or without a web server. Though, I can imagine that loading archives without a web server was probably out-of-scope for Gwtar, otherwise something could have been figured out (e.g., putting the tar in a <textarea>'s RCDATA; do browsers support "binary" data in there correctly?).

While the WARC specs are a mess (sometimes quite ambiguous), I've never had much trouble reading or writing them. As for why WARC, having the option to preserve request/response metadata, as well as having interoperability with anything else in the WARC ecosystem, would be nice. Also, a separate viewer would naturally be updateable without changing the archive files themselves.

reply
cxr
1 month ago
[-]
> e.g., putting the tar in a <textarea>'s RCDATA; do browsers support "binary" data in there correctly?

A <script> data block would be the officially sanctioned way to do it. (This use case is part of the spec.)

reply
gwern
1 month ago
[-]
I see. You could probably build something on top of wabac.js... But you'd need some sort of multi-file setup to support the indirection, I suppose.

> I imagine that loading archives without a web server was probably out-of-scope for Gwtar

More that it's just not important to us. I don't even look at the archives 'locally'. They are all archives of public web pages, which I just rehost publicly. When I want to look at them, I open them on Gwern.net like anyone else!

And if I really needed to, for some reason, it's literally a Bash one-liner (already provided inside the Gwtar as well as my writeup) to turn them back into a normal multi-file HTML. (This is a lot more than you can say for a WARC...) So my reaction to the complaints about lacking local viewing is mostly just ¯\_(ツ)_/¯

> (e.g., putting the tar in a <textarea>'s RCDATA; I wonder how well browsers support "binary" data in there?)

I don't know the details but you can just base-encode them, so I suppose that's an option, as long as you rewrote the ranges appropriately, maybe?

(Also worth noting that you can go the other way: if you really desperately want to preserve the raw header responses, you can just use the flexibility of Gwtar to append the WARC to the end of the file. As long as the range requests work, users won't download that part. The duplication is not so great for long-term storage, but you can just XZ them and that should remove duplication and overhead.)

reply
obscurette
1 month ago
[-]
WARC is mentioned with very specific reason not being good enough: "WARCs/WACZs achieve static and efficient, but not single (because while the WARC is a single file, it relies on a complex software installation like WebRecorder/Replay Webpage to display)."
reply
gildas
1 month ago
[-]
I would like to know why the ZIP/HTML polyglot format produced by SingleFile [1] and mentioned in the article is said to "achieve static, single, but not efficiency". What's not efficient compared to the gwtar format?

[1] https://github.com/gildas-lormeau/Polyglot-HTML-ZIP-PNG

reply
gwern
1 month ago
[-]
'efficiency' is downloading only the assets needed to render the current view. How does it implement range requests and avoid downloading the entire SingleFileZ when a web browser requests the URL?
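The "efficiency" leg of the trilemma boils down to standard HTTP Range requests. A minimal sketch (offsets and the manifest entry are made up; the fetch call is shown in a comment because it needs a live server that honors Range):

```javascript
// Sketch of the "efficient" part: instead of downloading the whole archive,
// the header script asks the server for just the bytes of one asset via a
// standard HTTP Range header.

function rangeHeaderFor(entry) {
  // HTTP byte ranges are inclusive on both ends.
  return `bytes=${entry.offset}-${entry.offset + entry.length - 1}`;
}

// What the real code would do, roughly (server must honor Range):
//   fetch(url, { headers: { Range: rangeHeaderFor(entry) } })
//     .then(r => r.arrayBuffer())  // 206 Partial Content: only those bytes

// Simulate the server side on a local buffer to show the arithmetic.
function serveRange(file, headerValue) {
  const [, start, end] = headerValue.match(/^bytes=(\d+)-(\d+)$/).map(Number);
  return file.subarray(start, end + 1); // inclusive end
}

const file = Buffer.from("HTML-HEADER|IMAGE-BYTES|FONT-BYTES");
const entry = { offset: 12, length: 11 }; // the "IMAGE-BYTES" slice
console.log(serveRange(file, rangeHeaderFor(entry)).toString()); // IMAGE-BYTES
```

A SingleFileZ-style page that downloads the whole file first pays for every asset up front; with ranges, only the bytes for the current view cross the network.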
reply
gildas
1 month ago
[-]
I haven't looked closely, but I get the impression that this is an implementation detail which is not really related to the format. In this case, a polyglot zip/html file could also interrupt page loading via a window.stop() call and rely on range requests (zip.js supports them) to unzip and display the page. This could also be transparent for the user, depending on whether the file is served via HTTP or not. However, I admit that I haven't implemented this mechanism yet.
reply
gwern
1 month ago
[-]
> that this is an implementation detail which is not really related to the format. In this case, a polyglot zip/html file could also interrupt page loading via a window.stop() call...However, I admit that I haven't implemented this mechanism yet.

Well, yes. That's why we created Gwtar and I didn't just use SingleFileZ. We would have preferred to not go to all this trouble and use someone else's maintained tool, but if it's not implemented, then I can't use it.

(Also, if it had been obvious to you how to do this window.stop+range-request trick beforehand, and you just hadn't gotten around to implementing it, it would have been nice if you had written it up somewhere more prominent; I was unable to find any prior art or discussion.)

reply
gildas
1 month ago
[-]
The reason I did not implement the innovative mechanism you describe is because, in my case, all the technical effort was/is focused on reading the archive from the filesystem. No one has suggested it either.

Edit: Actually, SingleFile already calls window.stop() when displaying a zip/html file from HTTP, see https://github.com/gildas-lormeau/single-file-core/blob/22fc...

reply
gwern
1 month ago
[-]
What does that do?
reply
gildas
1 month ago
[-]
The call to window.stop() stops HTML parsing/rendering, which is unnecessary since the script has downloaded the page via HTTP and will decompress it as-is as a binary file (zip.js supports concatenated payloads before and after the zip data). However, in my case, the call to window.stop() is executed asynchronously once the binary has been downloaded, and therefore may come too late. This is probably less effective than in your case with Gwtar.

I implemented this in the simplest way possible because if the zip file is read from the filesystem, window.stop() must not be called immediately because the file must be parsed entirely. In my case, it would require slightly more complex logic to call window.stop() as early as possible.

Edit: Maybe it's totally useless though, as documented here [1]: "Because of how scripts are executed, this method cannot interrupt its parent document's loading, but it will stop its images, new windows, and other still-loading objects." (you mentioned it in the article)

[1] https://developer.mozilla.org/en-US/docs/Web/API/Window/stop

Edit #2: Since I didn't know that window.stop() was most likely useless in my case, I understand your approach much better now. Thank you very much for clarifying that with your question!

reply
gwern
1 month ago
[-]
Well, it seems easy enough to test if you think you are getting efficiency 'for free'. Dump a 10GB binary into a SingleFileZ, and see if your browser freezes.
reply
gildas
1 month ago
[-]
I just ran a test on a 10GB HTML page and called window.stop() via a 100ms setTimeout, which, in my opinion, simulates what would happen in a better-implemented case in SingleFile if the call to window.stop() were made as soon as the HTTP headers of the fetch request are received (i.e. easy fix). And it actually works. It interrupts the loading at approx. 15MB of data, the rendering of the page, and it's partially and smoothly displayed (no freeze). So it's not totally useless but it deserves to be optimized at a minimum in SingleFile, as I indicated. In the end, the MDN documentation is not very clear...

Edit: I've just implemented the "good enough on my machine" fix, aka the "easy fix": https://github.com/gildas-lormeau/single-file-core/commit/a0....

Edit #2: I've just understood that "parent" in "this method cannot interrupt its *parent* document's loading" from the MDN doc probably means the "parent" of the frame (when the script is running into it).

reply
gwern
1 month ago
[-]
OK, so assuming you clean that up a bit and this becomes officially supported in SingleFile/SingleFileZ, what is missing compared to Gwtar? Anything important or just optional features like image recompression and PAR2?
reply
westurner
1 month ago
[-]
Does this verify and/or rewrite the SRI integrity hashes when it inlines resources?

Would W3C Web Bundles and HTTP SXG Signed Exchanges solve for this use case?

WICG/webpackage: https://github.com/WICG/webpackage#packaging-tools

"Use Cases and Requirements for Web Packages" https://datatracker.ietf.org/doc/html/draft-yasskin-wpack-us...

reply
gwern
1 month ago
[-]
> Does this verify and/or rewrite the SRI integrity hashes when it inlines resources?

As far as I know, we do not have any hash verification beyond that built into TCP/IP or HTTPS etc. I included SHA hashes just to be safe and forward compatible, but they are not checked.

There's something of a question here of what hashes are buying you here and what the threat model is. In terms of archiving, we're often dealing with half-broken web pages (any of whose contents may themselves be broken) which may have gone through a chain of a dozen owners, where we have no possible web of trust to the original creator, assuming there is even one in any meaningful sense, and where our major failure modes tend to be total file loss or partial corruption somewhere during storage. A random JPG flipping a bit during the HTTPS range request download from the most recent server is in many ways the least of our problems in terms of availability and integrity.

This is why I spent a lot more time thinking about how to build FEC in, like with appending PAR2. I'm vastly more concerned about files being corrupted during storage or the chain of transmission or damaged by a server rewriting stuff, and how to recover from that instead of simply saying 'at least one bit changed somewhere along the way; good luck!'. If your connection is flaky and a JPEG doesn't look right, refresh the page. If the only Gwtar of a page that disappeared 20 years ago is missing half a file because a disk sector went bad in a hobbyist's PC 3 mirrors ago, you're SOL without FEC. (And even if you can find another good mirror... Where's your hash for that?)

> Would W3C Web Bundles and HTTP SXG Signed Exchanges solve for this use case?

No idea. It sounds like you know more about them than I do. What threat do they protect against, exactly?

reply
westurner
1 month ago
[-]
The IETF spec lists a number of justifying use cases. SXG was rejected at the time for a number of reasons, IIUC.

Browsers check SRI integrity hashes if they're there

There's HTTP-in-RDF, and Memento protocol. VCR.py and similar can replay HTTP sessions, but SSL socket patching or the TLS cookie or adding a cert for e.g. an archiving https proxy is necessary

Browser Devtools can export HAR HTTP archives

If all of the resource origins are changed to one hostname for archival, that bypasses same-origin controls on JS and cookies, such that the archived page runs all of its scripts in the same origin that the archive is served from? Also, browsers have restrictions on even inline JS scripts served from file:/// URLs.

FWIU Web Bundles and SXG were intended to preserve the unique origins of resources in order to safely and faithfully archive for interactive offline review.

reply
pseudosavant
1 month ago
[-]
I’ve thought about doing something similar, but at the Service Worker layer so the page stays the same and all HTTP requests are intercepted.

Similar to the window.stop() approach, requests would truncate the main HTML file while the rest of that request would be the assets blob that the service worker would then serve up.

The service worker file could be a dataURI to keep this in one file.
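A rough sketch of that idea (all names, the manifest shape, and the blob layout here are invented; the actual fetch-interception handler is shown in a comment because a Service Worker can't run outside a browser):

```javascript
// Sketch of the Service Worker variant: intercept asset requests and answer
// them from a blob appended to the main HTML file.

// Pure routing logic: map a requested URL to a slice of the appended blob.
function lookupAsset(manifest, url) {
  const path = new URL(url).pathname;
  return manifest[path] || null; // null -> fall through to the network
}

// In the actual service worker, this would look something like:
//   self.addEventListener("fetch", (event) => {
//     const entry = lookupAsset(manifest, event.request.url);
//     if (entry) {
//       event.respondWith(new Response(
//         assetsBlob.slice(entry.offset, entry.offset + entry.length),
//         { headers: { "Content-Type": entry.type } }));
//     }
//   });

const manifest = {
  "/style.css": { offset: 0, length: 10, type: "text/css" },
  "/logo.png": { offset: 10, length: 20, type: "image/png" },
};
console.log(lookupAsset(manifest, "https://example.com/logo.png"));
```

The appeal of this layer is that the archived page itself needs no rewriting at all: every request it makes is intercepted transparently.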

reply
mr_mitm
1 month ago
[-]
Pretty cool. I made something similar (much more hacky) a while ago: https://github.com/AdrianVollmer/Zundler

Works locally, but it does need to decompress everything first thing.

reply
gwern
1 month ago
[-]
So this is like SingleFileZ in that it's a single static inefficient HTML archive, but it can easily be viewed locally as well?

How does it bypass the security restrictions which break SingleFileZ/Gwtar in local viewing mode? It's complex enough I'm not following where the trick is and you only mention single-origin with regard to a minor detail (forms).

reply
mr_mitm
1 month ago
[-]
The content is in an iframe, my code is outside of it, and the two frames are passing messages back and forth. Also I'm monkey patching `fetch` and a few other things.
reply
gwern
1 month ago
[-]
OK, but how does that get you 'efficiency' if you're doing this weird thing where you serialize the entire page into some JSON blob and pass it in to an iframe or whatever? That would seem to destroy the 'efficiency' property of the trilemma. How do you get the full set of single-file, static, and efficient, while still working locally?
reply
mr_mitm
1 month ago
[-]
I suppose it's not efficient in that sense. As I said, the browser has to unpack the whole thing first, so you see a spinner when opening large files. But it's inserted into the DOM on demand, so it's not overwhelming the browser because it's just a blob in memory.
reply
gwern
1 month ago
[-]
OK, so your format doesn't solve the trilemma; you have 'static' and 'single-file', but not 'efficient'. Seems like you might as well just go with MHT or SingleFile then...?
reply
overgard
1 month ago
[-]
Interesting, but I'm kind of confused why you'd need lazy loading for a local file? Like, how big are these files expected to be? (Or is the lazy loading just to support lazy loading the page is already doing?)
reply
skybrian
1 month ago
[-]
I believe the idea is that it's not local. It's a very large file on an HTTP server (required for range requests) and you don't want to download the whole thing over the network.

Of course, since it's on an HTTP server, it could easily handle doing multiple requests of different files, but sometimes that's inconvenient to manage on the server and a single file would be easier.

Maybe this is downstream of Gwern choosing to use MediaWiki for his website?

reply
gwern
1 month ago
[-]
Yes, network is assumed. If it's local, there's no problem, just use MHT or SingleFile!

> Maybe this is downstream of Gwern choosing to use MediaWiki for his website?

This has nothing at all to do with the choice of server. The benefit of being a single-file, with zero configuration or special software required by anyone who ever hosts or rehosts a Gwtar in the future, would be true regardless of what wiki software I run.

(As it happens, Gwern.net has never used MediaWiki, or any standard dynamic CMS. It started as Gitit, and is now a very customized Hakyll static site with a lot of nginx options. I am surprised you thought that because Gwern.net looks nothing like any MediaWiki installation I have seen.)

reply
skybrian
1 month ago
[-]
Yeah I'm not sure why I thought that.
reply
renewiltord
1 month ago
[-]
Hmm, I’m interested in this: especially since it applies no compression, delta encoding might be feasible for daily scans of the data. But for whatever reason my Brave mobile on iOS displays a blank page for the example page. Hmm, perhaps it’s a mobile rendering issue, because Chrome and Safari on iOS can’t do it either: https://gwern.net/doc/philosophy/religion/2010-02-brianmoria...
reply
isr
1 month ago
[-]
Hmm, so this is essentially the appimage concept applied to web pages, namely:

- an executable header

- which then fuse mounts an embedded read-only heavily compressed filesystem

- whose contents are delivered when requested (the entire dwarf/squashfs isn't uncompressed at once)

- allowing you to pack as many of the dependencies as you wish to carry in your archive (so, just like an appimage, any dependency which isn't packed can be found "live")

- and doesn't require any additional, custom infrastructure to run/serve

Neat!

reply
karel-3d
1 month ago
[-]
The example link doesn't work for me at all in iOS safari?

https://gwern.net/doc/philosophy/religion/2010-02-brianmoria...

I will try on Chrome tomorrow.

reply
woodruffw
1 month ago
[-]
It also doesn't work on desktop Safari 26.2 (or perhaps it does, but not to the extent intended -- it appears to be trying to download the entire response before any kind of content painting.)
reply
Retr0id
1 month ago
[-]
It's fairly common for archivers (including archive.org) to inject some extra scripts/headers into archived pages or otherwise modify the content slightly (e.g. fixing up relative links). If this happens, will it mess up the offsets used for range requests?
reply
gwern
1 month ago
[-]
The range requests are to offsets in the original file, so I would think that most cases of 'live' injection do not necessarily break it. If you download the page and the server injects a bunch of JS into the 'header' on the fly and the header is now 10,000 bytes longer, then it doesn't matter, since all of the ranges and offsets in the original file remain valid: the first JPG is still located starting at offset byte #123,456 in $URL, the second one is located starting at byte #456,789 etc, no matter how much spam got injected into it.

Beyond that, depending on how badly the server is tampering with stuff, of course it could break the Gwtar, but then, that is true of any web page whatsoever (never mind archiving), and why they should be very careful when doing so, and generally shouldn't.

Now you might wonder about 're-archiving': if the IA serves a Gwtar (perhaps archived from Gwern.net), and it injects its header with the metadata and timeline snapshot etc, is this IA Gwtar now broken? If you use a SingleFile-like approach to load it, properly force all references to be static and loaded, and serialize out the final quiescent DOM, then it should not be broken and it should look like you simply archived a normal IA-archived web page. (And then you might turn it back into a Gwtar, just now with a bunch of little additional IA-related snippets.) Also, note that the IA, specifically, does provide endpoints which do not include the wrapper, like APIs or, IIRC, the 'if_/' fragment. (Besides getting a clean copy to mirror, it's useful if you'd like to pop up an IA snapshot in an iframe without the header taking up a lot of space.)

reply
iainmerrick
1 month ago
[-]
I agree with the motivation and I really like the idea of a transparent format, but the first example link doesn’t work at all for me in Safari.
reply
malkia
1 month ago
[-]
Anyone else - GWAAAR! - G.W.A.R! - I guess I'm the only metal nerd here
reply
disce-pati
1 month ago
[-]
the first thing i thought too
reply
O1111OOO
1 month ago
[-]
I gave up a long time ago and started using the "Save as..." on browsers again. At the end of the day, I am interested in the actual content and not the look/feel of the page.

I find it easier to just mass delete assets I don't want from the "pageTitle_files/" directory (js, images, google-analytics.js, etc).

reply
mikae1
1 month ago
[-]
Have you tried https://addons.mozilla.org/firefox/addon/single-file/?

If you really just want the text content, you could save Markdown using something like https://addons.mozilla.org/firefox/addon/llmfeeder/.

reply
O1111OOO
1 month ago
[-]
> Have you https://addons.mozilla.org/firefox/addon/single-file/

Yes I have. I tried MAFF, MHT, SingleFile and some others over the years. MAFF was actually my go-to for many years because it was just a zip container. It felt future-proof for a long time, until it wasn't (I needed to manually extract the contents to view them once the supporting extension was gone).

I seem to recall that MHT caused me a little more of a conversion problem.

It was my concern for future-proofing that eventually led me back to "Save As..".

My first choice is "Save as..." these days because I just want easy long-term access to the content. The content is always the key and picking and choosing which asset to get rid of is fairly easy with this. Sometimes it's just all the JS/trackers/ads, etc..

If "Save as..." fails, I'll try 'Reader Mode' and attempt "Save as.." again (this works pretty well on many sites). As a last resort I'll use SingleFile (which I like too - I tested it on even DOS browsers from the previous century and it passed my testing).

A locally saved SingleFile can be loaded into FF and I can always perform a "Save As..." on it if I wanted to for some reason (eg; smaller file, js-trackers, cleaner HTML, etc).

reply
ninalanyon
1 month ago
[-]
On the subject of SingleFile there is also WebScrapBook: https://github.com/danny0838/webscrapbook

I prefer it because it can save without packing the assets into one HTML file. Then it's easy to delete or hardlink common assets.

reply
venusenvy47
1 month ago
[-]
I see that it gives three choices for saving the assets: single file, zip or folder. Is the zip version just zipping the folder?
reply
ninalanyon
1 month ago
[-]
I don't know, I've never tried it. I picked it because of the folder option which makes grepping for content easier and faster.
reply
gwern
1 month ago
[-]
I find that 'save as' horribly breaks a lot of web pages. There's no choice these days but to load pages with JS and serialize out the final quiescent DOM. I also spend a lot of time with uBlock Origin and AlwaysKillSticky and NoScript wrangling my archive snapshots into readability.
reply
TiredOfLife
1 month ago
[-]
Save as doesn't work on sites that lazy load.
reply
spankalee
1 month ago
[-]
I really don't understand why a zip file isn't a good solution here. Just because it requires "special" zip software on the server?
reply
gwern
1 month ago
[-]
> Just because is requires "special" zip software on the server?

Yes. A web browser can't just read a .zip file as a web page. (Even if a web browser decided to try to download, and decompress, and open a GUI file browser, you still just get a list of files to click.) Therefore, far from satisfying the trilemma, it just doesn't work.

And if you fix that, you still generally have a choice between either no longer being single-file or efficiency. (You can just serve a split-up HTML from a single ZIP file with some server-side software, which gets you efficiency, but now it's no longer single-file; and vice-versa. Because if it's a ZIP, how does it stop downloading and only download the parts you need?)

reply
spankalee
1 month ago
[-]
We're talking about servers here - the article specifically said that one of the requirements was no special _server_ software, and a web server almost certainly has zip (or tar) installed. These gwtar files apparently don't work without a server either.
reply
gwern
1 month ago
[-]
I'm not following your point here. Yes, a web server (probably) has access to zip/tar utilities, but so what? That doesn't automagically make a random .zip jump through hoops to achieve anything beyond 'download like a normal binary asset'. That's what a ZIP file does. Meanwhile, Gwtar works with any server out of the box: it is just an HTML file using pre-existing standardized HTTP functionality, and works even if the server declines to support range requests for some wacky reason like undocumented Cloudflare bugs, and downgrades RANGE to GET. (It just loses efficiency, but it still works, you know, in the way that a random .zip file doesn't work at all as a web page.) You can upload a Gwtar to any HTTP server or similar thing like an AWS bucket and it will at least work, zero configuration or plugins or additional executables or scripting.

Now, maybe you mean something like, 'a web server could additionally run some special CGI software or a plugin or do some fancy Lua scripting in order to munge a ZIP and split it up on the fly so as to do something like serve it to clients as a regular efficient multi-file HTML page'. Sure. I already cover that in the writeup, as we seriously considered this and got as far as writing a Lua nginx script to support special range requests. But then... it's not single-file. It's multi-file - whatever the additional special config file, script, plugin, or executable is.
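To make the single-file layout concrete, here is a toy sketch (my own illustration, not the gwtar tooling): per the article's screenshot, the head of the file is ordinary HTML and everything after a `<!-- GWTAR END` marker is tar data, so any server serves it as a normal HTML file while a loader can peel the archive back out. The helper names below are hypothetical.

```python
import io
import tarfile

MARKER = b"<!-- GWTAR END"  # marker string taken from the article's screenshot

def build_demo_gwtar() -> bytes:
    """Build a toy single file: HTML stub + marker line + a tar archive."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        payload = b"body { color: red }"
        info = tarfile.TarInfo(name="style.css")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    return b"<html>loader stub</html>\n" + MARKER + b" -->\n" + buf.getvalue()

def split_gwtar(blob: bytes):
    """Split the single file back into (html, tar_bytes)."""
    idx = blob.index(MARKER)
    html = blob[:idx]
    tar_bytes = blob[blob.index(b"\n", idx) + 1:]  # tar starts after the marker line
    return html, tar_bytes

blob = build_demo_gwtar()
html, tar_bytes = split_gwtar(blob)
with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tar:
    print(tar.getnames())  # ['style.css']
```

Because the archive sits after valid HTML, a dumb static server needs no configuration at all; the efficiency comes only from the loader using range requests to skip the parts it doesn't need.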

reply
bandie91
1 month ago
[-]
> The main header JS starts using range requests to first load the real HTML, and then it watches requests for resources; the resources have been rewritten to be deliberately broken 404 errors (requesting from localhost, to avoid polluting any server logs)

what if a web server on localhost happens to handle the request? why not request from a guaranteed inaccessible place like http://0.0.0.0/ or http://localhost:0/ (port zero)

reply
gwern
1 month ago
[-]
If those were guaranteed inaccessible, wouldn't a web browser be within its rights to optimize those away?
reply
bandie91
1 month ago
[-]
either it optimizes and doesn't try to connect, or it doesn't recognize the address as never-accessible and does try to connect – both are better than accidentally fetching something from a web service running on localhost.
reply
tefkah
1 month ago
[-]
this is really really cool, this makes archiving so much easier!

great job

reply
nullsanity
1 month ago
[-]
Gwtar seems like a good solution to a problem nobody seemed to want to fix. However, this website is... something else. It's full of inflated self-importance, overly bountiful prose, and feels like someone never learned to put in the time to write a shorter essay. Even the about page contains a description of the about page.

I don't know if anyone else gets "unemployed megalomaniacal lunatic" vibes, but I sure do.

reply
3rodents
1 month ago
[-]
gwern is a legendary blogger (although blogger feels underselling it… “publisher”?) and has earned the right to self-aggrandize about solving a problem he has a vested interest in. Maybe he’s a megalomaniac and/or unemployed and/or writing too many words but after contributing so much, he has earned it.
reply
TimorousBestie
1 month ago
[-]
I was more willing to accept gwern’s eccentricities in the past, but as we learn more about MIRI and its questionable funding sources, one wonders how much he’s tied up in it.

The Lighthaven retreat in particular was exceptionally shady, possibly even scam-adjacent; I was shocked that he participated in it.

reply
neversupervised
1 month ago
[-]
I’ve been to Lighthaven many times and it has always been great. Can you explain what you’re talking about?
reply
k33n
1 month ago
[-]
What does any of that have to do with the value of what’s presented in the article?
reply
isr
1 month ago
[-]
Wow, that's one hell of a reaction to someone's blog post introducing their new project.

It's almost as if someone charged you $$ for the privilege of reading it, and you now feel scammed, or something?

Perhaps you can request a refund. Would that help?

reply
fluidcruft
1 month ago
[-]
What's up with the non-stop knee-jerk bullshit ad hom on HN lately?
reply
Krutonium
1 month ago
[-]
We're tired, chief.
reply
esseph
1 month ago
[-]
The earth is falling out from under a lot of people, and they're trying to justify their position on the trash heap as the water level continues to rise around it. It's a scary time.
reply
TimorousBestie
1 month ago
[-]
Technically it’s only an ad hominem when you’re using the insult as a component in a fallacious argument; the parent comment is merely stating an aesthetic opinion with more force than is typically acceptable here.
reply
isr
1 month ago
[-]
I read your BRILLIANT synopsis in the tone of Sir Humphrey (the civil servant) from "Yes Minister". Fits perfectly. Take a bow, good sir ...
reply