Is stuff online worth saving?
120 points
4 days ago
| 36 comments
| rubenerd.com
| HN
JKCalhoun
13 hours ago
[-]
One hundred twenty-three years ago my great grandmother's first husband died in a hotel in Kansas City from asphyxiation from the gas having been left on overnight (the hotel did not yet have electric lighting). A letter was hastily written on a piece of hotel stationery to be delivered to his wife in the neighboring farming community where she lived.

It is fortunate for me that someone thought to hang on to that note, since I have become interested in genealogy and this was a fairly significant event in family history (had he not died I don't suppose I would be around, since it was her second marriage that gave me my grandfather).

I long for scraps of anything that my dead relatives wrote, created, etc. It connects me better to the past: the lives they lived, how they lived them. It somehow grounds me a little better ... well, it's rather hard to explain the draw of genealogy.

Sadly very little of the ephemera of everyday life was kept. I get it. It might have seemed like hanging on to junk mail — like you were a hoarder or whatever, but in this digital era we should be able to hold terabytes of what may appear to be ephemera.

I'm doing what I can – not for ego, I think, but for future generations that may find a connection to their past interesting.

reply
sangnoir
2 hours ago
[-]
> ...well, it's rather hard to explain the draw of genealogy.

I've noticed people becoming more interested in genealogy when they - let me phrase this delicately - reach a certain age. My speculation is that it is a component of grappling with one's own mortality. As the grays and wrinkles multiply, some obsess over healthy eating and exercise, some wealthier ones invest in immortality research, some get blood boys, and the rest feel an urgent need to research our genealogy; any detritus that shows our progenitors existed proves some trace of us having been here will remain, and perhaps our existence means something, as time cruelly keeps marching on.

reply
willis936
12 hours ago
[-]
30 years ago there was no digital world. Nearly all information was in physical artifacts. The things worth saving haven't really changed, but the amount of noise they are buried in has. Imagine if that letter was kept in a two ton pile of ad fliers. Sure, someone would find some of those fliers interesting, but you'd have been much less likely to even know about the letter.
reply
jonhohle
11 hours ago
[-]
An aside about ad spam from companies that I occasionally buy from:

Often ad spam comes from the same mailbox as order receipts and includes words like “order”, while messages with receipts never include the word “receipt”. When you are inundated with ad spam from the same company daily, or sometimes multiple times a day, it becomes very difficult to filter for only the non-receipts in order to clean a neglected inbox.

After I’m gone, I fully expect my family just to delete it all because the signal-to-noise ratio is so low.

reply
sdenton4
11 hours ago
[-]
Sorting through twenty years of spammy email seems like one of those things an LLM would actually be good for.
reply
be_erik
8 hours ago
[-]
Some might say that years of spammy emails drove the creation of the LLMs we know today. It's easy to forget how fast some things have moved: https://en.wikipedia.org/wiki/Naive_Bayes_spam_filtering
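
For anyone curious what the linked technique looks like, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, applied to the receipts-versus-promos problem from upthread. The class name, the labels and the four training messages are all invented for illustration; a real filter would need far more training data and better tokenization.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a message and split it into whitespace-separated tokens."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training messages
        self.vocab = set()           # every word seen during training

    def train(self, text, label):
        self.doc_counts[label] += 1
        counts = self.word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, counts in self.word_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.doc_counts[label] / total_docs)
            denom = sum(counts.values()) + len(self.vocab)
            for word in tokenize(text):
                score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data, made up for this example only.
clf = NaiveBayes()
clf.train("your order has shipped tracking number invoice total", "receipt")
clf.train("payment confirmation invoice total charged to card", "receipt")
clf.train("order now huge sale 50% off limited time", "promo")
clf.train("new arrivals sale order today free shipping deal", "promo")

label = clf.classify("invoice total for your payment")
```

Note that the word "order" appears in both classes here, which is exactly the situation described upthread: a single keyword rule fails, but the combined evidence of the other words can still separate the two.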
reply
justsomehnguy
10 hours ago
[-]
I don't have anyone to do anything after I'm gone, so I just delete the emails myself. I do keep the notable ones, like registration information and some payment receipts but otherwise everything goes to the trash.

Bonus points:

I don't need a 30/50/100 GB mailbox (and the associated mailbox cost nowadays).

Search is not only fast, but if I don't find something, then it simply isn't in the mailbox.

It's mentally pleasurable to log in once in a while and throw a bunch of unneeded stuff into the trash bin, quite similar to a real life room cleaning.

reply
ghaff
6 hours ago
[-]
Fortunately Gmail tabs go a lot of the way to letting you mass delete junk you don’t care about. Assuming you do even a modicum of labeling stuff you might like to refer to or act on, deleting at least older promotions and updates eliminates a lot of things.
reply
justsomehnguy
2 hours ago
[-]
I haven't used Gmail for years, but the labels were not quite up to the task.

Thankfully the FastMail interface makes 'search from this address' and 'search to this address' (I'm using per-service addresses) and then 'select all', 'delete' actions a breeze.

reply
alex_young
6 hours ago
[-]
Well, I remember a lot of great stuff on Usenet circa 1994, but it looks like Google shut down access to it via Google Groups, which used to archive it in a searchable way.

There was a ton of great stuff 30 years ago, and I think it's definitely worth saving.

The Internet was a very different place, but it was quite real 30 years ago, and I think the right way of looking at it is that the further back you go, the more valuable this kind of thing is.

reply
palmfacehn
12 hours ago
[-]
>...a two ton pile of ad fliers

Alamy is selling scans of ad prints from the 1850s.

https://www.alamy.com/stock-photo/1850s-advert.html

reply
zamadatix
11 hours ago
[-]
A selection of 74 items over a 10 year period is a different proposal compared to e.g. keeping two tons of ad fliers from November 17th 1907 (and every other thing, every other day, all the time).
reply
chefandy
11 hours ago
[-]
Ads range from a (necessary, in a capitalist society) nuisance to a scourge, and people justly put up increasingly thick boundaries to shield themselves from their influence. When waning cultural relevance or whatever dilutes that influence, you can more easily see the ads for what they are: often manipulative marketing tactics implemented through often genuinely beautiful art and design. Both aspects are fascinating to consider and the art can be quite enjoyable. Early modernist posters from Paris are beautiful. Watching collections of mid-century television ads in the Prelinger Archives is fun, and tells us a lot about the ways we are influenced by modern ads speaking to current perspectives, fashions, and concerns.
reply
ANewFormation
8 hours ago
[-]
Capitalism would work 100% fine without ads because people naturally compare and contrast options when buying stuff.

All that's necessary is making it possible for people seeking out your type of product to find you. And for revolutionary products, there's word of mouth.

If anything I think capitalism would function better without ads, because I would argue that advertising overall results in less informed customers, especially the modern lifestyle/brand type of advertising that's clearly quite effective at manipulating people.

reply
janalsncm
5 hours ago
[-]
It’s an interesting question I guess (and slightly worrying that I can more easily imagine the end of the world than the end of advertising). Especially if we take it to the extreme and imagine sponsored listings also don’t exist. I guess incumbents would have a big advantage.

There are second order effects of ads that we’d need to consider. Facebook and Google wouldn’t exist as we know them. Maybe that means some of their research doesn’t happen?

reply
chefandy
5 hours ago
[-]
If there were no ads, how would people know that products existed? Would they just see the products on store shelves? What about services? Would labels be ads? Would how stores merchandise things be advertisements? Could businesses negotiate for specific product placement? How would you find out about stores? Would store signs be ads? How about really big ones? How about at the edge of their property along a road or highway? Could the sign say what the store sold? If you were to start a product guide to help people find what they need, how could you possibly afford to buy enough products to be useful and up-to-date enough while slow-crawl word of mouth got the business off the ground? Would asking people to tell their friends be an ad? If not, could you pay someone to spread the word about your product? Would traveling sales reps be ads? What if they wore head-to-toe logo gear? Could you just pay people to do that without selling things? Ads suck but I don’t see how a capitalist society could survive without them.
reply
janalsncm
5 hours ago
[-]
I think the definition would have to be an exchange of something of value for telling other people about a product. There are some companies that got off the ground with no paid advertising but I think they’re an exception. Generally people are not seeking out new products.
reply
chefandy
5 hours ago
[-]
But the whole point of a capitalist society is that competitors that do things better/cheaper start taking customers so the capital moves to the best and most efficient system.
reply
chgs
11 hours ago
[-]
Because they are rare
reply
chefandy
11 hours ago
[-]
I don’t think that’s true? Tons of stuff from that era had been digitized, even before newer more relevant stuff and older rarer stuff, because the acid paper had a short shelf life and there were so many ads in printed stuff then. I might have a skewed perspective from working in the digitization world for quite some time. I think they’re selling what they sell with all their other content— discovery, curation, preparation, and easy delivery.
reply
harrall
8 hours ago
[-]
It’s not like you currently go to a webpage and save all the images onto deep storage for archival… I’m not sure what relevance things being digital has on identifying noise.

If the ancestor before you is hoarding anything that comes across their path, be it digital ads or every physical greeting card they’ve ever gotten, the problem is with the person’s collection habits, not the medium.

reply
qwertox
12 hours ago
[-]
What about robots reading each flier and checking if something is odd about that particular one? It could find the letter and report it to you. Even easier if it was all digital information.
reply
bongodongobob
11 hours ago
[-]
If only we had search algorithms...
reply
eesmith
11 hours ago
[-]
A two-ton pile of ad fliers? Sounds like Ted Nelson's Junk Mail collection, https://archive.org/details/tednelsonjunkmail .
reply
waltbosz
8 hours ago
[-]
This reminds me of a recent flea market experience. There at some stand was boxes of old used post cards and 100 year old family photos. Photos of people posed on a porch in their Sunday best. Or just mundanely standing around a car not everyone looking at the camera.

It's hard to assign a value to these things. They are simultaneously junk and treasure. I think about the journey these items took to find their way to that flea market table. It was too diverse a collection to have come from one place. So I imagine all the paths each individual item traversed. The joy of the recipient reading a post card, holding on to it, rediscovering it on spring cleaning days. Or the photo living in an album or framed on a wall somewhere for a lifetime.

I'm not sure what the value of it all is if it just gets lugged around to various flea markets and sold piecemeal for $1 each.

reply
EvanAnderson
1 hour ago
[-]
> There at some stand was boxes of old used post cards and 100 year old family photos. Photos of people posed on a porch in their Sunday best. Or just mundanely standing around a car not everyone looking at the camera.

> I'm not sure what the value of it all is if it just gets lugged around to various flea markets and sold piecemeal for $1 each.

I purchase, scan, and resell those kinds of things. I'd love to have a centralized, public repository in which to store the data. As our tech gets better at extracting data from that material more and more interesting applications could be developed. Imagine being able to find 100+ year old photos of your ancestors via facial recognition and extracted metadata searches.

I wish I could come up with a non-profit business model that worked for preserving that kind of stuff. I would love to gather up the historical ephemera that's being lost, catalog it via manual and automated processes, and make it available to the public. (Yes, I am aware there are privacy concerns. It's a pie-in-the-sky idea. I just hate to see all of the previously captured and curated effort that went into ephemera cast to the winds.)

reply
wslh
3 hours ago
[-]
Regarding genealogy, it is worth looking at the work The Church of Jesus Christ of Latter-day Saints has been doing to help genealogical researchers around the globe [1], well beyond that specific church.

[1] https://newsroom.churchofjesuschrist.org/topic/genealogy

reply
immibis
6 hours ago
[-]
Now it's easier to save stuff, but there's more stuff to save. YouTubes and TikToks instead of text notes. Chat messages instead of letters.
reply
kerkeslager
10 hours ago
[-]
Sure, there are a ton of reasons to archive. And if it's cheap to do (in terms of money, yes, but also in terms of time, effort, mental health, etc.) then I am of the mind that we should archive everything.

But, it often isn't cheap to do, and in that case, it makes sense to prioritize. The high priority items for me are the things that I might want to share, the ideas I want to amplify for my contemporaries and future generations that might examine my life. Stuff like [1] [2] and [3] which has influenced my thinking fundamentally, that I hope to build upon so that others can build upon what I have built.

I'd argue that you do this intuitively: you're mentioning a letter from your family's past because it is a high priority item--it's relevant because it was the last written words of your great-grandmother's first husband.

But, there's a lot that isn't worth keeping. My first form of archiving as a teenager was keeping ticket stubs for movies and concerts--a decade later I was going through my pile and found that I didn't even remember most of them. The better movies, I remembered--and I had them on DVD. The better concerts, I remembered--and I also had journal entries and CDs to remember the experience and the music. It's not important to me where/when I saw Everything Everywhere All at Once in theaters, but I have it on DVD and I can't wait to show it to my niece when she's older. And sure, I saw Amigo the Devil live, but frankly, he's not an artist you need to see in concert--the greatest impact of Cocaine and Abel[4] on me was when I listened to it alone in my room. The ticket stubs simply don't matter to me.

[1] https://www.viridiandesign.org/notes/451-500/the_last_viridi...

[2] https://www.ted.com/talks/brene_brown_the_power_of_vulnerabi...

[3] https://digital.wpi.edu/pdfviewer/wm117p10z

[4] https://www.youtube.com/watch?v=ZzjtLm0G49E

EDIT: All the things linked above, I have backed up in one form or another. Notably, the Schutt paper isn't at its original URL.

reply
karmonhardan
1 hour ago
[-]
It's funny you mention ticket stubs, because I also have a similar collection, and I kind of treasure it. From before Google was tracking my every step, before Twitter, I have some record, as the years go by, of what I was doing at exceedingly specific times and dates. It helps to structure my memories a bit more than I'd otherwise be able to. I scanned them all at once (in several pages), and it's sort of a map of my adolescence. I can jump across time. I would be sad to lose it. (Along with the photo of the tickets for my makeshift - and first - double feature of Everything Everywhere/Dr. Strange. Multiverse-themed, doncha know?)
reply
zdc1
8 hours ago
[-]
These days whenever I read an interesting article, I will take 2 minutes to copy and paste it into my Obsidian vault under my Articles folder. I'll take care to paste the images as images (and not links) and make sure I've got the author and source URL at the top, and have my separate notes section link to it. It's a bit silly and obsessive, but given how transient content on the Internet is, I think it's necessary to make a copy of anything you care about.
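
The manual routine above (author and source URL at the top, body underneath) could be approximated with a small helper. This is only a sketch under stated assumptions: the function names, the front matter keys and the file naming rule are invented here and are not anything Obsidian itself requires, and titles containing double quotes would need proper YAML escaping.

```python
from datetime import date


def build_article_note(title, author, source_url, body_markdown, saved_on=None):
    """Return a Markdown note with a small YAML front matter header.

    The keys (title, author, source, saved) are hypothetical choices.
    """
    saved_on = saved_on or date.today().isoformat()
    header = [
        "---",
        f'title: "{title}"',
        f'author: "{author}"',
        f"source: {source_url}",
        f"saved: {saved_on}",
        "---",
    ]
    return "\n".join(header) + "\n\n" + body_markdown.strip() + "\n"


def note_filename(title):
    """Make a filesystem-friendly file name from an article title."""
    safe = "".join(c if c.isalnum() or c in " -_" else "" for c in title)
    return safe.strip().replace(" ", "-") + ".md"


# Hypothetical article details, for illustration only.
note = build_article_note(
    title="Is stuff online worth saving?",
    author="Example Author",
    source_url="https://example.com/post",
    body_markdown="Pasted article text goes here.",
    saved_on="2024-11-20",
)
filename = note_filename("Is stuff online worth saving?")
```

Images would still have to be copied in separately; this only covers the text and the provenance header.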
reply
Modified3019
7 hours ago
[-]
I use https://github.com/gildas-lormeau/SingleFile

I set it to tolerate longer processing times, and to open the file after saving so I can sanity check that it got everything. Works great at faithfully saving a page with images as it appears in browser, and saves so much time.

You might also have a look at https://github.com/ArchiveBox/ArchiveBox

reply
Modified3019
7 hours ago
[-]
Also, I believe by default the files are saved as plain html (with resources being base64 encoded), so search tools which can index the contents of html files will work.

There is also the option to have the contents compressed, and (a separate option) to keep the plaintext of the file uncompressed, which will likewise still allow indexing to work while saving space.
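
To illustrate why plain HTML output stays indexable, here is a minimal sketch of pulling the visible text out of a saved page using only Python's standard library. The class and function names are made up for this example, and it says nothing about how any particular search tool actually indexes files. Base64 image data lives in attributes, so it never reaches the extracted text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the page's visible text as one space-separated string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Tiny stand-in for a saved page with an embedded base64 image.
sample = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<p>Hello <b>world</b></p>"
    '<img src="data:image/png;base64,AAAA">'
    "<script>var x=1;</script></body></html>"
)
text = extract_text(sample)
```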

reply
kepano
7 hours ago
[-]
I built Obsidian Web Clipper to automate that process. It also allows you to save web pages as nicely formatted Markdown files with YAML properties even if you don't use Obsidian.

https://github.com/obsidianmd/obsidian-clipper

reply
tempestn
8 hours ago
[-]
I noticed a web clipper was just released for Obsidian last month. Maybe that'd cut down those two minutes for you.
reply
dSebastien
7 hours ago
[-]
Yes! The Obsidian Web Clipper is pretty neat. I just published an article about it: https://www.dsebastien.net/supercharge-your-knowledge-captur...
reply
yazantapuz
6 hours ago
[-]
I am using monolith to just save the whole page to disk.

https://github.com/Y2Z/monolith

reply
ironyman
7 hours ago
[-]
I do something similar but with Discord. I made a server accessible only by me, and I have a few different channels like work, life, music, ideas, etc. I also send all screenshots I take into a separate channel, and set up a chrome extension that sends whatever page I'm on as a link.
reply
jay_kyburz
7 hours ago
[-]
What if Discord goes away? I would think you'd want the data local.
reply
efilife
7 hours ago
[-]
terrible idea. people get their Discord accounts banned randomly without warning
reply
accrual
3 hours ago
[-]
Unfortunately it's not super easy to get data out of Discord either. Last I checked, one needs to carefully set up a bot then script the bot to download messages to CSV, etc., but if you're not careful with the account and bot setup, the export process itself could lead to a ban.
reply
immibis
6 hours ago
[-]
like recently they banned the entire country of Germany by accident
reply
gameshot911
3 hours ago
[-]
How often do you reference your vault?
reply
Feathercrown
8 hours ago
[-]
Agreed. I think you could automate some of that too, which could save time if you do it often.
reply
brainzap
5 hours ago
[-]
For the lazy: I think the web archive format that Safari exports is standardised and gives you a good website backup.
reply
jeofken
8 hours ago
[-]
In my day browsers could save an archive of a page

Is this still the case?

reply
Macha
8 hours ago
[-]
They can, but generally that includes any JavaScript on the same page, which sometimes does funny stuff when you open it up offline or after the remote server goes away.
reply
compootr
7 hours ago
[-]
SingleFile can make a snapshot with just content/styling
reply
accrual
3 hours ago
[-]
It's not perfect, but Edge will let one take a simple full page screenshot with Ctrl+Shift+S. It results in a hefty PNG but at least it's a visual copy of everything which might suffice for a certain set of purposes (e.g. links will be lost, so it's not good for that).

I can still right-click > Save any page as .html, but that doesn't guarantee server streamed stuff, media, images, etc. will be preserved correctly.

reply
800xl
3 hours ago
[-]
Thank you for this! I pressed Ctrl+Shift+S in Firefox just to see if it would work and it has the same functionality.
reply
smitelli
12 hours ago
[-]
> I got a picture of my great grandfather, thing took six hours to take your picture. [...] Every guy had one picture back then. And it's just him like, "[grimacing] I gotta get back, feed them hogs!" Now, in the future of course it'll be different. 50 years from now, people will be going like, "Hey! You wanna see a hundred thousand pictures of my great grandfather? I got 'em right here plus everything he did every day of his life." --Norm Macdonald[1]

There is certainly a quantity of stuff online that is absolutely worth saving, but there's a considerably larger proportion that's just redundant to the point of being unremarkable and pointless. The trick is filtering, which can be capital-H Hard. That's why some may want to err on the side of over-collecting to reduce the possibility of missing something that will actually be important someday.

[1]: https://www.youtube.com/watch?v=sY6SjMITHrQ

reply
diggan
10 hours ago
[-]
Yeah, this is a good point. Isn't it better we save too much, as tooling for filtering stuff out will always get better, rather than saving too little? The latter has no workaround (today at least).
reply
nytesky
12 hours ago
[-]
Another funny take from Macfarlan

Definitely no smiling:

https://youtu.be/8SslNMLO0tw

reply
don-code
9 hours ago
[-]
I DVR the nightly news with NextPVR, more as a convenience in case I'm doing something when it's on, want to pause/rewind, want to watch it the next morning instead, etc.

Come 2020, I was convinced that the world was going to end. So I simply... turned off the retention rule. One hour of news is around 5GB, but that's a very-high-bitrate MPEG-2 stream with an extra audio channel in Spanish. So I instead wrote a cron job to take that week's news, drop the stuff I don't care about, and H.264 the entire set of them down to 4.7GB, then burn them to a DVD for offline storage, since there's not much value to keeping them online.
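
As a rough back-of-the-envelope check on those numbers, the sketch below works out the source bitrate and the video bitrate needed to fit a week on one disc. Several inputs are assumptions not stated above: seven one-hour broadcasts per week, a 128 kbit/s audio track, 2% container and filesystem overhead, and decimal gigabytes.

```python
def bitrate_kbps(size_bytes, duration_seconds):
    """Average bitrate in kilobits per second for a given size and duration."""
    return size_bytes * 8 / duration_seconds / 1000


def video_bitrate_for_disc(disc_bytes, total_seconds, audio_kbps, overhead=0.02):
    """Video bitrate (kbit/s) that lets total_seconds of footage fit on the disc.

    overhead is the fraction of the disc reserved for container and
    filesystem structures (an assumed value, not a measured one).
    """
    usable = disc_bytes * (1 - overhead)
    return bitrate_kbps(usable, total_seconds) - audio_kbps


HOUR = 3600

# About 5 GB per hour of MPEG-2, as described above.
source_kbps = bitrate_kbps(5e9, HOUR)

target_kbps = video_bitrate_for_disc(
    disc_bytes=4.7e9,        # single-layer DVD, decimal gigabytes
    total_seconds=7 * HOUR,  # ASSUMPTION: seven one-hour broadcasts
    audio_kbps=128,          # ASSUMPTION: one 128 kbit/s audio track
)
```

Under these assumptions the source runs at roughly 11 Mbit/s and the target video bitrate comes out around 1.3 Mbit/s, which is the kind of figure one might hand to an encoder's target bitrate option (for ffmpeg, `-b:v`). Dropping segments before encoding would raise the available bitrate accordingly.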

By 2022, it was obvious the world was not, in fact, ending, but I never stopped this practice because of how simple it was, and how unobtrusive to store they are. I just make sure a fresh DVD is in the NAS every week, and put the DVDs on a spindle - they collectively take up about as much room as a toaster. I could make that even smaller and simpler if I opted for a portable hard drive.

Occasionally I'll manually toss something interesting in, like the presidential debates, or special coverage of some newsworthy event.

In 20 years, when it comes time to re-burn the earliest of them, maybe I'll make a value judgment on whether that's worth it, but for now it feels like I'd be losing something for not much of a good reason.

reply
accrual
3 hours ago
[-]
Reminded me of the story of Marion Marguerite Stokes who recorded TV news from 1977 to her passing in 2012.

https://blog.archive.org/2013/11/22/a-dream-to-preserve-tv-n...

reply
zelon88
6 hours ago
[-]
I was thinking the other day about the longevity of useless data. One idea that floated around in my head was self expiring emails.

I recently deleted about 40,000 emails. Most of them were identical, duplicate marketing emails. I was forced to do this to free up storage.

That's when I realized something. I am paying my email provider the full price for every byte of "represented" data. In reality, their distributed file systems could compress an arbitrary number of copies of these emails and only consume the amount of space that one email consumes. So 100,000 duplicate emails on the server are consolidated into one representation of the data, but each customer has to pay for each byte that is represented.

The vendor stores a file once and charges full price every time they reproduce it for someone. If you have 10,000 copies of a file they only have to store it once but you will pay for every byte in all 10,000 copies.
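
To make the "represented" versus stored distinction concrete, here is a toy content-addressed store. It is a sketch of the general deduplication idea only, not a claim about how any real mail provider works; the class and method names are invented. In practice delivered messages usually carry per-recipient headers, so a real system would more likely have to deduplicate bodies or attachments than whole messages.

```python
import hashlib


class DedupStore:
    """Keep each distinct blob once, keyed by its SHA-256 digest."""

    def __init__(self):
        self.blobs = {}      # digest -> bytes, stored a single time
        self.mailboxes = {}  # user -> list of digests (references only)

    def deliver(self, user, message: bytes):
        digest = hashlib.sha256(message).hexdigest()
        self.blobs.setdefault(digest, message)
        self.mailboxes.setdefault(user, []).append(digest)

    def logical_bytes(self, user):
        """Bytes the user appears to hold, i.e. what a quota would count."""
        return sum(len(self.blobs[d]) for d in self.mailboxes.get(user, []))

    def physical_bytes(self):
        """Bytes actually kept across all users."""
        return sum(len(blob) for blob in self.blobs.values())


# Three hypothetical users each receive the same 600-byte ad ten times.
store = DedupStore()
ad = b"SALE! " * 100
for user in ("alice", "bob", "carol"):
    for _ in range(10):
        store.deliver(user, ad)
```

Each mailbox here counts 6,000 logical bytes against its owner while the store holds 600 physical bytes in total.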

reply
Scoundreller
3 hours ago
[-]
There were some early blog posts by the single person running mailinator.

Since they only stored text, they would make a single db entry for each unique line of text that came in and just made more and more references to that.

Even different emails… were mostly the same.
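
A rough sketch of the line-level scheme as described in this comment: each unique line is stored once and every message becomes a list of references. This is written from the description above, not from Mailinator's actual code, and the names are invented.

```python
class LineStore:
    """Intern each unique line once; messages are lists of line ids."""

    def __init__(self):
        self.line_ids = {}  # line text -> id
        self.lines = []     # id -> line text
        self.messages = []  # message index -> list of line ids

    def add(self, text):
        """Store a message and return its index."""
        ids = []
        for line in text.splitlines():
            if line not in self.line_ids:
                self.line_ids[line] = len(self.lines)
                self.lines.append(line)
            ids.append(self.line_ids[line])
        self.messages.append(ids)
        return len(self.messages) - 1

    def get(self, message_id):
        """Rebuild a message from its line references."""
        return "\n".join(self.lines[i] for i in self.messages[message_id])


# Two hypothetical emails that differ only in the greeting.
line_store = LineStore()
first = line_store.add("Hi Alice\nBig sale today\nUnsubscribe here")
second = line_store.add("Hi Bob\nBig sale today\nUnsubscribe here")
```

Six lines arrive but only four are stored, which matches the observation that even different emails are mostly the same.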

reply
password4321
5 hours ago
[-]
This is the Dropbox business model, especially when they encourage using their service to share files and it counts as space used in source and destination accounts.
reply
m463
1 hour ago
[-]
I think people don't get another perspective on this until someone dies.

Mostly it just goes away at death.

It might be interesting to read:

https://en.wikipedia.org/wiki/Digital_hoarding

I have trouble letting go of things, and I found it interesting to read through.

There's a part of me that thinks "It would be so useful to 100% automatically log and cache everything I do and be able to search it". But I think maybe being healthy means not doing it.

reply
montebicyclelo
9 hours ago
[-]
One approach to this is the SingleFile browser plugin [1], configured to save pages to a GitHub repository - it saves the whole web page as a single HTML file in the repo. (Ok it's probably closer to archiving than bookmarking... but it's not too far off)

[1] https://github.com/gildas-lormeau/SingleFile

reply
thefaux
9 hours ago
[-]
There are many things in life that have immense personal value and zero value to nearly everyone else. This creates a lot of misunderstanding and incentive misalignment.
reply
ozim
8 hours ago
[-]
Sounds about the same as what I was going to write.

Most likely it is not worth it. But people should not be doing only things that are “worth doing”. Then again, if something brought you joy but was a complete waste of time - it was worth it.

I hate the dementors who tell you otherwise; it is a limited lifetime, but it is yours. You should be helpful to others, but doing only “what is worth it” sucks the beauty out of existence.

reply
zimpenfish
8 hours ago
[-]
> zero value to nearly everyone else

Well, except future historians who may find value in "personal" information (although I guess we've got such a surfeit of recorded "personal" information these days compared to even just 50 years ago, it may not be quite as useful as when they find, e.g., some Babylonian tablet with a shopping list on. But you never know!)

reply
iamwil
8 hours ago
[-]
Yes. Sometimes when I'm doing research into the recent history of why certain technical decisions were made, and the arguments for or against, I find archive.org invaluable for piecing together a line of thought from back then. Recently, this was to look up what the debate between React's Functional components vs Signals was.

Also, it's helpful to get perspective on the attitudes for or against a new technology in recent history. I remembered there were people that said "If you aren't writing a kernel, you don't have their problems, so you don't need git." Turns out that's not true. Now that git is everywhere, it's harder to remember whether or even if there was pushback against it.

This was written about the insights from using git that he needed to highlight to people back then. https://keithp.com/blog/Repository_Formats_Matter/

I often reference it, and if it wasn't still up, I'd have only web archive to rely on.

So for me, lots of the stuff I look at online (mainly blog posts) is worth saving. Sometimes, if the discussion is in a Twitter thread, that too. Which makes me fear for the day Microsoft decides to do GitHub in, and we lose all the issues and comments.

reply
pabs3
4 days ago
[-]
If you're interested in that sort of thing, come hang out with ArchiveTeam:

https://wiki.archiveteam.org/

reply
wintermutestwin
8 hours ago
[-]
There is a wealth of live performances on YouTube that individuals have uploaded and that likely violate MPAA copyright crap.

IMO, this content is of high cultural value and I fear it won’t be long before the goog stops suffering us to watch “their” content without infecting it with ads.

I wish there was an easier way to self host this content with a way to organize and browse using tags.

reply
wwweston
8 hours ago
[-]
5 years ago I was working on a semi-novel crowdfunding platform that relied on video presentations. First iteration we used the YouTube API because hosting our own video seemed daunting and that worked fine for a bit. Over time we started to run into limits/errors/interruptions/audits at inconvenient times until one weekend I was like “screw it, let’s find out what the problems of self transcoding and hosting are.” Spent some time learning to use ffmpeg and throwing the results on our static resource pile. Tagging was a fairly straightforward lift. Honestly worked better than I’d have anticipated and was much less hassle. I’m sure we would have hit the problems if we’d reached a critical scaling point buuut that didn’t happen inside our year or so of having clients.
reply
tofof
9 hours ago
[-]
I've recently come back to a PC game (B-17 the Mighty Eighth) from 2000 that, quite unexpectedly, is getting a remaster and potentially a port to VR. It had a thriving community for several years, with many mods and guides and knowledge contained in the single dominant forum (bombs-away.net). When it shuttered, the vast majority of that information was lost. Old workarounds for bugs in the engine and detailed instructions of exactly how certain mechanics work are unavailable. One popular YouTuber who continued playing through at least 2010 maintained a Dropbox that had most of the mods that were ever available, but not the forum posts explaining them. So, for example, there's a mod that survives there to let you replace a generic 'sign on the dotted line' handwriting with your own - but gone are the instructions of exactly how to apply it.

When I had returned to the game after bombs-away.net had gone defunct, I posted my own personal archive to the GoG forum for the game. Now that I've returned to the Redux version I find my own files, with my personal notes, shared by a single other soul who had similarly maintained an archive, and apparently had collected mine at some point. I'm very glad to have helped preserve knowledge - but not everything of mine was there. Now that I've noticed the 2024 remaster effort and joined that community, I've been able to share files that were otherwise apparently completely lost - in particular, a set of images showing dimensions of certain common features in bombing targets, that allow estimating the total size of the target.

Unfortunately, my own personal archive included many forum topics that I just dragged off shortcuts to. I can see the old titles of the pages from the surviving shortcut files. I remember the questions I had (and now have again) that those shortcuts held the answers to. But because I didn't save the page itself, it's.. gone. That's immensely frustrating.
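
For anyone in the same spot with a folder of surviving shortcuts, a sketch like the one below could pull the URLs back out and build Wayback Machine lookups for them. It assumes the shortcuts are Windows-style `.url` files (INI text with an `[InternetShortcut]` section), which may not match what was actually saved here; the function names and the example topic URL are hypothetical. The query is built against the Internet Archive's availability endpoint but not sent.

```python
import configparser
from urllib.parse import urlencode


def url_from_shortcut(text):
    """Return the URL stored in the text of a .url shortcut, or None."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    if parser.has_option("InternetShortcut", "URL"):
        return parser.get("InternetShortcut", "URL")
    return None


def wayback_lookup_url(url, timestamp=None):
    """Build a query URL for the Wayback Machine availability endpoint."""
    params = {"url": url}
    if timestamp:
        # YYYYMMDD; the service reports the snapshot closest to this date.
        params["timestamp"] = timestamp
    return "https://archive.org/wayback/available?" + urlencode(params)


# Hypothetical shortcut contents, for illustration only.
shortcut_text = "[InternetShortcut]\nURL=http://example.com/forum/topic?id=42\n"
saved_url = url_from_shortcut(shortcut_text)
lookup = wayback_lookup_url(saved_url, timestamp="20100101")
```

Whether a snapshot exists is another matter: a forum that was never crawled will come back empty no matter how the query is phrased.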

Yes, things are worth saving. Especially for topics with extensive information among a small niche audience that have a single point of failure. I've found an extension (SingleFileZ) that does a good job of archiving a web page with all embedded content into what's a zip file under the hood - so futureproof even if the extension disappears and it becomes difficult to simply open the file directly in browser.

EDIT - montebicyclelo mentions SingleFile, which apparently is a continuation of SingleFileZ, with new features. SingleFileZ already allowed automatically saving every visited page in a tab (or even among all tabs), batch archiving of a list of urls, etc, so presumably SingleFile has all these capabilities and more.

reply
jll29
7 hours ago
[-]
Any information created by humans is part of our "culture". You may consider it of no value, but someone else may beg to differ.

I went to a fantastic talk a few years ago at the British Library about digitizing a substantial quantity of historic Australian newspapers. It was amazing to be able to read funeral announcements, product advertisements and other signals from the past showing us Australian culture from the 1800s.

Since we leave much less behind in terms of physical assets (personal letters, postcards, personal diaries), we should at least aspire to archive more from the digital realm, or to future historians we'd look like a blank century.

reply
stared
11 hours ago
[-]
I often find myself revisiting old posts and stories. As with any human artifacts, most things aren't worth revisiting or are only meaningful in the moment. If they're gone, few people miss them.

I'm a link hoarder myself (over 13k links on Pinboard: https://pinboard.in/u:pmigdal/). While I don't revisit most of them, some have proven invaluable for re-reading and sharing. I'm not sure about the typical half-life of internet content, but a lot disappears—whether because people stop paying for domains, official websites get reorganized (or their content removed), or other reasons.

This is where the Internet Archive steps in, doing the essential work of a digital librarian. I often share links from its Wayback Machine, which has been a link-saver more times than I can count.

reply
nilamo
12 hours ago
[-]
Personally, I like that the internet is ephemeral. It matches real life in that way. I would rather see the internet as a means of connecting people over large distances (across space, Mars, etc); maintaining 20,000 copies of every irrelevant thing is just silly.
reply
lxgr
11 hours ago
[-]
The problem is that not everything it has replaced was originally ephemeral.

In a way, the Internet is both too ephemeral (self-hosted blogs disappear, Youtube videos get taken down) and too persistent at the same time; I don't think that most Twitter posts of non-public figures would need to remain public forever by default, for example, and I don't think I need to mention various data breaches.

The Internet Archive somewhat mitigates the first issue, but it makes me pretty nervous that there's essentially just one organization doing what used to be much more distributed to various physical libraries.

For the second one, I hope we'll see better solutions (both technical and social) as the technology and our interactions with it mature.

reply
qwertox
12 hours ago
[-]
> Personally, I like that the internet is ephemeral.

It is not. It is only for us normal people. But the companies which log our lives in order to then capitalize on it, for them the internet is not ephemeral. They have copies of videos, pages, podcasts, whatever it is that can be found there.

Why would you want those companies to know more about yourself than you do?

reply
zamadatix
11 hours ago
[-]
Archive.org or Google can cache more of the internet than I do while still having the majority of the content be ephemeral.

I'd also hazard a guess that most people in this camp would want these companies to also not store these things, the same as they don't want people to.

reply
Barrin92
6 hours ago
[-]
>Why would you want those companies to know more about yourself than you do?

That's not a question of wants, companies will always know more about you than you, for the simple reason that even if you had all their data you have no means to extract any meaning from it. It requires immense organization and resources, increasingly so as the rate of data production increases.

For that reason the correct response isn't to engage in the same hoarding and privacy abuse of the companies, it's like bringing a knife to a tank fight, but to 1. make sure you don't produce that data to begin with through privacy protections and technical means and 2. create environments in which you have ownership of your data, instead of businesses.

reply
krick
9 hours ago
[-]
How do you back up websites? I mean, it sounds trivial, but I kinda still haven't figured out what is the way. I sometimes think that I'd like some script to automatically make a copy of every webpage I ever link in my notes (it really happens quite often that a blog I linked some years ago is no more), and maybe even replace links to that mirror of my own, but all websites I've actually backed up by now are either "old-web" that are trivial to mirror, or basically required some custom grabber to be written by myself. If you just want to copy a webpage, often it either has some broken CSS&JS, missing images, because it was "too shallow", or otherwise it is too deep and has a ton of tiny unnecessary files that are honestly just quite painful to keep on your filesystem as it grows. Add to that Cloudflare, Captchas, ads (that I don't see when browsing with ublock and ideally wouldn't want them in my mirrored sites as well), cookie warning splash-screens, all sorts of really simple (but still above wget's paygrade) anti-scraping measures, you get the idea.

Is there something that "just works"?
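
For the "script that copies every page linked in my notes" part, a minimal sketch using only the Python standard library might look like the following. The function names and the hash-based file naming are assumptions made for illustration, not an existing tool. It stores only the raw response body, so it does nothing about the missing CSS/JS, Captchas, or anti-scraping measures described above.

```python
import hashlib
import re
import urllib.request
from pathlib import Path

# Rough URL pattern: stops at whitespace, quotes, angle brackets, ")" and "]".
URL_RE = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text):
    """Return the distinct http(s) URLs in text, in order of first appearance."""
    seen = []
    for match in URL_RE.findall(text):
        url = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
        if url not in seen:
            seen.append(url)
    return seen


def archive_path(url, root):
    """Map a URL to a stable local file name derived from its SHA-256 hash."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
    return Path(root) / f"{digest}.html"


def save_copy(url, root, timeout=30):
    """Fetch url once and store the raw response body; skip if already saved."""
    target = archive_path(url, root)
    if target.exists():
        return target
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        target.write_bytes(response.read())
    return target
```

Calling `save_copy(url, "mirror")` for each result of `extract_urls(notes_text)` would build the mirror; the stable file name also gives a predictable target if the links in the notes are later rewritten to point at the local copy.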

reply
wis
8 hours ago
[-]
For saving a webpage you have open, I use a browser extension called SingleFile, I've been using it for a while (IIRC I discovered it on HN's front page a few years ago), in my experience it "just works", works really well.

You click the "browser action" icon/button of the extension and it saves a single HTML file that looks exactly like the webpage you have open.

From its FAQ[1] on GitHub:

  # What does SingleFile do?
  SingleFile is a browser extension designed to help users save web pages as complete, self-contained files. The extension's primary function is to capture an entire web page, including its HTML, CSS, JavaScript, images, and other resources, and package them into a single HTML file.

  # I am a web archivist, is it ok to use SingleFile to archive content?
  No, SingleFile is not a tool used by professionals to archive content on the Web, especially in the academic field. Professionals prefer to rely on tools based on the WARC specification instead.
[1] https://github.com/gildas-lormeau/SingleFile/blob/master/faq...
reply
throw0101a
8 hours ago
[-]
> For saving a webpage you have open

There's also print-to-PDF that most OSes now have.

reply
wis
7 hours ago
[-]
Yeah, pretty much all browsers on all OSes have print-to-PDF/save-to-PDF, I prefer saving an HTML file over saving a PDF file for 3 reasons:

1. SingleFile allows me to save an HTML file that looks exactly like the webpage I saved. I never used a save-to-PDF functionality in any browser that allowed me to save a PDF that looks exactly like the webpage I was saving/printing. I wish browsers implemented that, somebody did that once, they patched chromium to save a web page as SVG[1], AFAIK if you can save to SVG you can also save to PDF with not much modification to the code, unfortunately the fork is not maintained anymore.

2. The HTML files that SingleFile creates are responsive (just like the webpage you had open), PDF is not responsive. I like that because it makes it easier to read the webpage I saved on my phone later, with a PDF file you saved on your desktop, you have to pinch to zoom and pan while you read it on your phone.

3. HTML-files/Webpages are accessible to screen readers and my browser's extensions work on them, extensions don't work on PDF files (they _can_ work on HTML files opened from disk, if you allow/enable it in the extension's settings).

[1] https://news.ycombinator.com/item?id=33584941

reply
rambambram
5 hours ago
[-]
I use WebScrapBook, an extension for Firefox. It seems to save a whole page in one file, and I can tweak a lot of the settings.

Sometimes I wonder if there's an even easier browser-builtin function that does the same?

reply
Dwedit
9 hours ago
[-]
There are extensions like "Save Page WE" that will dump the current state of the DOM to an HTML file, including CSS and Images, but these are static and don't make the scripting work.
reply
dehrmann
3 hours ago
[-]
This feels like confirmation bias to me. The author seemed to genuinely consider the question, but didn't think critically about how little value he got from two decades of bookmarking and instead focused on how he could use this archive in the future.
reply
og2023
3 hours ago
[-]
We have become so cloud-native (god forbid!). Just recently I realised that I can save an interesting page to my hard drive instead of saving its link. What a wonderful world has opened since! It's so liberating to live without all these bs tools.
reply
Macuyiko
6 hours ago
[-]
From an age perspective (but the crowd here will not like that): before, I trusted that I could always find it again, so I didn't need to save it. Now I can't anymore, but I don't care so much.
reply
greatgib
10 hours ago
[-]
Sometimes you have strange obsessions or a strange mindset related to your technological habits. And you might easily think that it is only you that is weird, not thinking straight. If you are the only one doing something, you are probably wrong.

And then, hopefully, there are nice personal blog posts like this one, showing you that you are not alone having some peculiar habits and so that it might make sense even if most people don't even think about it.

I have the exact same feeling when I discover through hn, blog posts and events that I'm not the only one having my web browsers full of tabs. Literally having thousands of tabs.

reply
neilv
7 hours ago
[-]
One thing that is worth saving is the PDF manuals for physical products that you own.

These sometimes disappear from the Web. Or disappear except for some third-party site that modifies and/or paywalls them.

Also, save the occasional important support info Web pages for those products. You'll know it when you see it. And if you don't save it now, it might be gone when you need it.

You don't need a fancy system for this. I just made a directory `~/doc/`, and started dropping files into it. Someday, I'll take the time to merge this with `~/wiki/`, but for now, I'm capturing the information with low friction, which is most important.

reply
Groxx
7 hours ago
[-]
And even when they don't disappear, they still end up dozens of weird pages deep that none of the on-site help text or search points to correctly due to the various pointless redesigns the site has gone through.

But hey, there's more whitespace now.

reply
jscottbee
8 hours ago
[-]
I created a local-only web app to wrap up some of my favorite web haunts, with HN being one of them. It allows me to look at the headlines, and save any of them in a local SQLite db that the app maintains.

https://i.postimg.cc/v8znk92x/ycomb-hn.png.

reply
willjp
9 hours ago
[-]
This resonates so strongly with me. I worked a job where I needed to use outdated Microsoft toolchains to build plugins for software, and the documentation was just -- gone. Good luck. I've been almost compulsively saving the things that feel important to me, while seldom browsing them for years -- all the while hunting for a faster and more intuitive recall system that lets me find them later.

My ex, however, had a much more fluid relationship with the internet and media in general. They liked new things, and didn't particularly care if they enjoyed something and it faded into obscurity. I feel like that's the winning mentality, but I just can't bring myself to embrace it.

reply
Viktoire
11 hours ago
[-]
When I save things, I try to make sure that it'll be immediately useful to me once I find it again.

I'll highlight, summarise and take notes of what I save. Or some combination of those. If I don't find anything new or directly applicable to my life, I'll let it pass by.

This approach isn't good for archival purposes, but I hesitate to save a lot of things that I'll never read again.

reply
ghaff
11 hours ago
[-]
I'm going through my file cabinets right now. I'll keep a few things that catch my eye but I'll likely throw out most of it. The odd 25 year old computer magazine is probably interesting but not all of them collectively for the most part. And I'm certainly not going to index them in a way that they'd be useful to me.
reply
galleywest200
10 hours ago
[-]
You can probably sell or donate those old magazines to a collector, or a kid interested in that stuff. At the very least drop them off at a thrift store instead of just dumping them.
reply
ghaff
10 hours ago
[-]
Thrift stores don't want a ton of old paper. There are a lot of things that someone somewhere would probably like but I'm not going to track them down or get them there. Mostly it's not magazines anyway. It's a bunch of articles I ripped out over the years.

The one thing I have in my garage I know someone would want is a big pile of laserdiscs. But, again, a thrift shop (or my library) wouldn't want them and I live pretty far out from a major city. Probably will try Craigslist post-winter though as I'm trying to declutter.

reply
buildsjets
5 hours ago
[-]
Laserdiscs appear and gradually disappear at my local thrift, so someone must be buying them. Now in the vinyl records pile, there are copies of Mantovani, Jim Nabors, and Herb Alpert which have been there for years, but anything classic rock or newer sells the same day.
reply
Falkon1313
1 hour ago
[-]
I was just thinking yesterday, wanting some Christmas music to get into the spirit while wrapping presents, remembering being a kid, when my mom would put on Jim Nabors' Christmas album.

Luckily there are (currently) multiple playlists of it on Youtube.

But they might not be there next year.

reply
ghaff
4 hours ago
[-]
In the spring I'll probably do a take-it-or-leave-it listing for the whole pile on Craigslist at a nominal price and, if that doesn't work, just take it up to the local thrift and I'll at least have tried.
reply
profsummergig
4 hours ago
[-]
Instead of saving them as PDFs, I started saving web pages using a Chrome extension called Single File [1] (after testing it, of course).

To my dismay, some saved files (.htm extension) didn't open when I wanted to open them.

So I'm glad people are discussing ways to archive web pages that reproduce the original page faithfully.

[1] https://chromewebstore.google.com/detail/singlefile/mpiodijh...

reply
deskr
8 hours ago
[-]
"stuff online" is an exceptionally coarse filter to deem something worthy of saving.
reply
paulcole
3 hours ago
[-]
I’m the opposite of most of the “archivists” on HN. I delete everything and save nothing. I have maybe 25 sheets of paper in my apartment, including social security card and birth certificate.

Saving stuff just isn’t fun or useful for me. Never for more than a passing moment have I thought, “Boy I wish I had saved that whatever.”

Old people are the worst about this stuff. They think/hope somebody will want it and then just make it the next generation’s problem.

I told my dad if he thinks it has value, give it away while he’s alive. I have neither the interest nor the space to deal with it so it’s going straight into the trash.

reply
asimpletune
13 hours ago
[-]
It reminds me of the cool links page I see now and then.
reply
mxuribe
9 hours ago
[-]
Is this the classic webpage that you're referring to? https://www.w3.org/Provider/Style/URI
reply
btbuildem
9 hours ago
[-]
I think some stuff is -- the stuff that is crucial to rebuilding all the other stuff.
reply
mediumsmart
6 hours ago
[-]
I used to think so and then I ran out of space
reply
RajT88
11 hours ago
[-]
Stuff online is absolutely worth saving. It is a window into the past - what people concerned themselves with, what they loved and hated.

Scholars will write papers on this era, speculating what it was like and how it fit into what came after.

The web documents the massive societal changes underway which do not relate to the internet directly. Things like changes in transportation technology, medicine, sexuality and gender, and how your average people felt about all of it. Scholars will data mine those opinions to understand who felt what ways and why, with the benefit of hindsight. New knowledge will come of it.

So yeah! It is all worth saving.

reply
underseacables
13 hours ago
[-]
I suppose it comes down to what the purpose of such archiving is.

I think it's the preservation of information, but I also believe 90% is absolutely pointless. There is just so much of it, and data storage so cheap, that it makes sense to just save everything.

reply
dreamcompiler
12 hours ago
[-]
That data storage is also ephemeral. None of it will last as long as a paper note, unless some human goes to the trouble of copying it all onto new drives with new software every ten years or so.
reply
Atreiden
12 hours ago
[-]
With a proper NAS and RAID10 for double parity, it's a bit like Theseus ship. Just keep swapping out drives when they become unhealthy and you never have to rebuild or migrate
reply
ninalanyon
11 hours ago
[-]
Eventually the controller will die and eventually compatible ones will no longer be produced or will at least be inconvenient to obtain or commission and hence expensive.

Paper lasts for centuries without any attention beyond keeping it moderately dry and away from things that eat it.

reply
emptiestplace
11 hours ago
[-]
No sane person uses hardware RAID in 2024, if that's what you're referring to.
reply
zamadatix
11 hours ago
[-]
Whether you're using hardware RAID or not you still need a hardware storage controller of some type which accepts the new disks you can buy and works with the NAS. What they are saying is eventually that'll be more $ and time than just migrating off the system would be. From ENIAC to now could fit in one lifespan, would you still be maintaining a home floppy drive backup system in the 2040s or just save the time and effort with a migration?
reply
jpalawaga
6 hours ago
[-]
sure, you can always move the old storage mechanism to something new if it is too cumbersome.

why still back up floppies if you could just move the data to a single dvd, or throw it on the SAN?

RAID is just algorithms, the actual transport doesn't matter (i.e. spinning platter and solid state both use SATA connectors).

reply
danielbln
12 hours ago
[-]
Data rots though, you can't just save it once and be done with it. You have to migrate it across storage mediums, formats etc. It's a recurrent effort/cost.
reply
bdhcuidbebe
12 hours ago
[-]
More planning for less effort.

Do your research first. Use standards

Eg: html, pdf, h264/h265/av1 in mp4 container, chd, zip and so on depending on what you are storing.

reply
HeatrayEnjoyer
10 hours ago
[-]
On what physical medium?

I have 1 terabyte of data in 1860, how do I make sure the storage medium is still intact in 2024?

reply
TacticalCoder
5 hours ago
[-]
> I have 1 terabyte of data in 1860, how do I make sure the storage medium is still intact in 2024?

Storage keeps growing and price of storage keeps going down.

My DOS and even some C64 source code made it to this day on backups (DVDs, HDDs, SSDs, USB memory sticks, etc., both online and offline) and to ZFS pools. Media that didn't exist in the 80s/early 90s.

Floppy disks -> 40 MB HDD -> 6.4 GB HDD -> 80 GB HDD -> 500 GB HDD -> 240 GB SSD -> 1 TB NVMe SSD.

You get the idea.

The way you make sure you still have your data is by not focusing on the medium but by focusing on the fact that data is data.

Medium comes and goes. Data can (and should) be copied to new medium.

Not unlike:

    /home/pub/backups/oldBackups/DOSbackups/...
    ...Conner80MBHDDbackups/backups/oldBackups/Commodore64backups/...
Some people are going to complain about the naming but I have all my emails except for six months back since I started using the Internet. And I still have nearly all of my data since I started using computers. 8-bit computers.

Do you?

I don't care about naming much. "search, don't sort".

We've got emulators for just about every and any system. My vintage arcade cab has both real PCBs and a Pi running an emulator with thousands of arcade games on it.

You can already, today, emulate, say, the Raspberry Pi model you want using QEMU. There are container files that'll gladly do that for you.

Unless civilization ends there's simply not a world in which, say, PNG, JPG and x265 files aren't readable. This just won't happen.

FWIW I'm paranoid about the integrity of my data: I've got my own naming scheme where a cryptographic hash is added to many of my files.

For example:

        DSC_91394-b3-ae4f2877d3.jpg
This means "This file's Blake3 checksum begins with ae4f2877d3".

I then have a script doing statistical sampling: I enter a percentage and that percentage of files where a cryptographic hash is part of the filename are checked, randomly (if I enter 100 then 100% of the files are tested).

If I enter for example '7', then 7% of the files are tested and then there's high probability all checksums are correct.
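
A minimal sketch of such a sampling check, under stated assumptions: Python's standard `hashlib` has no BLAKE3, so this uses BLAKE2b as a stand-in, with a hypothetical `-b2-` tag in place of the `-b3-` shown above. The function names are illustrative and this is not the actual script.

```python
import hashlib
import math
import random
import re
from pathlib import Path

# Assumed naming scheme: "<name>-b2-<hex prefix>.<ext>". The "-b2-" tag is
# hypothetical; the scheme above uses "-b3-" for BLAKE3, which hashlib lacks.
NAME_RE = re.compile(r"-b2-([0-9a-f]+)\.[^.]+$")


def digest_prefix(path, length):
    """Return the first `length` hex digits of the file's BLAKE2b digest."""
    hasher = hashlib.blake2b()
    with open(path, "rb") as handle:
        # Read in 1 MiB chunks so large files are never loaded whole.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()[:length]


def check_sample(root, percent, rng=random):
    """Verify a random `percent` of tagged files; return (checked, mismatches)."""
    tagged = sorted(p for p in Path(root).rglob("*")
                    if p.is_file() and NAME_RE.search(p.name))
    wanted = math.ceil(len(tagged) * percent / 100)
    count = max(0, min(len(tagged), wanted))
    mismatches = []
    for path in rng.sample(tagged, count):
        expected = NAME_RE.search(path.name).group(1)
        if digest_prefix(path, len(expected)) != expected:
            mismatches.append(path)
    return count, mismatches
```

`check_sample(root, 100)` tests every tagged file. A clean result at a small percentage makes widespread corruption unlikely, but it can still miss a single damaged file that was not drawn into the sample.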

> On what physical medium?

That is the wrong question.

reply
sigio
12 hours ago
[-]
Well... storage is cheap, but not cheap enough to save everything, with just usenet being in the 400TB/day range these days. Sure, it's cheap enough to save every webpage you visit during your life, but probably not cheap enough to save every video you click on youtube or watch on a streaming-service, and all the music you listen to all day.

Though just the music compressed in opus at 128kbit might work ok, 60 years of 24/7 128kbit is 30TB, so that would fit on 1 large HDD currently.
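
The 30TB figure can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 kbit = 1000 bits, 1 TB = 10^12 bytes) and 365.25-day years; the function name is made up for the example.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # 31,557,600 seconds


def stream_bytes(kbit_per_s, years):
    """Bytes produced by a constant-bitrate stream running non-stop."""
    bytes_per_second = kbit_per_s * 1000 / 8
    return bytes_per_second * SECONDS_PER_YEAR * years


total = stream_bytes(128, 60)
print(f"{total / 1e12:.1f} TB")  # prints "30.3 TB"
```

So 60 years of continuous 128kbit audio comes to roughly 30TB, matching the estimate.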

reply
saulpw
10 hours ago
[-]
Music is actually an ideal candidate. I don't listen to music all day, and when I do listen to it, it's often something I've listened to before. My current collection is about 200GB and that includes a ton of stuff I've never listened to; it seems reasonable that a full life's worth of music could fit in 1TB, easily.
reply
add-sub-mul-div
8 hours ago
[-]
If that much data comes across Usenet daily then how do services afford the storage to offer years of retention?

You can't dedupe the large binary files because they're encoded in small parts likely differently every time they're posted.

reply
renewiltord
9 hours ago
[-]
In general, I am pro-turnover where there is rivalry: ceteris paribus keep the newer thing. However, information is so cheap as to be effectively non-rivalrous so I am considering running my own archival and to keep kagi's small sites etc. alive. Unfortunately, there is not a good tool for this that matches whatever Archive.org has. ArchiveHub needs routine management to keep the feed up and viewing it is not that easy. I'm sure we'll come up with stuff.

The other thing is that searching for the long tail is near impossible. The big sites dominate Google, so I need something like marginalia to actually get to the old stuff that it used to be so easy to find. Because of the median user having simple queries, some questions are no longer answerable on Google: they are dominated by the median user and never show up.

reply
swayvil
11 hours ago
[-]
Curve smoothing. Chaikin's algorithm and Jarek's tweak etc. Very clever and nice way of making angular geometry curvy. Constructive geometry stuff.

There were like a dozen algs. I kept links to nice papers with diagrams. Then they started disappearing. Now I'd be pressed to find 2.

This is really useful info that is apparently disappearing. So yes, it happens, and maybe you should save that stuff.

reply
paulpauper
11 hours ago
[-]
Digital storage is free; yes, save it all
reply
lxgr
11 hours ago
[-]
Please do share where I can reliably store my backups for free!
reply
fragmede
11 hours ago
[-]
> Backups are for wimps. Real men upload their data to an FTP site and have everyone else mirror it.

— Linus Torvalds

reply
LinuxBender
10 hours ago
[-]
This does still happen. Microsoft may nuke a git repo and someone has to figure out who has the latest version of the entire repo with all the latest commits of every branch.
reply
theandrewbailey
10 hours ago
[-]
The vast majority of people aren't privileged enough to have anyone mirror their data.
reply
lxgr
10 hours ago
[-]
But how do I get everyone to mirror my gigabytes of encrypted photo backups?
reply
Falkon1313
1 hour ago
[-]
Title it "(current star name) Leaked Nudes.zip" and seed a torrent? Every few years, change the title to keep it current.
reply
paulpauper
9 hours ago
[-]
just upload them to social media accounts. Afaik twitter, facebook, and youtube do not have storage limits. no deletion for inactivity either.
reply
lxgr
5 hours ago
[-]
They don't allow uploading large binary blobs either, though, and steganographically storing gigabytes of data with probably terabytes of overhead sounds like a quick way to get banned.
reply
paulpauper
9 hours ago
[-]
dump it on Wikipedia. afaik wiki never removes anything. it just gets buried in an edit history. or Wikimedia image files
reply
lxgr
5 hours ago
[-]
That obviously can't be true, or spammers would be all over it, using Wikimedia as a free image host.
reply
impure
9 hours ago
[-]
The rise of LLMs has really devalued saving stuff online. What is the point of saving an article if I could just ask ChatGPT to create it and it would probably do a pretty good job? It’s still worth keeping notes and stuff that may be hard to find but the majority of things online can easily be reproduced and are not worth saving.
reply
vouaobrasil
9 hours ago
[-]
I think you are right. But I think the answer goes deeper: we have encouraged a culture where the most supported information is also the most superficial. The essence of individual experience itself has long been discouraged on the web in favour of SEO and the trashy news and the trivial.

So the fact that ChatGPT can replace much of the web actually says less about the marvel of ChatGPT and more about the lack of anything really worthwhile because the profound just happens to be the least economically valuable.

reply