FilterHN

dr_dshiv

1 month ago

[-]

> Li correctly points out that the Archive's budget, in the range of $25-30M/year, is vastly lower than any comparable website: By owning its hardware, using the PetaBox high-density architecture, avoiding air conditioning costs, and using open-source software, the Archive achieves a storage cost efficiency that is orders of magnitude better than commercial cloud rates.

That’s impressive. Wikipedia spends $185m per year and the Seattle public library spends $102m. Maybe not comparable exactly, but $30m per year seems inexpensive for the memory of the world…

AdamN

1 month ago

[-]

I think the culture is one of 'we are doing this for all humankind' and when you get just a few smart people bought in on that level of commitment and they're trying to be lean (and also for sure underpaying themselves compared to what they might make at Big Tech) then you can get impressive results.

sshine

1 month ago

[-]

I look at the 1990s picture of Brewster Kahle and think: He surely didn't get paid as much as me, but what did I do? Play insignificant roles in various software subscription services, many of which are gone now. And what did he do? Held on to an idea for decades.

The combined value of The Internet Archive -- whether we think just the infrastructure, just the value of the data, or the actual utility value to mankind -- vastly outperforms an individual contributor's at almost every well-paying internet startup. At the simple cost of not getting to pocket that value.

I wish I believed in something this much.

1 month ago

[-]

If you think that's fucked up, do you know how little we pay teachers? Especially preschool-K? Clearly money is just a metric for how much moneying the money had been able to money. Goodhart out it another way: "When a measure becomes a target, it ceases to be a good measure.

toomuchtodo

1 month ago

[-]

1k teachers in Arizona have quit in the last six months because of this.

Over 1,000 Arizona teachers resigning plays a part in shortage - https://news.ycombinator.com/item?id=46728151 - January 2026

sshine

1 month ago

[-]

I was a CS teacher for the past two years, so yes. I did it for quality of life reasons while my son learned to walk. But I almost doubled my salary going back to being a software dev.

miki123211

1 month ago

[-]

You can trade off cloud costs for developer time.

AWS is priced as if your alternative was doing everything in house, with Silicon Valley salaries. If your goal isn't "go to market quickly and make sure our idea works, no matter the cost", it may not be the right fit for you. If you're a solo developer, non-profit, or another organization with excess volunteer time and little money, you can very often do what AWS does for a fraction of the cost.

storystarling

1 month ago

[-]

I've found that for data-intensive workloads it isn't just a trade-off—the markup on egress and storage often makes the business model mathematically unviable. I'm bootstrapping a service with heavy image generation and the unit economics simply don't work on AWS.

exe34

1 month ago

[-]

aren't we told all the time though, that a board of directors beholden to shareholders and a god given edict to make numbers go up are the only way to do things efficiently, to be lean and productive? are you telling me that when people find there's a need for something to happen, they make it happen? for the good of mankind? no billionaires?

zozbot234

1 month ago

[-]

Wikipedia is not a pure hosting operation, it's trying to foster a worldwide community-of-practice of volunteer contributors that can be sustainable in the long term, and that does take quite a bit of spending. I have no idea why so many people keep getting this wrong.

swores

1 month ago

[-]

> "I have no idea why so many people keep getting this wrong."

To me it seems a perfectly natural effect of nearly everyone using it as a website which holds lots of information, and very few people comparatively have any experience with the community side, so people assume that what they see is what Wikipedia is.

Not many people are spending time reading reports on organisation costs breakdowns for Wikipedia, so the only way they'd know is if someone like you actively tells them. I personally also assumed server costs were the vast majority, with legal costs a probable distant second - but your comment has inspired me to actually go and look for a breakdown of their spending, so thanks.

Edit: FY24-25, "infrastructure" was just 49.2% of their budget - from https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_...

miki123211

1 month ago

[-]

Wikipedia is also uniquely cacheable.

I suspect that 95+% of visits to Wikipedia don't actually require them to run any PHP code, but are instead just served from some cache, as each Wikipedia user viewing a given article (if they're not logged in) sees basically the same thing.

This is in contrast to E.G. a social network, which needs to calculate timelines per user. Even if there's no machine learning and your algorithm is "most recent posts first", there's still plenty of computation involved. Mastodon is a good example here.

1 month ago

[-]

The move away from "most recent posts first" is because that's actually harder at scale than the algorithmic timeline.

wpietri

1 month ago

[-]

As a former Wikipedia admin, I think the best way to think of it as a massive text-focused battle MMORPG that happens to produce an encyclopedia as a side effect.

zozbot234

1 month ago

[-]

Yep, the encyclopedia is the not-so-wasteful "proof of work" part of the MMORPG. It's a game, but you grind it by working on generally useful stuff.

unixhero

1 month ago

[-]

Haha and with battles in the form of massive flame wars?

zozbot234

1 month ago

[-]

> holds lots of information

But they want that information to be at least kept up to date and hopefully to improve over time, right? That's what the community is for. It's not a free lunch.

swores

1 month ago

[-]

I wasn't insinuating any sort of judgement, from myself or from the vague general public that I referred to; just commenting on which parts are particularly visible & thought about.

Edit: I wasn't going to say anything, but then noticed you're the same person I was replying to before, so I will since it's more than once - in both your comments you seem to feel that you need to defend Wikipedia but in both cases there was nobody attacking them :)

I appreciate that internet comments can often contain lots of hostility, but I encourage you to remember that it's not a default state, and that often comments are just good faith opinions without an angry subtext. In both cases you could have just written as if adding some interesting information, rather than as if you're countering an anti-Wikipedia campaign. (And I'm not trying to attack or criticise now either, sorry if it comes off that way - just constructive feedback!)

B1FIDO

1 month ago

[-]

The Wikimedia Foundation is a full-fledged cloud services provider. They host applications and developers on their cloud platform. These developers have been working with AI and scripted solutions for a long, long time. ClueBot is the premier example of an AI- (ML)- powered solution to combat vandalism.

So Wikipedia is not merely a "cloud app with cloud storage" but it is a first-class cloud-based platform: the English project is merely the largest and best-known, but there are hundreds, hundreds of other projects hosted on WMF's cloud services. And the developers and the bot operators who run in the backend are hardly detectable by the end-users or even the everyday editors, but they are also the backbone of WMF services, and they are supported by WMF admins and developers, to run their applications that support editors and wiki admins in their duties.

vern001

1 month ago

[-]

I love libraries and museums, but I think that Internet Archive has done an incredible job.

If I didn’t have a job or responsibilities and was told that I was allowed to just be curious and have fun, I would spend a tremendous amount of time just reading, listening, watching, playing, etc. on IA.

Visiting IA is the closest feeling I can get to visiting the library when I was young. The library used to be the only place where you could just read swaths of magazines, newspapers, and books, and also check out music- for free.

Also, I love random stuff. IA has digitized tape recordings that used to play in K-Mart. While Wikipedia spends time culling history that people have submitted, IA keeps it. They understand the duty they have when you donate part of human history to them, instead of some person that didn’t care about some part of history just deleting it.

IA is not just its storage and the Wayback machine, even though those things are incredible and a massive part of its value to humanity. It’s someone that just cares.

At the end of the day, big companies just need to make profit. Do big companies care about your digitized 8-track collection you have in cloud storage? One day maybe they will take it away from you to avoid a lawsuit or to get you to rent music from them.

And your local NAS and backups? Do you think your niche archive will survive a space heater safety mechanism failure, a pipe bursting, when your house is collateral damage in a war, or your accidental death? I understand wanting to keep your own copies of things just-in-case, but if you want those things to survive, why not also host them at IA if others generally would find joy or knowledge from them?

1 month ago

[-]

My lil NAS won't survive, but do you also believe the IA's San Francisco office will survive"the big one" when it hits the San Andreas fault? Geographically redudndant storage is the only way to do it, and that goes goes for installations big and small.

CursedSilicon

1 month ago

[-]

IA has redundant backups in Europe

https://web.archive.org/web/20090219172931/https://blogs.msd...

dpedu

1 month ago

[-]

I'm surprised no-air-conditioning datacenters aren't more common. It's a huge cost, and people love to complain about related water usage. I recall some Microsoft employees running a similar experiment years ago:

delusional

1 month ago

[-]

I don't think it's really fair to compare IA to a real library. The Seattle public library for example spends 76% of their operating budget on employees, most of who are doing public services work. The second major expense for a real library is paying for books and materials, again IA doesn't do any of that.

It's not fair to compare an institution with a website.

entangledqubit

1 month ago

[-]

I thought the comparison was unfair as well.

Physical libraries also tend to be the defacto life help desk for a lot of people out there.

mc32

1 month ago

[-]

They’re both institutions but one wants recognition and nice buildings the other wants to be an immutable archive unlike Wikipædia which curates and memory holes issues that don’t align with its thinking. The other one just marches on without flashy managers at the helm making life easy for themselves.

buildbot

1 month ago

[-]

Seattle public library is also an archive as well as a provider of many beautiful and free third spaces. The downtown library is very cool. I bet there’s stuff in the stacks there that is not digitized anywhere.

komali2

1 month ago

[-]

The nice buildings provide a public space sheltered from the elements to millions of people a year. I love the IA but it really isn't a worthy comparison.

bakugo

1 month ago

[-]

> Wikipedia spends $185m per year

Only a small fraction of that is spent on actually hosting the website. The rest goes into the pockets of the owners and their friends.

You can do a lot with very little if your primary goal isn't to enrich yourself.

Atreiden

1 month ago

[-]

Do you have a source for that?

Being a 503c, they're required to disclose their expenditures, among other things. CN gives them a perfect score, and the expense ratio section puts their program spend at 77.4% of the budget https://www.charitynavigator.org/ein/200049703#overall-ratin...

Worth mentioning that Wikipedia gets an order of magnitude more traffic than the Internet archive.

https://wikimediafoundation.org/annualreports/2023-2024-annu...

dpedu

1 month ago

[-]

In their latest available annual report, the Wikimedia Foundation reported that in 2024 they brought in $185M in revenue/donations, of which they spent $178M. Of that $178M, $106M was spent on salaries and benefits, and $26M on awards and grants. So, that accounts for 75% of their spending. "Internet hosting" is listed at only $3M though there are other line items such as "Professional service expenses" at $13M that probably relate to running Wikipedia too.

Scroll down to the "Statement of activities (audited)" section:

rrr_oh_man

1 month ago

[-]

> $106M was spent on salaries and benefits

…across 650 employees, which is $166K on average.

[1] https://en.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia

bakugo

1 month ago

[-]

> Worth mentioning that Wikipedia gets an order of magnitude more traffic than the Internet archive.

With an order of magnitude less data to host, though. The entirety of Wikipedia is less than 1PB [1], while the entirety of IA is 175+ PB [2].

Traffic is relatively cheap, especially for a very cache-friendly website like Wikipedia.

[2] https://archive.org/about/

https://wikimediafoundation.org/who-we-are/financial-reports...

kingstnap

1 month ago

[-]

Wikipedia's actual hosting is not expensive and never has been.

If you look at the audited financial report of last year.

$3,474,785 was spent on hosting. Which makes sense its basically a static site.

This is out of expenses of $190,938,007

Thats about 1.8%. This is not new. Its been the case for years. Wikipedia has never had very high hosting costs. Its always been going into their grants or whatever else.

Despite the nonsense about AI overloading their servers even if it doubled the load it would barely affect the budget.

phatfish

1 month ago

[-]

My countdown to donating to Wikipedia when a random MAGA nerd makes some baseless claims is getting close. When Elon had his little rant a couple of years ago it got triggered as well.

bakugo

1 month ago

[-]

A fool and his money are soon parted.

1 month ago

[-]

This is very cool. One thing I am curious about is the software side of things and the details of the hardware. What is the filesystem and RAID (or lack of) layer to deal with this optimally? Looking into it a little:

* power budget dominates everything: I have access to a lot of rack hardware from old connections, but I don't want to put the army of old stuff in my cabinet because it will blow my power budget for not that much performance in comparison to my 9755. What disks does the IA use? Any specific variety or like Backblaze a large variety?

* magnetic is bloody slow: I'm not the Internet Archive so I'm just going to have a couple of machines with a few hundred TiB. I'm planning on making them all a big zfs so I can deduplicate but it seems like if I get a single disk failure I'm doomed to a massive rebuild

I'm sure I can work it out with a modern LLM, but maybe someone here has experience with actually running massive storage and the use-case where tomorrow's data is almost the same as today's - as is the case with the Internet Archive where tomorrow's copy of wiki.roshangeorge.dev will look, even at the block level, like yesterday's copy.

The last time I built with multi-petabyte datasets we were still using Hadoop on HDFS, haha!

adrian_b

1 month ago

[-]

For a few hundred TiB, LTO-9 magnetic tapes (18 TB per cartridge) become cheaper than hard disks, despite the huge cost of the tape drive (between $4500 and $5000, but in many places greater prices are shamelessly requested, which should not be accepted; besides a tape drive you need a SAS HBA card and appropriate cables; thus you need to amortize about $5000 through the price difference between HDDs and tapes, to reach the threshold where using tapes becomes cheaper), and they become cheaper and cheaper the greater is your total amount of data, when the cost of the tape drive is amortized over more tapes. One may choose tapes for better reliability even for amounts of data that are insufficient to amortize the initial cost for tape drive + SAS HBA card + SAS cables.

This is especially true when you take into account that regardless whether you use HDDs or tapes, you should better duplicate them and preferably not keep the copies in the same place.

The difference in cost between tapes and HDDs becomes significantly greater when you take into account that data stored on HDDs must be copied on new HDDs after a few years, due to the short lifetime of HDDs. The time after which you may need to move data on new tapes is not determined by the lifetime of tapes (guaranteed to be at least 30 years) but by the obsolescence of the tape drives for a given standard, and it should be after at least 10 to 15 years.

If you keep on a SSD/HDD a database of the content of the tapes, containing the metadata of the stored files and their location on tapes, the access time to archived data is composed of whatever time you need for taking the tape from a cabinet and inserting it into the drive, plus a seeking time of around 1 minute, on average.

Once the archived data is reached, the sequential transfer speed of tapes is greater than that of HDDs.

LTO-9 cartridges have a significantly lower volume and weight than 24-TB HDDs (for storing the same amount of data), which simplifies storage and transport.

1 month ago

[-]

Ah I see. Thank you. I think to get the sort of functionality I want for the amount of effort I'd have to put in, I'd have to get a tape autoloader and so on. My cabinet is not conveniently located for me so I can't be swapping tapes and so on. I can see that they are quite good for long term offline storage. It's just that I have more of an online backup type target.

Datagenerator

1 month ago

[-]

You might want to look into using cephadm to setup CEPH. Use Erasure coding as data pool for very efficient data storage and protection (8+2). From that export large RBD to be used as zpool with dedup. Scales to Petabytes and has lots of failure protection options.

1 month ago

[-]

Thank you for that recommendation. I think I'm probably too small time for the moment for Ceph considering that. I don't have multi-petabytes (yet or perhaps ever).

xyzzy123

1 month ago

[-]

Not a pro data guy but someone running something like what you're talking about for many years. These days 200TiB is "normal storage server" territory, not anything exotic. You can just do the most boring thing and it will be fine. I'm just running 1, tho. The hard parts are having it be efficient, quiet and cheap which always feels like an impossible triangle.

Yeah, resilvers will take 24h if your pool is getting full but with RAIDZ2 it's not that scary.

I'm running TrueNAS scale. I used to just use Ubuntu (more flexible!) but over many years I had a some bad upgrades where kernel & zfs stopped being friends. My rack is pretty nearby so for me, a big 4U case with 120mm front fans was high priority, it has a good noise profile if you replace with Noctuas, you get a constant "whoosh" rather than a whine etc.

Running 8+2 with 24tb drives. I used to run with 20 slots full of old ex-cloud SAS drives but it's more heat / noise / power intensive. Also, you lose flexibility if you don't have free slots. So eventually ponied up for 24tb disks. It hurt my wallet but greatly reduced noise and power.

  Case: RM43-320-RS 4U

  CPU: Intel Xeon E3-1231 v3 @ 3.40GHz (4C/8T, 22nm, 80W TDP)
  RAM: 32GB DDR3 ECC
  Motherboard: Supermicro X10SL7-F (microATX, LGA1150 socket)
    - Onboard: Dual Intel I210 1GbE (unused)
    - Onboard: LSI SAS2308 8-port SAS2 controller (6Gbps, IT mode)
    - Onboard: Intel C220 chipset 6-port SATA controller

  Storage Controllers:
    - LSI SAS2308 (onboard) → Intel RES2SV240 backplane (SFF-8087 cables)
    - Intel C220 SATA (onboard) → boot SSD

  Backplane:
    - Intel RES2SV240 24-bay 2U/3U SAS2 Expander
    - 20× 3.5" hot-swap bays (10 populated, 10 empty)
    - Connects via Mini SAS HD SFF-8643 to Mini SAS SFF-8087 Cable, 0.8M x 5

  Boot/Cache:
    - Intel 120GB SSD SSDSC2CW120A3 (boot drive, SATA)
    - Intel Optane 280GB SSDPED1D280GA (ZFS SLOG device, NVMe)

  Network:
    - Intel 82599ES dual-port 10GbE SFP+ NIC (PCIe x8 add-in card)

It's a super old box but it does fine and will max 10Gbe for sequential and do 10k write iops / 1k random read iops without problems. Not great, not terrible. You don't really need the SLOG unless you plan to run VMs or databases off it.

I personally try to run with no more than 10 slots out of 20 used. This gives a bit of flexibility for expanding, auxiliary pools, etc etc. Often you find you need twice as much storage as you're planning on directly using. For upgrades, snapshots, transfers, ad-hoc stuff etc.

Re: dedup, I would personally look to dedup at the application layer rather than in the filesystem if I possibly could? If you are running custom archiving software then it's something you'd want to handle in the scope of that. Depends on the data obviously, but it's going to be more predictable, and you understand your data the best. I don't have zfs de-dup turned on but for a 200TiB pool with 128k blocks, the zfs DDT will want like 500GiB ram. Which is NOT cheap in 2026.

I also run a 7-node ceph cluster "for funsies". I love the flexibility of it... but I don't think ceph truly makes sense until you have multiple racks or you have hard 24/7 requirements.

1 month ago

[-]

Very cool. Okay, I think you're right. Doing dedupe at the application layer is a much better idea. I do have 512 GiB of DDR5 (it's an Epyc 9755-based server) but I think you're right because I am fully aware of the data I'm storing (internet archive data) so I can simply delta-code on a per webpage sense.

Right, I knew from /r/homelab that many normal people now store petabytes in their nodes. My specific machine is going to be in a DC located some 1 hr from me so I don't mind noise, but I am particular about power consumption and so on.

Based on what you said I'm going to run RAIDZ2 on this. I happen to have a bunch of EXOS 18 TiB drives so I shall use those. Thank you for the advice from experience!

genewitch

1 month ago

[-]

a couple hundred TB arranged how? and for what purpose, generally? archival, warm, hot?

for the first two, depending on throughput desired, you can do with spinning rust. you pick your exposure, single platter or not, speed or not, and interface. And no fancy raid hardware needed.

I've had decent luck with 3+1 warm and 4+1 archival. if you don't need quick seeks but want streaming data to be nice, make sure your largest file fits on a single drive, and do two parity disks for archive, a single for warm. md + lvm; ext4 fs, too. my very biased opinion based on tried everything and am out of ideas, and i am tired, and that stuff just works. I am not quick to the point but you need to split your storage up. use 18+ SMR disks, shingled magnetic recording hard drives, for larger stuff that you don't need to transfer very fast. 4k video for consumption on a 4k televsion fits here. Use faster, more reliable disks for data used a lot, &c

Hot or fast seeks & transfers is different, but i didn't get the idea that's what you were after. Hadoop ought be used for hot data, imo. People may argue that zfs of xfs or jfs or ffs is better than ext4, but are they gunna jump in and fix it for free when something goes wrong for whatever reason?

sorry, this is confusing. Unsure how to fix that. i have files on this style system that have been in continuous readable condition since the mid 1990s. There's been some bumps as i tried every [sic] other system and method.

TL;dr to scale my 1/10th size up, i personally would just get a bigger box to put the disks in, and add an additional /volumeN/ mountpoint for each additional array i added. it goes without saying that under that directory i would CIFS/NFS share subdirectories that fit that array's specifications. again, i am just tired of all of this, i'm also all socialed out so, apologies.

1 month ago

[-]

I suppose archival is as close to realistic as possible. It's intended for a personal Internet archive of a subset of sites I wish to crawl and store. I will query the data rarely, but I intend to store recent updates and so on. I have lots of CMR disks so I intend to use those. I intend to use zfs and I'm hoping I can add more disks later to the pool. What is your experience in gradually growing your storage and having to resilver? Do you just create new volumes?

genewitch

1 month ago

[-]

i don't use zfs, so i am unsure. I have associates that manage ZFS stuff, but nothing at this scale. I'm sure zfs for your use case will be just fine, though. I mean archival and not heavy cache / deletions / etc.

I won't belabor my love for ext4 and basic tools :-)

1 month ago

[-]

Haha, no no. If ext4 is working fine for you that's great. I mentioned zfs because I hope to be able to expand a zpool with more drives and so on. Please do share if you have done things like that with LVM + ext4 or something like that.

genewitch

1 month ago

[-]

yes, lvm lets you extend by, for example, adding physical volumes to volume groups, which then let you expand your logical volume (like /volume1 mountpoint) - i checked man lvm for lvm2 on devuan and that confirmed my memory.

mrexroad

1 month ago

[-]

> This "waste heat" system is a closed loop of efficiency. The 60+ kilowatts of heat energy produced by a storage cluster is not a byproduct to be eliminated but a resource to be harvested.

Are there any other data centers harvesting waste heat for benefit?

https://www.twobirds.com/en/insights/2024/germany/rechenzent...

cloud-oak

1 month ago

[-]

The EU mandates that all large data centres built/commissioned from July this year will make use of waste heat:

londons_explore

1 month ago

[-]

Sounds like part of the reason all the biggest AI data centers are being built outside the EU...

Dr4kn

1 month ago

[-]

It's more like there are a lot of building restrictions and fines. Overloading the local power systems. Building illegal turbines.

If you can get paid on your waste heat why wouldn't you like that?

1 month ago

[-]

"can" and "must" are two different situations.

miduil

1 month ago

[-]

Yes, plenty - sometimes data centers are built together with apartment or office complexes for this particular purpose. Unfortunately that already pinpoints the core limitation, due to the low-temperature of the data centers. The higher the temperature difference is, the more affective heating becomes - with air cooled systems it requires preparation to ensure that can be used for heating.

Also data centers need physical space, and often - you need heating where there is not a lot of space (cities), and for "district heating" you need higher temperatures usually.

https://www.euroheat.org/dhc/knowledge-hub/datacentre-suppli...

stanac

1 month ago

[-]

Yandex had a data center in Finland,, not sure if it's still operational. It was heating 1500 homes with 4 MW.

https://datacentremagazine.com/data-centres/excess-data-cent...

bilegeek

1 month ago

[-]

Several swimming pools do to great effect:

https://www.bbc.com/news/technology-64939558

https://arstechnica.com/information-technology/2023/03/free-...

stingraycharles

1 month ago

[-]

I know that ~ 15 years ago we were already using datacenters in The Netherlands that were used to heat houses in a city.

I do vaguely remember that the economics of it all were not great, but it’s definitely a thing for quite a while already.

ranger_danger

1 month ago

[-]

I was hoping an article about IA's storage would go into detail about how their storage currently works, what kind of devices they use, how much they store, how quickly they add new data, the costs etc., but this seems to only talk about quite old stats.

metadat

1 month ago

[-]

The Internet Archive's Infrastructure https://news.ycombinator.com/item?id=46613324 - 8 days ago, 124 comments

https://hackernoon.com/the-long-now-of-the-web-inside-the-in...

jonas21

1 month ago

[-]

It does have these details for the current generation hardware. And if you want more, click on the link at the top:

reaperducer

1 month ago

[-]

Yeah, this is just blogspam. Some guy re-hashing the Hackernoon article, interspersed with his own comments.

I wouldn't be surprised if it's AI.

It's time to come up with a term for blog posts that are just AI-augmented re-hashes of other people's writing.

Maybe blogslop.

dexdal

1 month ago

[-]

That pattern shows up when publishing has near-zero cost and review has no gate. The fix is procedural: define what counts as original contribution and require a quick verification pass before posting. Without an input filter and a stop rule, you get infinite rephrases that drown out the scarce primary work.

tolerance

1 month ago

[-]

You and I must be different kinds of readers.

I’m under the impression that this style of writing is what people wish they got when they asked AI to summarize a lengthy web page. It’s criticism and commentary. I can’t see how you missed out on the passages that add to and even correct or argue against statements made in the Hackernoon article.

In a way I can’t tell how one can believe that “re-hashing [an article], interspersed with [the blogger’s] own comments” isn’t a common blogging practice. If not then the internet made a mistake by allowing the likes of John Gruber to earn a living this way.

And trust that I enjoy a good knee-jerk “slop” charge myself. To me this doesn’t qualify a bit.

schainks

1 month ago

[-]

What a slog post.

jonah-archive

1 month ago

[-]

This comment has, I think, made me more sad than anything I've ever read on HN before. David is one of the most thoughtful, critical, and valuable voices on the topic of digital archival, and has been for quite some time. The idea of someone dismissing his review of a much more slop-adjacent article as such is incredibly depressing.

arcade79

1 month ago

[-]

While reading this kind of articles, I'm always surprised by how small the storage described is. Given that Microsoft released their paper on LRCs in 2012, Google patented a bunch in 2010, facebook talked about their stuff around the 2010-2014 era too. CEPH started getting good erasure codes around 2016-2020.

Has any of the big ones released articles on their storage systems in the last 5-10 years?

https://cloud.google.com/blog/products/storage-data-transfer...

smueller1234

1 month ago

[-]

IIRC, the most recent and most technical public content we (Google) have published on Colossus are these:

Facebook's published content on Tectonic is quite good and I think it's well more recent than 2010-14.

(Current Google employee, just pointing to public content, hope that's helpful.)

arcade79

1 month ago

[-]

Nice, the L4 cache seem to be a newish addition. Love the detail about two filesystems with >10 exabytes of storage.

theMMaI

1 month ago

[-]

All the big ones have talked about their storage systems, but have been reluctant publishing papers like they used to do, so it appears to be more of a marketing focused effort than trying to share the technical details with the world.

tylerchilds

1 month ago

[-]

Why’s Wendy’s Terracotta moved?

https://en.wikipedia.org/wiki/Executive_Order_9066

tylerchilds

1 month ago

[-]

Every time I’ve seen that front pew in that first photo, she’s there too, holding this:

https://en.wikipedia.org/wiki/Wayback_Machine

1vuio0pswjnm7

1 month ago

[-]

List of references

https://blog.archive.org/2025/09/02/looking-back-on-preservi...

https://archive.org/web/petabox.php

https://en.wikipedia.org/wiki/PetaBox

https://ipfs.tech/

https://github.com/internetarchive/dweb-archive

https://en.wikipedia.org/wiki/Internet_Archive

https://www.eweek.com/storage/making-web-memories-with-the-p...

https://internetarchive.archiveteam.org/index.php/PetaBox

https://blog.archive.org/2010/07/27/the-fourth-generation-pe...

https://hackaday.com/2025/11/18/internet-archive-hits-one-tr...

https://www.computerworld.com/article/1562759/the-internet-a...

https://www.datacenterknowledge.com/business/internet-archiv...

https://www.rootsimple.com/2023/08/inside-the-internet-archi...

https://richmondsunsetnews.com/2017/03/11/internet-archive-p...

https://en.wikipedia.org/wiki/Heritrix

https://support.archive-it.org/hc/en-us/articles/11500108118...

https://digitalcommons.odu.edu/cgi/viewcontent.cgi?article=1...

https://iipc.github.io/warc-specifications/specifications/wa...

https://usehall.com/agents/heritrix-bot

https://library.imaging.org/admin/apis/public/api/ist/websit...

https://blog.archive.org/2025/03/

https://archive.org/details/alexacrawls

https://en.wikipedia.org/wiki/Alexa_Internet

https://projects.propublica.org/nonprofits/organizations/943...

https://werd.io/update-on-the-20242025-end-of-term-web-archi...

https://www.historyascode.com/tools-data/archive-it/

https://digitization.archive.org/pricing/

https://www.sfgate.com/tech/article/bay-area-warehouse-inter...

https://vault-webservices.zendesk.com/hc/en-us/articles/2289...

https://en.wikipedia.org/wiki/Hachette_v._Internet_Archive

https://copyrightalliance.org/copyright-cases/hachette-book-...

https://law.justia.com/cases/federal/appellate-courts/ca2/23...

https://www.library.upenn.edu/news/hachette-v-internet-archi...

https://www.lutzker.com/ip_bit_pieces/internet-archives-open...

https://blog.archive.org/2023/08/17/what-the-hachette-v-inte...

https://www.musicbusinessworldwide.com/labels-settle-copyrig...

https://consequence.net/2025/09/internet-archive-labels-sett...

https://blog.archive.org/2025/09/15/an-update-on-the-great-7...

https://giga.law/daily-news/2025/9/15/music-publishers-inter...

https://www.webpronews.com/internet-archive-settles-copyrigh...

https://blog.archive.org/2025/07/

https://blog.archive.org/2018/07/21/decentralized-web-faq/

https://blog.archive.org/2016/06/23/decentalized-web-server-...

https://blog.archive.org/2025/02/06/update-on-the-2024-2025-...

https://www.reddit.com/r/DataHoarder/comments/1ijkdjl/progre...

From https://news.ycombinator.com/item?id=46613324

1 month ago

[-]

No climate control. No backup power. And it's secured by a wireless camera sitting in a potted plant. Bless them, but wow.

https://news.ycombinator.com/newsguidelines.html

dang

1 month ago

[-]

"Please don't post shallow dismissals, especially of other people's work. A good critical comment teaches us something."

"Don't be snarky."

1 month ago

[-]

No, really: access to the server racks is solely protected by a battery-operated camera nestled into the fake dirt of a plastic floor plant.

dang

1 month ago

[-]

Ok, I'm going to assume that I misinterpreted your comment and that you didn't mean to be snarky!

chimeracoder

1 month ago

[-]

> In the unlikely, for San Francisco, event that the day is too hot, less-urgent tasks can be delayed, or some of the racks can have their clock rate reduced, disks put into sleep mode, or even be powered down. Redundancy means that the data will be available elsewhere.

So it sounds like they have data in other locations as well, hopefully.

[1] https://en.wikipedia.org/wiki/Internet_Archive#Operations

electroly

1 month ago

[-]

There's a mention on Wikipedia [1] that the Internet Archive maintains international mirror sites in Egypt and the Netherlands, in addition to several domestic sites within North America.

1 month ago

[-]

During the recent power outages in San Francisco, the site repeatedly went down. When a troubled individual set the power pole on fire outside their building, the site went down. Happy to give them the benefit of the doubt on data redundancy, but they publicly celebrate that Brewster himself has to bike down and flip switches to get the site back online. They don't even have employee redundancy.

krackers

1 month ago

[-]

And a site that's in a notorious earth-quake prone zone. I can only hope that with all the AI craze one of the bigcorp made a deal to take a copy of all data in exchange for providing it as backup if necessary

textfiles

1 month ago

[-]

Badlibrarian, you had a very fun run.

You got to be 30% correct with Internet Archive criticism and enjoy unfettered, sometimes problematic commentary with little pushback.

Maybe you should take what your version of the W is.

tolerance

1 month ago

[-]

Flaggers—on the occasion that the Internet Archive project collapses, badlibrarian’s name (indicating attitude, not acumen) in addition to their comments history checks out as a “told you so”.

1 month ago

[-]

I wish them the best (and support them in ways they're not even aware of). But they really need to get their act together. The public statements and basic stats do not match reality. An actual board and annual reports would be a nice start.