That’s impressive. Wikipedia spends $185m per year and the Seattle public library spends $102m. Maybe not comparable exactly, but $30m per year seems inexpensive for the memory of the world…
Are there any other data centers harvesting waste heat for benefit?
https://www.twobirds.com/en/insights/2024/germany/rechenzent...
Also data centers need physical space, and often - you need heating where there is not a lot of space (cities), and for "district heating" you need higher temperatures usually.
https://datacentremagazine.com/data-centres/excess-data-cent...
https://www.bbc.com/news/technology-64939558
https://arstechnica.com/information-technology/2023/03/free-...
https://www.euroheat.org/dhc/knowledge-hub/datacentre-suppli...
* power budget dominates everything: I have access to a lot of rack hardware from old connections, but I don't want to put the army of old stuff in my cabinet because it will blow my power budget for not that much performance in comparison to my 9755. What disks does the IA use? Any specific variety or like Backblaze a large variety?
* magnetic is bloody slow: I'm not the Internet Archive so I'm just going to have a couple of machines with a few hundred TiB. I'm planning on making them all a big zfs so I can deduplicate but it seems like if I get a single disk failure I'm doomed to a massive rebuild
I'm sure I can work it out with a modern LLM, but maybe someone here has experience with actually running massive storage and the use-case where tomorrow's data is almost the same as today's - as is the case with the Internet Archive where tomorrow's copy of wiki.roshangeorge.dev will look, even at the block level, like yesterday's copy.
The last time I built with multi-petabyte datasets we were still using Hadoop on HDFS, haha!
Yeah, resilvers will take 24h if your pool is getting full but with RAIDZ2 it's not that scary.
I'm running TrueNAS scale. I used to just use Ubuntu (more flexible!) but over many years I had a some bad upgrades where kernel & zfs stopped being friends. My rack is pretty nearby so for me, a big 4U case with 120mm front fans was high priority, it has a good noise profile if you replace with Noctuas, you get a constant "whoosh" rather than a whine etc.
Running 8+2 with 24tb drives. I used to run with 20 slots full of old ex-cloud SAS drives but it's more heat / noise / power intensive. Also, you lose flexibility if you don't have free slots. So eventually ponied up for 24tb disks. It hurt my wallet but greatly reduced noise and power.
Case: RM43-320-RS 4U
CPU: Intel Xeon E3-1231 v3 @ 3.40GHz (4C/8T, 22nm, 80W TDP)
RAM: 32GB DDR3 ECC
Motherboard: Supermicro X10SL7-F (microATX, LGA1150 socket)
- Onboard: Dual Intel I210 1GbE (unused)
- Onboard: LSI SAS2308 8-port SAS2 controller (6Gbps, IT mode)
- Onboard: Intel C220 chipset 6-port SATA controller
Storage Controllers:
- LSI SAS2308 (onboard) → Intel RES2SV240 backplane (SFF-8087 cables)
- Intel C220 SATA (onboard) → boot SSD
Backplane:
- Intel RES2SV240 24-bay 2U/3U SAS2 Expander
- 20× 3.5" hot-swap bays (10 populated, 10 empty)
- Connects via Mini SAS HD SFF-8643 to Mini SAS SFF-8087 Cable, 0.8M x 5
Boot/Cache:
- Intel 120GB SSD SSDSC2CW120A3 (boot drive, SATA)
- Intel Optane 280GB SSDPED1D280GA (ZFS SLOG device, NVMe)
Network:
- Intel 82599ES dual-port 10GbE SFP+ NIC (PCIe x8 add-in card)
It's a super old box but it does fine and will max 10Gbe for sequential and do 10k write iops / 1k random read iops without problems. Not great, not terrible. You don't really need the SLOG unless you plan to run VMs or databases off it.I personally try to run with no more than 10 slots out of 20 used. This gives a bit of flexibility for expanding, auxiliary pools, etc etc. Often you find you need twice as much storage as you're planning on directly using. For upgrades, snapshots, transfers, ad-hoc stuff etc.
Re: dedup, I would personally look to dedup at the application layer rather than in the filesystem if I possibly could? If you are running custom archiving software then it's something you'd want to handle in the scope of that. Depends on the data obviously, but it's going to be more predictable, and you understand your data the best. I don't have zfs de-dup turned on but for a 200TiB pool with 128k blocks, the zfs DDT will want like 500GiB ram. Which is NOT cheap in 2026.
I also run a 7-node ceph cluster "for funsies". I love the flexibility of it... but I don't think ceph truly makes sense until you have multiple racks or you have hard 24/7 requirements.
for the first two, depending on throughput desired, you can do with spinning rust. you pick your exposure, single platter or not, speed or not, and interface. And no fancy raid hardware needed.
I've had decent luck with 3+1 warm and 4+1 archival. if you don't need quick seeks but want streaming data to be nice, make sure your largest file fits on a single drive, and do two parity disks for archive, a single for warm. md + lvm; ext4 fs, too. my very biased opinion based on tried everything and am out of ideas, and i am tired, and that stuff just works. I am not quick to the point but you need to split your storage up. use 18+ SMR disks, shingled magnetic recording hard drives, for larger stuff that you don't need to transfer very fast. 4k video for consumption on a 4k televsion fits here. Use faster, more reliable disks for data used a lot, &c
Hot or fast seeks & transfers is different, but i didn't get the idea that's what you were after. Hadoop ought be used for hot data, imo. People may argue that zfs of xfs or jfs or ffs is better than ext4, but are they gunna jump in and fix it for free when something goes wrong for whatever reason?
sorry, this is confusing. Unsure how to fix that. i have files on this style system that have been in continuous readable condition since the mid 1990s. There's been some bumps as i tried every [sic] other system and method.
TL;dr to scale my 1/10th size up, i personally would just get a bigger box to put the disks in, and add an additional /volumeN/ mountpoint for each additional array i added. it goes without saying that under that directory i would CIFS/NFS share subdirectories that fit that array's specifications. again, i am just tired of all of this, i'm also all socialed out so, apologies.
https://hackernoon.com/the-long-now-of-the-web-inside-the-in...
I wouldn't be surprised if it's AI.
It's time to come up with a term for blog posts that are just AI-augmented re-hashes of other people's writing.
Maybe blogslop.
I’m under the impression that this style of writing is what people wish they got when they asked AI to summarize a lengthy web page. It’s criticism and commentary. I can’t see how you missed out on the passages that add to and even correct or argue against statements made in the Hackernoon article.
In a way I can’t tell how one can believe that “re-hashing [an article], interspersed with [the blogger’s] own comments” isn’t a common blogging practice. If not then the internet made a mistake by allowing the likes of John Gruber to earn a living this way.
And trust that I enjoy a good knee-jerk “slop” charge myself. To me this doesn’t qualify a bit.
"Don't be snarky."
So it sounds like they have data in other locations as well, hopefully.
[1] https://en.wikipedia.org/wiki/Internet_Archive#Operations