File Systems Unfit as Distributed Storage Back Ends (2019)
71 points | 1 day ago | 7 comments | dl.acm.org
acidmath
1 day ago
[-]
Before BlueStore, we ran Ceph on ZFS with the ZFS Intent Log on NVDIMM (basically non-volatile RAM backed by a battery). The performance was extremely good. Today, we run BlueStore on ZVOLs on the same setup, and if the zpool is a "hybrid" pool we put the Ceph OSD databases on an all-NVMe zpool. Ceph WAL wants a disk slice for each OSD, so we don't do Ceph WAL and instead consolidate incoming writes on the ZIL/SLOG on NVDIMM.
reply
nightfly
1 day ago
[-]
Why Ceph on ZVOLs and not bare disks?
reply
acidmath
1 day ago
[-]
In the servers we have only 16 GB to 64 GB of NVDIMM, depending on the density of the NVDIMM and how many slots are populated with it. Whatever the raw NVDIMM capacity is, usable is half, because we mirror the contents for physical redundancy (if we lose a transaction it is fatal to our business). NVMe is amazing, but not everything should be NVMe; petabyte-scale object storage, for example, does not need to be on all NVMe (which is super pricey).

In newer DDR5 servers where we can't get NVDIMM, the alternative battery backed RAM options leave us with even less to work with.

Where we have counts of HDDs or SATA/SAS SSDs in the hundreds, we still want the performance improvements provided by WAL (or a functional equivalent such as ZIL/SLOG) on NVDIMM and some layer-2 (where layer-1 is RAM) caching with NVMe.

Ceph OSDs want a dedicated WAL device. Some places use OpenCAS to make "hybrid" devices out of HDDs by pairing them with SSDs where the SSDs can accelerate reads for that HDD and the Ceph OSD goes on a logical OpenCAS device. OpenCAS is really great, but the devices acting as "caching layer" often end up underutilized.

By placing "big" Ceph OSDs on ZVOLs, we don't have individual disk slices for WAL (or equivalent) or individual disks for layer-2 read caching, but a consolidated layer in the form of the ZFS Intent Log on a "Separate Log" device (NVDIMM) and another consolidated layer in the ZFS disk pool's L2ARC (level-2 Adaptive Replacement Cache).

The ZVOLs are striped across multiple relatively large RAIDz3 arrays. Yeah, it's "less efficient" in some ways, but the tradeoff is worth it for us.

  https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#devices
  https://open-cas.com/
reply
__turbobrew__
1 day ago
[-]
Do you have any recommendations or warnings about running ceph clusters?
reply
Agingcoder
1 day ago
[-]
Find people who understand it. I’ve seen epic failures when things grow: you lose a DC and hell rains on you. It’s not magic; you will need people who get it (source: unstable cluster of a few petabytes where I work).
reply
acidmath
23 hours ago
[-]
Just off the top of my head:

Run Ceph on https://rook.io/ ; don't bother with Cephadm. Running Rook provides very helpful guard rails. Put the logs for Ceph Rook into Elasticsearch+Kibana on its own small (three or four node) dedicated Ceph Rook cluster. Which Kubernetes distro this runs on matters more than anything.

Recently we have been looking at using https://www.parseable.com/ instead of Elasticsearch+Kibana. We had also somewhat recently started moving things from Elasticsearch+Kibana to OpenSearch+OpenSearch Dashboards due to the license change.

The requirement outlined by Ceph documentation to dedicate layer-1 paths (can be same switches, but must be different ports) to Ceph replication is not about "performance" but about normal functionality.

If you have any pointed questions feel free to email "section two thirty audit@mail2tor dot com" (where "two thirty" are the three digits rather than spelled out).

reply
__turbobrew__
21 hours ago
[-]
I already set things up with Rook as we are super heavily invested in Kubernetes, and things are working well so far. I built out a test cluster to 1 PiB and was able to push more than a terabit/second through the cluster, which was good.

I also set up topology aware replication so pg’s can be spread across racks/datacenters.
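
To make the failure-domain idea concrete, here is a minimal sketch of rack-aware placement. It is a toy illustration only, not Ceph's CRUSH algorithm; the function name, the OSD names, and the rack labels are all made up for the example.

```python
from collections import defaultdict
import hashlib


def place_replicas(pg_id, osds, replicas=3):
    """Pick `replicas` OSDs for a placement group, at most one per rack.

    `osds` maps a hypothetical OSD name to its rack label. This is a toy
    stand-in for the idea of a failure domain, not Ceph's CRUSH algorithm.
    """
    by_rack = defaultdict(list)
    for osd, rack in osds.items():
        by_rack[rack].append(osd)
    if len(by_rack) < replicas:
        raise ValueError("not enough racks for the requested replica count")

    def score(key):
        # Deterministic pseudo-random ordering derived from the PG id.
        return hashlib.sha256(f"{pg_id}:{key}".encode()).hexdigest()

    # Choose the racks first, then one OSD inside each chosen rack.
    racks = sorted(by_rack, key=score)[:replicas]
    return [min(by_rack[rack], key=score) for rack in racks]
```

Because every replica lands in a different rack, losing one rack costs at most one copy of any placement group. The same shape works with rows or datacenters as the label.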

My main worry now is disaster recovery. From what I have seen, object recovery is quite manual if you lose any. I would like to write some scripts so we can bulk mark objects which we know are actually lost.
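
A rough sketch of what such a script could start from, assuming the list of affected PG ids has already been collected and reviewed by a human. It only builds `ceph pg <pgid> mark_unfound_lost revert|delete` command lines and runs nothing; the helper name and the PG id pattern are assumptions for illustration. Note that this command acts on all unfound objects in a PG, not on individual objects.

```python
import re

# Assumed PG id shape: "<pool number>.<hex id>", e.g. "2.5" or "1.2f".
PG_ID = re.compile(r"^[0-9]+\.[0-9a-f]+$")


def mark_lost_commands(pg_ids, action="revert"):
    """Build one `ceph pg <pgid> mark_unfound_lost <action>` argv per PG.

    Nothing is executed here; the caller reviews the list first.
    """
    if action not in ("revert", "delete"):
        raise ValueError("action must be 'revert' or 'delete'")
    commands = []
    for pg_id in pg_ids:
        if not PG_ID.match(pg_id):
            raise ValueError(f"not a PG id: {pg_id!r}")
        commands.append(["ceph", "pg", pg_id, "mark_unfound_lost", action])
    return commands
```

Keeping the output as a reviewable list of argument vectors (rather than running them directly) leaves room for a dry run before anything irreversible happens.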

We already have a Loki setup, so Ceph logs just get put into there.

reply
acidmath
14 hours ago
[-]
> object recovery is quite manual if you lose any

When I read this I think "but you should never lose an object". Do you mean like the underlying data chunks Ceph stores? Can you elaborate on this part? I know some of the teams I work with do things in unorthodox ways and we tend to operate on different assumptions than others.

> so pg’s can be spread across racks/datacenters.

Some Ceph pools come to mind (this was a while ago, I'm sure they're still running though) where the erasure coding was done across cabinet rows and each cabinet row was on its own power distribution. I don't know how the power worked but I was told rather forwardly that some specific Ceph pools' failure domains aligned with the datacenter's failure domains.

> We already have a loki setup

Nice. We have logs go into S3 and then anyone who prefers a particular tool is welcome to load whatever sets of logs from S3 within the resource limits set for whatever K8s namespace they work with. Originally keeping logs append-only in S3 was for compliance but we wanted to limit team members by RAM quota rather than tools in line with the "people over tools over process" DevOps maxim.

reply
shermantanktop
1 day ago
[-]
Also known as:

Write! No, fsync! No, really fsync I mean it!

Wait, why is my disk throughput so low? And why am I out of file descriptors?
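
For anyone who hasn't lived this, a minimal sketch of the usual "really fsync" dance on a POSIX system: write to a temp file, fsync it, rename it over the target, then fsync the directory. The function name is made up, and this assumes a local POSIX filesystem that honors fsync.

```python
import os
import tempfile


def durable_write(path, data):
    """Write bytes to `path` so the update survives a crash (POSIX sketch)."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()             # Python buffer -> kernel page cache
            os.fsync(tmp.fileno())  # page cache -> stable storage
        os.replace(tmp_path, path)  # atomic rename over the old file
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # fsync the directory so the rename itself is durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Two fsyncs and a rename per update is exactly why throughput drops when every small write goes through a general-purpose filesystem this way.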

reply
chupasaurus
1 day ago
[-]
The article is focused on Ceph, where the FS is a front end to the storage back end(s). Now read the title again...
reply
Dylan16807
1 day ago
[-]
> Wait, why is my disk throughput so low?

Because many filesystems do fsync wrong, for reasons that are not inherent to filesystems in general.

reply
baruch
1 day ago
[-]
It's easier to write the system's front end while paying little attention to the backend and "just" letting a local filesystem do a lot of the work for you, but it doesn't work well. The interesting question is if the result is also that the frontend-to-backend communication abstraction is good enough to replace the backend with a better solution. I'm not familiar enough with Ceph and BlueStore to have a conclusion on that.

I happen to work for a distributed file-system company, and while I don't do the filesystem part itself, the old saying "it takes software 10 years to mature" is so true in this domain.

reply
sitkack
1 day ago
[-]
See also "Hierarchical File Systems are Dead" by Margo Seltzer and Nicholas Murphy https://www.usenix.org/legacy/events/hotos09/tech/full_paper...
reply
MR4D
1 day ago
[-]
No mention of LATCH theory? (Location, Alphabet, Time, Category, and Hierarchy)

Oddly, no matter how they are organized, their indices will always be a hierarchy (tree).

Personally, I think human brains just have a categorization approach that is built into our brains as hierarchy, so while other methods are definitely useful, they are an add-on, not a replacement.

reply
zokier
1 day ago
[-]
Lots of these issues seem to be not specific to distributed systems and also impact local single-node systems. Notable examples are PostgreSQL's fsyncgate, or how mail servers in the past struggled (iirc that was one of the cases where ReiserFS shined).
reply
resurrected
1 day ago
[-]
Noooo, really?

It all depends on what you want to do. For things that are already in files, like all the data that DeepSeek and other models train on (DeepSeek open-sourced its own distributed file system for this), it makes sense to go with a distributed file system.

For OLTP you need a database with appropriate isolation levels.

I know someone will build a distributed file system on top of FoundationDB if they haven’t yet.

reply
_zoltan_
1 day ago
[-]
Around 2006 I built a FUSE fs that used MySQL as a backend, kept all file hashes (not blocks, just whole files) and did deduplication. Good old times.
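
Roughly the idea, as a minimal sketch: whole-file hashes as keys, so identical content is stored once no matter how many paths point at it. This uses an in-memory SQLite database as a stand-in for MySQL and leaves out the FUSE layer entirely; the class, table, and column names are invented for the example and are not the original code.

```python
import hashlib
import sqlite3


class DedupStore:
    """Whole-file deduplication: paths point at content hashes."""

    def __init__(self):
        # In-memory SQLite stands in for the MySQL backend (assumption).
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE blobs (hash TEXT PRIMARY KEY, data BLOB NOT NULL)"
        )
        self.db.execute(
            "CREATE TABLE files (path TEXT PRIMARY KEY, hash TEXT NOT NULL)"
        )

    def write(self, path, data):
        digest = hashlib.sha256(data).hexdigest()
        # Identical content is stored once, whatever its path.
        self.db.execute(
            "INSERT OR IGNORE INTO blobs (hash, data) VALUES (?, ?)",
            (digest, data),
        )
        self.db.execute(
            "INSERT OR REPLACE INTO files (path, hash) VALUES (?, ?)",
            (path, digest),
        )
        self.db.commit()

    def read(self, path):
        row = self.db.execute(
            "SELECT b.data FROM files f JOIN blobs b ON b.hash = f.hash "
            "WHERE f.path = ?",
            (path,),
        ).fetchone()
        if row is None:
            raise FileNotFoundError(path)
        return row[0]

    def blob_count(self):
        return self.db.execute("SELECT COUNT(*) FROM blobs").fetchone()[0]
```

The sketch never garbage-collects blobs that no path references any more; a real store would need reference counting or a sweep for that.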
reply
darkstar_16
1 day ago
[-]
Isn't the Cassandra file system something like that?
reply
AtlasBarfed
1 day ago
[-]
They did it atop Cassandra.
reply
jeffrallen
1 day ago
[-]
They have, at Exoscale. My officemate leads the team doing it.
reply
EGreg
1 day ago
[-]
Just use hypercore with hyperdrive. And be free!
reply
Spivak
1 day ago
[-]
It really is true, I spent years of my life wrangling a massive GlusterFS cluster and it was awful. You basically can't do any kind of file system operations on it that aren't CRUD on well-known specific paths. Anything else (traversal, moving/copying, linking, updating permissions) would just hang forever. You're also at the mercy of the kernel driver which does hate you personally. You will have nightmares about uninterruptible sleep. Migrating it all to S3 over Ceph was a beautiful thing.
reply
ted_dunning
1 day ago
[-]
That has more to do with gluster's primitive nature than with a general statement of what can work for distributed storage.
reply