If you need to know it’s been persisted to non-volatile storage then you need to own the full stack of every piece of software between the OS and the actual physical memory.
Every managed flash drive is going to have layers and layers of complexity and caching and things you simply can’t easily control or really understand. Don’t trust it unless you know exactly how it works all the way down.
My guess is the preallocation + zeroing is what got them most of the win, and the O_DIRECT is actually hurting, not helping throughput. This has been the case 100% of the time I've benchmarked such things.
If you're doing this sort of stuff for real under Linux, check out sync_file_range. It's the only non-broken and performant sync API for ext4 (note that it's broken by design for many other file systems, and the API is terribly difficult to use correctly).
If you really care, it's probably just easier to use SPDK or something. Linux has historically been pretty hostile towards DBMS implementations.
This is not a new trick. It has been used in many storage engine designs to effect durability without an fsync.
If we give ourselves two definitions of persisted - logically (WAL, or write) and physically (index, or read) - it seems like we can maintain the invariant that P < L by (1) keeping an in-memory view of P-L that we consult on every read to assert the delta, and (2) an expensive but asynchronous flush path for updating P, driven from reads verifying that L has landed. Then have we patched all the holes(?)
Famously not, as the man page says.
It is also said later in the article:
> POSIX strictly requires a parent-directory fsync to make a newly created file’s existence durable.
So I'm not sure why the dirent sync is claimed earlier.
Even if you just look at hardware failure rates, you get unrecoverable I/O errors (data corruption) at about one in 10^15 bits, disk failures at a rate of about 1% per year, etc. People usually like to have better guarantees than those numbers give you with just a plain fsync anyway; so you are probably forced to do an analysis of the whole system if you want to provide good durability guarantees and be able to explain where the guarantees come from.
If you’re building a data storage system and are using the term “durable” to mean “it’s in RAM on three virtual machines”, for example, I don’t think it’s unfair to say that you are lying to your customers, because you are intentionally misusing a well-established term.
And I wouldn't assume they meant that number to be per record in the first place.
I don't see how a virtualised NVMe disk is different from a physical one.
Especially if you don't have control over the underlying hardware (so you don't know whether it has power-loss-protection (PLP) SSDs), you should send the FUA.
> O_DATA_SYNC
You mean `O_DSYNC`?
Why would you need `O_DSYNC` on-premise, but not on cloud VMs? (Or are you saying you'd include it everywhere?) Similar to my above point, surely it is the task of the VM to pass through any FUA commands the VM guest issues to the actual storage?
Further: Is `O_DSYNC` actually substantially different from writing and then `fdatasync()`ing yourself?
My understanding is that no, it's the same. In particular, the same amount of data gets written. So if you avoid `fdatasync()` to dodge the "can trigger an order of magnitude more I/O" problem, you would just re-introduce it with `O_DSYNC`.
However, I suspect that that whole consideration is pointless:
The only thing that makes your O_DIRECT + preallocated-only-overwrites writes safe is enterprise SSDs with Power Loss Protection (PLP), usually capacitors.
On those SSDs, NVMe Flush/FUA are no-ops [1]. So you might as well `fdatasync()`/`O_DSYNC`, always. This is simpler, and also better because you do not need to assume/hope that your underlying SSDs have PLP: Doing the safe thing is fast on PLP [2], and safe on non-PLP.
[1] https://news.ycombinator.com/item?id=46532675
[2] https://tanelpoder.com/posts/using-pg-test-fsync-for-testing-low-latency-writes/
So the only remaining benefit of `O_DSYNC` over `fdatasync()` is that you save a syscall. That's an OK optimisation given they are equivalent, but it would surprise me if it had any noticeable impact at the latencies you are reporting ("413 us"), because [2] reports the difference being 6 us. Let me know if I got anything wrong.
The only remaining question is: Why do you then see any difference in your benchmark?
    Configuration              Throughput (obj/s)
    ---------------------------------------------
    ext4 + O_DIRECT + fsync               116,041
    Our engine                            190,985
That is what I'd find very valuable to investigate. The first suspicion I have is: shouldn't you be measuring `+ fdatasync` instead?
So I'd be interested in:
ext4 + O_DIRECT + fdatasync
ext4 + O_DIRECT + O_DSYNC
Our engine + O_DSYNC (which you're suggesting above)
Also, I don't fully understand what the remaining difference between "ext4 + O_DIRECT + O_DSYNC" and "Our engine + O_DSYNC" would be.

That is where the disparity lies here. Reading back the data after the device reports that it has been written offers little in the way of additional assurance that it was successfully written. But if you report successful writes without syncing, it is a near certainty that you'll lose data on every power loss.
EDIT: sketchy from an answering "what exactly are the guarantees?" perspective
Some storage devices guarantee durability of non-persisted writes, which is explicitly part of their model. Consequently, the entire durable write path is the storage device completing a DMA read of their buffer.
The underlying assumptions will not hold true for every environment. However, it will hold true for many and you can check most (all?) of them at runtime.
Bookmarked your whole blog for later consumption, interesting stuff!
[1] https://fractalbits.com/blog/metadata-engine-for-our-object-...
Would you be so kind as to explain what happens in a power-loss scenario?
It will also make system initialization faster, since right now we need to write all zeros to make ext4/xfs actually initialize the extents as "allocated".
Both of these are commonly done in database storage engines.