Indeed, 128 KB has long been the well-known optimal buffer size [1], [2]. That is, until it was recently (07.04.2024) increased to 256 KB [3].
[1] https://github.com/MidnightCommander/mc/commit/e7c01c7781dcd...
[2] https://github.com/MidnightCommander/mc/issues/2193
[3] https://github.com/MidnightCommander/mc/commit/933b111a5dc7d...
In 2014, the common heuristic was 256 kB based on measurements across many systems, so the 128 kB value is in line with that. At the time, optimal block sizing wasn't that sensitive to the I/O architecture, so many people arrived at the same values.
In 2024, the optimal block size based on measurement largely reflects the quality and design of your I/O architecture. Vast improvements in storage hardware expose limitations of the software design to a much greater extent than a decade ago. As a general observation, the optimal I/O sizing in sophisticated implementations has been trending toward smaller sizes over the last decade, not larger.
The seeming optimality of large block sizes is often a symptom of an I/O scheduling design that can't keep up with the performance of current storage hardware.
If you just want to saturate the bandwidth, to move some coherent blob of data from point A to point B as fast as possible (say you're implementing the `cp` command), then using large buffers is the best and easiest way. Small buffers confer no additional benefit other than driving more complicated designs, forcing io_uring with registered buffers and fds, etc.
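For concreteness, here is a minimal sketch of that large-buffer path (not anyone's production code): a cp-style loop moving data through one big userspace buffer. The 256 KB size and the plain read/write calls are illustrative choices, nothing more.

    // Sketch: copy src to dst through a single large buffer.
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char **argv) {
        if (argc < 3) { std::fprintf(stderr, "usage: %s <src> <dst>\n", argv[0]); return 1; }

        constexpr std::size_t BUF_SIZE = 256 * 1024;  // one big buffer, per the discussion above
        std::vector<char> buf(BUF_SIZE);

        int in = open(argv[1], O_RDONLY);
        int out = open(argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (in < 0 || out < 0) { std::perror("open"); return 1; }

        for (;;) {
            ssize_t n = read(in, buf.data(), buf.size());
            if (n < 0) { std::perror("read"); return 1; }
            if (n == 0) break;  // EOF
            for (ssize_t off = 0; off < n; ) {   // write may be partial, so loop
                ssize_t w = write(out, buf.data() + off, (std::size_t)(n - off));
                if (w < 0) { std::perror("write"); return 1; }
                off += w;
            }
        }

        close(in);
        close(out);
        return 0;
    }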
If you want to maximize IOPS, then, given that we just established large buffers are the way to saturate bandwidth, small buffers are the only viable option; but then you need to whittle down the per-read overhead, and you end up with io_uring or even more specialized tools.
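And a correspondingly minimal sketch of the small-buffer/IOPS direction, assuming Linux with liburing available (link with -luring): batch a pile of 4 KB reads into one submission instead of issuing them one syscall at a time. The queue depth, O_DIRECT, and the offsets are all illustrative.

    // Sketch: issue many small (4 KB) reads in one io_uring batch and reap them.
    // Error handling is kept minimal; this is not a benchmark harness.
    #include <liburing.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    int main(int argc, char **argv) {
        if (argc < 2) { std::fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }
        constexpr unsigned QD = 64;        // reads kept in flight per batch (illustrative)
        constexpr std::size_t BLK = 4096;  // small per-read buffer

        int fd = open(argv[1], O_RDONLY | O_DIRECT);  // O_DIRECT to bypass the page cache
        if (fd < 0) { std::perror("open"); return 1; }

        struct io_uring ring;
        if (io_uring_queue_init(QD, &ring, 0) < 0) { std::fprintf(stderr, "queue_init failed\n"); return 1; }

        std::vector<void *> bufs(QD);
        for (unsigned i = 0; i < QD; ++i)
            if (posix_memalign(&bufs[i], BLK, BLK) != 0) return 1;  // O_DIRECT wants aligned buffers

        // Queue QD independent reads at different offsets, then submit them with one syscall.
        for (unsigned i = 0; i < QD; ++i) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs[i], BLK, (__u64)i * BLK);
        }
        io_uring_submit(&ring);

        // Reap all completions.
        for (unsigned i = 0; i < QD; ++i) {
            struct io_uring_cqe *cqe;
            if (io_uring_wait_cqe(&ring, &cqe) != 0) break;
            if (cqe->res < 0) std::fprintf(stderr, "read %u failed: %d\n", i, cqe->res);
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }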
It has some nice information about hardware I/O operation limits, and also an optimal_io_size hint.
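Assuming this is the Linux block-layer queue limit exposed in sysfs, the hint is trivial to read programmatically; the device name below is hypothetical.

    // Read the kernel's optimal_io_size hint for a block device from sysfs.
    // A value of 0 means the device reports no preferred I/O size.
    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
        const std::string dev = "nvme0n1";  // hypothetical device name
        std::ifstream f("/sys/block/" + dev + "/queue/optimal_io_size");
        unsigned long bytes = 0;
        if (f >> bytes)
            std::cout << dev << " optimal_io_size: " << bytes << " bytes\n";
        else
            std::cerr << "could not read optimal_io_size for " << dev << "\n";
        return 0;
    }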
The downside of SPDK is that it is unreasonably painful to use in most contexts. When it was introduced there were few options for doing high-performance storage I/O but a lot has changed since then. I know many people that have tested SPDK in storage engines, myself included, but none that decided the juice was worth the squeeze.
See, e.g., https://www.vldb.org/pvldb/vol16/p2090-haas.pdf ("What Modern NVMe Storage Can Do, And How To Exploit It: High-Performance I/O for High-Performance Storage Engines") for actual data on this.
Of course, if your block size is large enough and/or your design batches enough that you already don't spend much time issuing I/O and reaping completions, then, as you say, SPDK will not provide much of a gain.
The point of NVMe namespaces is to partition at the block device layer. To turn one physical block device into multiple logical block devices, each with their own queues, LBA space, etc. It's for when your tenants are interacting with the block device directly. That's not a hack, that's intended functionality.
That paper seems to mostly focus on throughput via concurrent independent queries, rather than single-query performance. It's arriving at a different solution because it's optimizing for a different variable.
Also, an index with larger block sizes is not equivalent to a structure with smaller block sizes plus readahead. The index structure is not the same: larger coherent blocks give you better precision in your indexing structure for the same number of total forward pointers. Since there's no need to index within each 128 KB block, the forward-pointer resolution that would have gone to distinguishing between 4 KB blocks can instead help you rapidly find the next relevant 128 KB block.
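To put rough numbers on that (a made-up pointer budget, purely to show the scaling):

    // Made-up numbers illustrating the forward-pointer argument: for a fixed
    // pointer budget, pointing at 128 KB blocks covers 32x more data than
    // pointing at 4 KB blocks, i.e. the same resolution reaches 32x further.
    #include <cstdio>

    int main() {
        constexpr unsigned long long pointers  = 1ULL << 20;   // ~1M forward pointers (hypothetical)
        constexpr unsigned long long small_blk = 4ULL << 10;   // 4 KB
        constexpr unsigned long long large_blk = 128ULL << 10; // 128 KB

        std::printf("4 KB blocks:   %llu pointers cover %llu GiB\n",
                    pointers, (pointers * small_blk) >> 30);
        std::printf("128 KB blocks: %llu pointers cover %llu GiB\n",
                    pointers, (pointers * large_blk) >> 30);
        return 0;
    }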
> A counter argument might be that this drives massive read amplification,
For that, one needs to know the true minimal block size the SSD controller is able to physically read from flash. Asking for less than that wouldn't avoid the amplification.
Samsung has client, datacenter, and enterprise lines. The PM9A1 is part of the OEM client segment and is about the same as a 980 Pro. Its top speeds (about 7 GB/s read, 5 GB/s write) are better than those of the comparable datacenter-class drive, the PM9A3. Those top speeds come with less consistent performance than you get with a PM9A3 or an enterprise drive like a PM1733 from the same era (early PCIe Gen 4 drives).
map<term_id,
    list<pair<document_id, positions_idx>>
> inverted_index;

and not

map<term_id,
    map<document_id, list<positions_idx>>
> inverted_index;
(or using set<> in lieu of list<> as appropriate)?

I actually think you are right, list<pair<...>> is a bit of a weird choice that doesn't quite convey the data structure well. A map is better.
The most accurate thing would probably be something like map<term_id, map<document_id, pair<document_id, positions_idx>>>, but I corrected it to just a map<document_id, positions_idx> to avoid making things too confusing.
map<term_id,
map<pair<document_id, positions_idx>>
inverted_index;
list<positions> positions;

Think you also meant to remove the pair in map<pair>?
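For reference, a minimal compilable sketch of the corrected shape described above (map<term_id, map<document_id, positions_idx>>); the concrete id types and the separate positions store are assumptions for illustration:

    #include <cstddef>
    #include <cstdint>
    #include <map>
    #include <vector>

    using term_id       = std::uint32_t;  // assumed integer ids
    using document_id   = std::uint32_t;
    using positions_idx = std::size_t;    // index into the positions store below

    // One flat store of per-(term, document) position lists; positions_idx points into it.
    std::vector<std::vector<std::uint32_t>> positions;

    // term -> (document -> index of that document's position list)
    std::map<term_id, std::map<document_id, positions_idx>> inverted_index;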