For example, I have a RAID0 with 4 SSDs (Samsung 990 PRO, so consumer, but quite good for reads). And this is what fio says:
# random reads, 8K, direct IO, depth=1
fio --filename=/dev/md127 --direct=1 --rw=randread --bs=8k --ioengine=libaio --iodepth=1 --runtime=30 --numjobs=1 --time_based --group_reporting --name=iops-test-job --eta-newline=1 --readonly
-> read: IOPS=19.1k, BW=149MiB/s (156MB/s)(4473MiB/30001msec)
# sequential reads, 8K, direct IO, depth=1
fio --filename=/dev/md127 --direct=1 --rw=read --bs=8k --ioengine=io_uring --iodepth=1 --runtime=30 --numjobs=1 --time_based --group_reporting --name=random-1 --eta-newline=1 --readonly
-> read: IOPS=85.5k, BW=668MiB/s (700MB/s)(19.6GiB/30001msec)
With buffered I/O, random reads stay at ~19k IOPS, while sequential reads get to ~1M IOPS (thanks to read-ahead, either at the OS level or in the SSD).
So part of this is sequential reads benefiting from implicit "prefetching", which reduces the observed cost of a page. For random I/O there's no such benefit, so each page looks more expensive.
It's more complex than that (e.g. sequential access also allows issuing larger reads), of course.
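As a quick sanity check on those numbers: for a fixed block size, bandwidth is just IOPS times block size, which is how 19.1k and 85.5k IOPS at 8 KiB line up with ~149 and ~668 MiB/s. A tiny helper (plain awk, nothing device-specific):

```shell
# Bandwidth = IOPS x block size. Convert IOPS at a given block size (KiB)
# into MiB/s to cross-check fio's reported numbers.
iops_to_mib() {
  awk -v iops="$1" -v bs_kib="$2" 'BEGIN { printf "%.0f\n", iops * bs_kib / 1024 }'
}

iops_to_mib 19100 8   # random reads above: ~149 MiB/s
iops_to_mib 85500 8   # sequential reads above: ~668 MiB/s
```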
Manufacturers use many tricks, like caching writes on the drive itself. In my experience, it is rare to find an SSD that actually behaves the way you'd expect.
A solid way of measuring this is using fio with different configurations.
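A minimal sketch of such a sweep, varying access pattern and queue depth (dry run: it only prints the fio commands, and the device path is a placeholder; drop the echo to actually run the jobs):

```shell
# Hypothetical device path; point this at your own disk or array.
DEV=/dev/nvme0n1

# Print one read-only fio invocation per (pattern, queue depth) combo.
sweep() {
  for rw in randread read; do
    for qd in 1 32 256; do
      echo fio --filename="$DEV" --direct=1 --rw="$rw" --bs=8k \
        --ioengine=libaio --iodepth="$qd" --runtime=30 --time_based \
        --numjobs=1 --group_reporting --readonly --name="$rw-qd$qd"
    done
  done
}
sweep
```

Comparing randread vs read, and depth 1 vs 256, is usually enough to see whether a drive (or whatever sits between it and the host) behaves the way its spec sheet claims.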
It does. Just differently.
E.g. a lot of SSDs nowadays cheap out by pairing slower, lower-quality NAND with a smaller cache of faster, higher-quality NAND. Random reads miss that cache a lot more often.
SSDs can also be connected to a machine through RAID cards, enclosures, and so on, and all of this makes a massive difference compared to a proper on-board PCIe connection.
Also, different SSD models obviously have vastly different performance characteristics.
For example, an SSD might look very good at writes until you keep writing past some volume without a break, and then it becomes extremely slow.
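One hedged way to surface that write cliff with fio: write continuously while logging bandwidth over time, and watch for the drop once the drive's fast cache fills. The path and size here are placeholders, and this is destructive to the target file:

```shell
# CAUTION: writes 200 GiB to the target path. Use a scratch file, never a
# device or file you care about.
fio --filename=/mnt/scratch/fio-write-test \
    --size=200G --rw=write --bs=1M --direct=1 \
    --ioengine=libaio --iodepth=8 --numjobs=1 \
    --name=write-cliff \
    --write_bw_log=write-cliff --log_avg_msec=1000
# The resulting bandwidth log shows throughput over time; on cache-heavy
# drives it typically starts high and drops sharply partway through.
```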
I've observed exactly this behavior on a wide range of hardware/environments; it's not specific to particular SSD models (at least not for reads, which is what the blog post was measuring). That's why I showed results from three very different systems.
Some information for the two physical machines:
1) ryzen: Ryzen 9 9900X, RAID0 with 4x Samsung 990 PRO 1TB (in Asus Hyper M.2 Gen5 card)
2) xeon: E5-2699v4, WD Ultrastar DC SN640 960GB (U.3)
I don't know what exactly is backing the SSD storage on the Azure instance.
In general, there are a dizzying number of parameters for both MySQL and Postgres (I assume Oracle and SQL Server as well, but I don’t have experience with them), and many of them can have surprising results. One such example for MySQL is innodb_io_capacity[_max]. The docs [0] say that you should set it to the number of IOPS your system is capable of, and that InnoDB will then use that to guide its background operations. As of version 8.4, the default value has been raised from 200 to 10000. Granted, I haven’t used 8.4 (or 9.x for that matter) in prod, but with 5.7 and 8.0, the advice from Percona [1], and what I’ve found with my own workloads, is to leave it alone - going higher can reduce performance by generating additional write load (and, as the post points out, prematurely wearing out SSDs if you’re running your own).
0: https://dev.mysql.com/doc/refman/8.0/en/innodb-configuring-i...
1: https://www.percona.com/blog/give-love-to-your-ssds-reduce-i...
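For anyone who wants to see where they stand before touching it, a small sketch (assumes a local mysql client with appropriate privileges; the SET is commented out deliberately):

```shell
# Inspect the current InnoDB background-I/O budget. The 8.0 defaults are
# innodb_io_capacity=200 and innodb_io_capacity_max=2000.
mysql -e "SHOW GLOBAL VARIABLES LIKE 'innodb_io_capacity%'"

# Only after measuring (e.g. with fio) and testing under your own
# workload - and per the advice above, probably not even then:
# mysql -e "SET GLOBAL innodb_io_capacity = 1000"
```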
Embarking right now on a long-term embedded storage project and wondering what people actually monitor (apart from SMART and latency/throughput at app or db-level).
I'm curious about ways to automatically tune this live. You can use SET LOCAL to scope a value to a transaction, but that would mean managing it entirely in your application.
That’s actually an interesting idea, now that I think about it. You could run the queries as EXPLAIN in the background, then occasionally test a change out with EXPLAIN ANALYZE before adjusting the settings to use going forward.
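A rough sketch of what one iteration of that could look like, assuming Postgres, with random_page_cost purely as an example knob (the table, value, and connection string are all hypothetical). SET LOCAL keeps the candidate value scoped to the transaction, so nothing leaks out:

```shell
# Compare a query under current settings vs. a candidate value, inside
# one transaction. ROLLBACK discards the SET LOCAL either way.
psql "$DATABASE_URL" <<'SQL'
BEGIN;
-- baseline plan/timing under the current settings
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
-- candidate setting, visible only inside this transaction
SET LOCAL random_page_cost = 1.1;
EXPLAIN ANALYZE SELECT * FROM orders WHERE customer_id = 42;
ROLLBACK;
SQL
```

An automated version would diff the two plans/timings and only then consider changing the setting globally.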