I built a 2x faster lexer, then discovered I/O was the real bottleneck
41 points | 4 days ago | 8 comments | modulovalue.com
marginalia_nu
4 hours ago
[-]
Zip with no compression is a nice contender for a container format that shouldn't be slept on. It effectively reduces the I/O while, unlike TAR, allowing direct random access to the files without "extracting" them or seeking through the entire file. This works even via mmap, over HTTP range queries, etc.

You can still get the compression benefits by serving files with Content-Encoding: gzip or whatever. Though zip has built-in compression, you can simply not use it and apply external compression instead, especially over the wire.

It's pretty widely used, though often dressed up as something else. JAR files or APK files or whatever.

I think the article's complaints about lacking unix access rights and metadata are a bit strange. That seems like a feature more than a bug, as I wouldn't expect this to be something that transfers between machines. I don't want to unpack an archive and have to scrutinize it for files with o+rxst permissions, or have their creation date be anything other than when I unpacked them.

reply
1718627440
2 hours ago
[-]
Isn't this what is already common in the Python community?

> I don't want to unpack an archive and have to scrutinize it for files with o+rxst permissions, or have their creation date be anything other than when I unpacked them.

I'm the opposite: when I pack and unpack something, I want the files to come out identical, including attributes. Why should I throw away all the timestamps just because the files were temporarily in an archive?

reply
password4321
41 minutes ago
[-]
> Why should I throw away all the timestamps just because the files were temporarily in an archive?

In case anyone is unaware, you don't have to throw away all the timestamps when using "zip with no compression". The metadata for each zipped file includes one timestamp (originally rounded to an even number of seconds, in local time).
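
For anyone curious what that looks like in the format, here's a rough C sketch of the classic MS-DOS date/time packing that zip entries use — the divide-by-two on seconds is where the 2-second resolution comes from (example values are arbitrary):

    #include <stdio.h>
    #include <stdint.h>

    /* Decode the 16-bit MS-DOS time/date fields stored per zip entry.
       Seconds are stored divided by two (hence the 2-second resolution)
       and the values are in local time with no zone recorded. */
    static void decode_dos_timestamp(uint16_t dos_time, uint16_t dos_date) {
        int seconds = (dos_time & 0x1F) * 2;           /* bits 0-4: seconds / 2 */
        int minutes = (dos_time >> 5) & 0x3F;          /* bits 5-10 */
        int hours   = (dos_time >> 11) & 0x1F;         /* bits 11-15 */
        int day     = dos_date & 0x1F;                 /* bits 0-4 */
        int month   = (dos_date >> 5) & 0x0F;          /* bits 5-8 */
        int year    = ((dos_date >> 9) & 0x7F) + 1980; /* bits 9-15, offset from 1980 */
        printf("%04d-%02d-%02d %02d:%02d:%02d (local)\n",
               year, month, day, hours, minutes, seconds);
    }

    int main(void) {
        decode_dos_timestamp(0x6C20, 0x5A21); /* arbitrary example values */
        return 0;
    }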

I am a big last modified timestamp fan and am often discouraged that scp and git are not (by default).

reply
rustyhancock
1 hour ago
[-]
Yes, it's a lossy process.

If your archive drops it, you can't get it back.

If you don't want it, you can just chmod -R u=rw,go=r,a-x

reply
1718627440
1 hour ago
[-]
> If your archive drops it you can't get it back.

Hence the common archive format is tar, not zip.

reply
stabbles
3 hours ago
[-]
> Zip with no compression is a nice contender for a container format that shouldn't be slept on

SquashFS with zstd compression is used by various container runtimes, and is popular in HPC where filesystems often have high latency. It can be mounted natively or with FUSE, and the decompression overhead is not really felt.

reply
ciupicri
1 hour ago
[-]
Wouldn't you still have a lot of syscalls?
reply
stabbles
28 minutes ago
[-]
Yes, but with much lower latency. The squashfs file keeps the files close together, and you benefit a lot from the fs cache.
reply
solatic
1 hour ago
[-]
Headline is wrong. I/O wasn't the bottleneck, syscalls were the bottleneck.

Stupid question: why can't we get a syscall to load an entire directory into an array of file descriptors (minus an array of paths to ignore), instead of calling open() on every individual file in that directory? Seems like the simplest solution, no?

reply
cb321
43 minutes ago
[-]
One aspect of the question is that "permissions" are mostly regulated at the time of open, and user code should check for failures. This was a driving inspiration for the tiny 27-line C virtual machine in https://github.com/c-blake/batch that allows you to, e.g., synthesize a single call that mmaps a whole file (https://github.com/c-blake/batch/blob/64a35b4b35efa8c52afb64...), which seems like it would have also helped the article author.
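
For context, this is the conventional unbatched sequence such a synthesized call would collapse — three separate kernel crossings per file (a plain C sketch, not the batch API itself):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* The usual three round trips to map a whole file read-only:
       open(2), fstat(2) to learn the size, then mmap(2). A batching
       mechanism can issue these as a single submission instead. */
    static void *map_file(const char *path, size_t *len_out) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return NULL; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); close(fd); return NULL; }

        void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd); /* the mapping stays valid after close */
        if (p == MAP_FAILED) { perror("mmap"); return NULL; }

        *len_out = st.st_size;
        return p;
    }
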
reply
levodelellis
34 minutes ago
[-]
Not sure, I'd like that too

You could use io_uring, but IMO that API is annoying and I remember hitting limitations. One thing you could do with io_uring is use openat (the op, not the syscall) with the dir fd (which you get from the regular syscall) so you can asynchronously open and read files; however, you couldn't open directories that way for some reason. There's a chance I'm remembering this wrong.
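
Roughly the pattern described above, sketched with liburing (link with -luring; "example.txt" is just a placeholder, error handling is trimmed, and whether directories can be opened through the ring may depend on the kernel version):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>

    /* Open a file relative to an already-open directory fd through the
       ring, then read from it. Here the openat completion is awaited
       before the read is queued; linked SQEs or fixed files could chain
       the two without the intermediate wait. */
    int main(void) {
        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);

        int dirfd = open(".", O_RDONLY | O_DIRECTORY); /* dir fd via the plain syscall */

        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_openat(sqe, dirfd, "example.txt", O_RDONLY, 0);
        io_uring_submit(&ring);

        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        int fd = cqe->res;                             /* fd, or -errno on failure */
        io_uring_cqe_seen(&ring, cqe);
        if (fd < 0) { fprintf(stderr, "openat failed: %d\n", fd); return 1; }

        char buf[4096];
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof buf, 0);
        io_uring_submit(&ring);
        io_uring_wait_cqe(&ring, &cqe);
        printf("read %d bytes\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }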

reply
stabbles
39 minutes ago
[-]
What comes closest is scandir [1], which gives you the direntries up front and can be used to avoid an lstat syscall for each file.

Otherwise you can open a dir and pass its fd to openat together with a relative path to a file, to reduce the kernel overhead of resolving absolute paths for each file.

[1] https://man7.org/linux/man-pages/man3/scandir.3.html
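
A rough sketch of both ideas combined, using plain readdir here rather than scandir ("src" is just a placeholder directory):

    #define _GNU_SOURCE
    #include <dirent.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Walk a directory once, using d_type from the dirent (where the
       filesystem fills it in) instead of an lstat per entry, and open
       each regular file with openat relative to the dir fd so the
       kernel never re-resolves a full path. */
    int main(void) {
        DIR *dir = opendir("src");            /* placeholder path */
        if (!dir) { perror("opendir"); return 1; }
        int dfd = dirfd(dir);

        struct dirent *de;
        while ((de = readdir(dir)) != NULL) {
            if (de->d_type != DT_REG)         /* may be DT_UNKNOWN on some filesystems */
                continue;
            int fd = openat(dfd, de->d_name, O_RDONLY);
            if (fd < 0) { perror(de->d_name); continue; }
            /* ... read/lex the file here ... */
            close(fd);
        }
        closedir(dir);
        return 0;
    }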

reply
arter45
39 minutes ago
[-]
>why can't we get a syscall to load an entire directory into an array of file descriptors (minus an array of paths to ignore), instead of calling open() on every individual file in that directory?

You mean like a range of file descriptors you could use if you want to save files in that directory?

reply
mnahkies
56 minutes ago
[-]
Something that struck me earlier this week while profiling certain workloads: I'd really like a flame graph that includes wall time spent waiting on I/O, be it a database call, the filesystem, or another RPC.

For example, our integration test suite for a particular service has become quite slow, but it's not particularly clear where the time is going. I suspect a decent amount of it is spent talking to postgres, but I'd like a low-touch way to profile this.

reply
trillic
48 minutes ago
[-]
See if you can wrap the underlying library call (pg.query or whatever it is) with a generic wrapper that logs time spent in the query function. Should be easy in a dynamic lang.
reply
Kuinox
16 minutes ago
[-]
A tracing profiler can do exactly that; you don't need a dynamic lang.
reply
stabbles
4 hours ago
[-]
"I/O is the bottleneck" is only true in the loose sense that "reading files" is slow.

Strictly speaking, the bottleneck was latency, not bandwidth.

reply
jchw
50 minutes ago
[-]
Sounds more like the VFS layer/FS is the bottleneck. It would be interesting to try another FS or operating system to see how it compares.
reply
raggi
3 hours ago
[-]
there are a loooot of languages/compilers for which the most wall-time expensive operation in compilation or loading is stat(2) searching for files
reply
ghthor
3 hours ago
[-]
I actually ran into this issue building dependency graphs of a Go monorepo. We analyzed the CPU trace and found the program was doing a lot of GC, so we reduced allocations. That turned out to be noise, though: the runtime was just making use of time spent waiting on I/O, since we had shelled out to go list to get a JSON dep graph from the CLI program. That is slow due to stat calls and reading from disk. We replaced our usage of go list with a custom package import graph parser built on the std lib parser packages, and instead of reading from disk we give the parser byte blobs from git, also using git ls-files to "stat" the files. I don't remember the specifics, but I believe we brought the time to build the dep graph down from 30-45s to 500ms.
reply
akaltar
4 hours ago
[-]
Amazing article, thanks for sharing. I really appreciate the deep investigations in response to the comments
reply
nudpiedo
1 hour ago
[-]
The same thing applies to other parts of the system:

Compressing the kernel makes it load into RAM faster, even though it still has to execute the decompression step. Why?

Loading from disk into RAM is a bigger bottleneck than decompressing on the CPU.

The same goes for algorithms: always find the largest bottleneck among your dependent steps and apply changes there, since the rest of the pipeline is waiting on it. Often picking the right algorithm "solves it", but the bottleneck may be something else, like waiting on I/O or coordinating across actors (mutexes, if concurrency is done the traditional way).

That's also part of the counterintuitive take that more concurrency brings more overhead and not necessarily faster execution (a topic widely discussed a few years ago around async concurrency and immutable structures).

reply