On all modern processors, that instruction measures wallclock time with a counter that increments at the CPU's base frequency, unaffected by dynamic frequency scaling.
Apart from the great resolution, that method of measuring time has the upside of being very cheap: a couple of orders of magnitude faster than an OS kernel call.
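For the curious, a minimal sketch of what that looks like in C (not from the article; TSC_HZ is a made-up placeholder you'd have to calibrate or query from the OS yourself):

    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtsc() */

    /* Placeholder: the invariant TSC frequency must be calibrated or read
       from the OS; 3.0 GHz here is just an example value. */
    static const double TSC_HZ = 3.0e9;

    double elapsed_seconds(void (*work)(void)) {
        uint64_t start = __rdtsc();   /* cheap: no kernel call involved */
        work();
        uint64_t end = __rdtsc();
        return (double)(end - start) / TSC_HZ;   /* ticks -> wallclock seconds */
    }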
My understanding is that using tsc directly is tricky. The rate might not be constant, and the rate differs across cores. [1]
[1]: https://www.pingcap.com/blog/how-we-trace-a-kv-database-with...
You could CPU-pin the thread that's reading the TSC, except you can't pin threads in OpenBSD :p
OpenBSD actually only implemented this optimization relatively recently. Though most TSCs will be invariant, they still need to be synchronized across cores, and there are other minutiae (sleep states?) that made it a PITA to implement in a reliable way, and OpenBSD doesn't have as much manpower as Linux. Some of those non-obvious issues would be relevant to anyone trying to do this manually, unless they could rely on their specific hardware's behavior.
1. Apparently OpenBSD gave up on trying to fix desync'd TSCs. See https://github.com/openbsd/src/commit/78156938567f79506a923c...
2. Relevant OpenBSD kernel code: https://github.com/openbsd/src/blob/master/sys/arch/amd64/am...
3. Relevant Linux kernel code: https://github.com/torvalds/linux/blob/master/arch/x86/kerne..., https://github.com/torvalds/linux/blob/master/arch/x86/kerne...
4. Linux kernel doc (out-of-date?): https://www.kernel.org/doc/Documentation/virtual/kvm/timekee...
5. Detailed SUSE blog post with many links: https://www.suse.com/c/cpu-isolation-nohz_full-troubleshooti...
6. Linux patch (uncommitted?) to attempt to directly sync TSCs: https://lkml.rescloud.iu.edu/2208.1/00313.html
It would be weird, even for AIX, to support POSIX byte range locks and not the much simpler flock.
I’m using Linux rather than AIX.
fcntl(2) locks are supported (as long as they aren't OFD), but flock(2) locks don't work across nodes.
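For anyone unfamiliar with the distinction, roughly what the two locking APIs look like (a sketch, not my actual code):

    #include <fcntl.h>
    #include <sys/file.h>
    #include <unistd.h>

    /* POSIX byte-range lock via fcntl(2): locks only part of the file. */
    int lock_first_page(int fd) {
        struct flock fl = {
            .l_type   = F_WRLCK,    /* exclusive write lock */
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 4096,       /* just the first 4 KiB */
        };
        return fcntl(fd, F_SETLKW, &fl);   /* blocks until the range is free */
    }

    /* BSD-style whole-file lock via flock(2). */
    int lock_whole_file(int fd) {
        return flock(fd, LOCK_EX);
    }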
The issue ended up being that my multi-threaded code, when running on a single core, pinned that core at 100% CPU usage, as expected, but when running across 4 cores it ran each of them at only 25% usage. This caused the frequency governor to turn the cores down from ~2GHz to 900MHz, making execution even slower than the expected lock contention alone would have. It was a fun mystery to dig into for a while.
I'm not sure of the details for when cores end up with different numbers.
If you want that on Windows, well, it’s possible, but you’re going to have to do it asynchronously from a different thread and also compute the offsets your own damn self[2].
Alternatively, on AMD processors only, starting with Zen 2, you can get the real cycle count with __aperf() or __rdpru(__RDPRU_APERF) or manual inline assembly depending on your compiler. (The official AMD docs will admonish you not to assign meaning to anything but the fraction APERF / MPERF in one place, but the conjunction of what they tell you in other places implies that MPERF must be the reference cycle count and APERF must be the real cycle count.) This is definitely less of a hassle, but in my experience the cap_user_rdpmc method on Linux is much less noisy.
[1] https://man7.org/linux/man-pages/man2/perf_event_open.2.html
[2] https://www.computerenhance.com/p/halloween-spooktacular-day...
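For reference, the perf_event_open route from [1] looks roughly like this on Linux (a sketch, error handling omitted; the exclude_kernel flag is what decides whether time in syscalls gets counted):

    #include <linux/perf_event.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int open_cycle_counter(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.size = sizeof attr;
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;   /* real (unhalted) cycles */
        attr.exclude_kernel = 1;   /* set to 0 to also count cycles spent in syscalls */
        attr.exclude_hv = 1;
        /* this thread, any CPU, no group, no flags */
        return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    /* Usage: read(fd, &count, sizeof(uint64_t)) before and after the code under
       test; or mmap the fd and read the counter in userspace (cap_user_rdpmc). */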
Are you sure about that?
> time spent in syscalls (if you don’t want to count it)
The time spent in syscalls was the main objective the OP was measuring.
> cycle counter
While technically interesting, most of the time when I do a micro-benchmark I only care about wallclock time. Contrary to what you see in search engines and ChatGPT, the RDTSC instruction is not a cycle counter, it's a high-resolution wallclock timer. That instruction was counting CPU cycles some 20 years ago; it doesn't do that anymore.
> Are you sure about that?
> [...] RDTSC instruction is not a cycle counter, it’s a high resolution wallclock timer [...]
So we are in agreement here: with RDTSC you’re not counting cycles, you’re counting seconds. (That’s what I meant by “does not account for frequency scaling”.) I guess there are legitimate reasons to do that, but I’ve found organizing an experimental setup for wall-clock measurements to be excruciatingly difficult: getting 10–20% differences depending on whether your window is open or AC is on, or on how long the rebuild of the benchmark executable took, is not a good time. In a microbenchmark, I’d argue that makes RDTSC the wrong tool even if it’s technically usable with enough work. In other situations, it might be the only tool you have, and then sure, go ahead and use it.
> The time spent in syscalls was the main objective the OP was measuring.
I mean, of course I’m not covering TFA’s use case when I’m only speaking about Linux and Windows, but if you do want to include time in syscalls on Linux that’s also only a flag away. (With a caveat for shared resources—you’re still not counting time in kswapd or interrupt handlers, of course.)
Cycles are a fine thing to measure when trying to reason about pieces of an algorithm and estimate its cost (e.g., latency and throughput tables for assembly instructions are invaluable). They're also a fine thing to measure when frequency scaling is independent of the instructions being executed (since then you can perfectly predict which algorithm will be faster independent of the measurement noise).
That's not the world we live in though. Instructions cause frequency scaling -- some relatively directly (like a cost for switching into heavy avx512 paths on some architectures), some indirectly but predictably (physical limits on moving heat off the chip without cryo units), some indirectly but unpredictably (moving heat out of a laptop casing as you move between having it on your lap and somewhere else). If you just measure instruction counts, you ignore effects like the "faster" algorithm always throttling your CPU 2x because it's too hot.
One of the better use cases for something like RDTSC is when microbenchmarking a subcomponent of a larger algorithm. You take as your prior that no global state is going to affect performance (e.g., not overflowing the branch prediction cache), and then the purpose of the measurement is to compute the delta of your change in situ, measuring _only_ the bits that matter to increase the signal to noise.
In that world, I've never had the variance you describe be a problem. Computers are fast. Just bang a few billion things through your algorithm and compare the distributions. One might be faster on average. One might have better tail latency. Who knows which you'll prefer, but at least you know you actually measured the right thing.
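Concretely, the kind of in-situ measurement I mean looks something like this sketch (compare distributions, not a single mean):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <x86intrin.h>

    #define N 1000000
    static uint64_t samples[N];

    static int cmp_u64(const void *a, const void *b) {
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return (x > y) - (x < y);
    }

    void measure(void (*subcomponent)(void)) {
        for (size_t i = 0; i < N; i++) {
            uint64_t t0 = __rdtsc();
            subcomponent();               /* only the bit that matters */
            samples[i] = __rdtsc() - t0;
        }
        qsort(samples, N, sizeof samples[0], cmp_u64);
        printf("median: %llu ticks, p99: %llu ticks\n",
               (unsigned long long)samples[N / 2],
               (unsigned long long)samples[N / 100 * 99]);
    }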
For that matter, even a stddev of 80% isn't that bad. At $WORK we frequently benchmark the whole application even for changes that could be microbenchmarked. Why? It's easier. Variance doesn't matter if you just run the test longer.
You have a legitimate point in some cases. E.g., maybe a CLI tool does a heavy amount of work for O(1 second). Thermal throttling will never happen in the real world, but a sustained test would have throttling (and also different branch predictions and whatnot), so counting cycles is a reasonable proxy for the thing you actually care about.
I dunno; it's complicated.
1 - https://lore.kernel.org/all/da9e8bee-71c2-4a59-a865-3dd6c5c9...
What do you do on ARM?
[1]: https://developer.arm.com/documentation/102379/0104/The-proc...
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
On arm64 it directly uses the cntvct_el0 register under the hood, but with a standard, easy-to-use API instead of messing about with inline assembly. Also avoids a context switch because it's vDSO.
* https://jdebp.uk/Softwares/djbwares/guide/commands/clockspee...
* https://github.com/jdebp/djbwares/commit/8d2c20930c8700b1786...
Yes, 27 years later it now compiles on a non-Intel architecture. (-:
It's also not as accurate as the High Precision Event Timer (HPET). I'm not sure which platforms gate/expose which these days, but it's a grab bag.
This hasn't been true for about 10 years.
My point about it being disabled on some platforms has historically been true, however.
We do hear that some kernel system calls have moved out from behind the lock over time, but anything requiring core kernel functionality must wait until the kernel lock is released by all other CPUs.
The kernel may be faster in this exercise, but is potentially a constrained resource on a highly loaded system.
This approach is more secure, however.
> ... in the comments everyone ignores the content and argues about the headline.
Surely you must be new to Hacker News…
Lightweight? Yes.
Minimalist? Definitely.
Compact? Sure.
But fast? No.
Would I host a database or fileserver on OpenBSD? Hell no.
Boot times seem to take as long as they did 20 years ago. They are also advocates for every paranoid security mitigation they can dream up that sacrifices speed, and that's OK too.
But let's not pretend it's something it's not.
a C compiler
a web server
3 routing daemons (bgpd, ospfd, ripd)
a mail server (smtpd, spamd)
a sound server (sndiod)
a reverse proxy (relayd)
2 desktop environments (fvwm, cwm)
plus many, many more
OpenBSD is not some minimalist, highly focused operating system. I mean, what on earth is it actually for? Based on the included features, a desktop development system that is also the router and the office web and mail server? Personally I love it: after a fresh install I always feel like I could rebuild the internet from scratch using only what is found in front of me, if I needed to.
Compare cwm with xfwm: while cwm is brutally minimalist in appearance, it has enough extra functionality to be used as its own desktop environment, whereas xfwm requires several separate parts to be usable on the desktop.
Somewhere along the line maybe it will become fast enough, and certain applications may use it for different sets of reasons.
This is ASLR on steroids, and it does vastly increase kernel attack complexity, but it is a computational and I/O load that no version of Linux I know of imposes.
Relinking the C library is relatively quick in comparison.
Known as OpenBSD kernel address randomized link (KARL)[0][1]
Also, libc, and libcrypto are re-linked at boot [2].
And sshd [3].
[0] https://marc.info/?l=openbsd-tech&m=149732026405941
[1] https://news.ycombinator.com/item?id=14709256
I can't say I agree with your implication that the load is significant. My anno 2020 low-power mobile Ryzen relinks the kernel in exactly 9 seconds. Shrug.
It's entirely possible to disable at-boot kernel relinking if one prefers to, and settle for having the kernel relinked each time there's a patch for it.
Try a Raspberry Pi.
That's quite the statement; in reality KARL is a mild inconvenience at best (for a more sober look at it, see https://isopenbsdsecu.re/mitigations/karl/).
This is done by expand_fdtable() in the kernel. It contains the following code:
if (atomic_read(&files->count) > 1)
        synchronize_rcu();
The field files->count is a reference counter. As there are two threads, which share a set of open files between them, its value is 2, meaning that synchronize_rcu() is called here during fdtable expansion. This waits until a full RCU grace period has elapsed, causing a delay in acquiring a new fd for the socket currently being created.

If the fdtable is expanded prior to creating a new thread, as the test program optionally will do by calling dup2(0, 666) if supplied a command-line argument, the synchronize_rcu() call is avoided, because at that point files->count == 1. Therefore, if this is done, there will be no delay later on when creating all the sockets, as the fdtable will already have sufficient capacity.
By contrast, the OpenBSD kernel doesn't have anything like RCU and just uses a rwlock when the file descriptor table of the process is being modified, avoiding the long delay during expansion that may be observed in Linux.
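For context, a sketch of the kind of test program being discussed (not the actual one from the thread): spawn a second thread so that files->count > 1, then time socket creation, optionally pre-expanding the fdtable with dup2(0, 666) first:

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <time.h>
    #include <unistd.h>

    static void *idle_thread(void *arg) { pause(); return arg; }

    int main(int argc, char **argv) {
        if (argc > 1)
            dup2(0, 666);   /* pre-expand the fdtable while files->count == 1 */

        pthread_t t;
        pthread_create(&t, NULL, idle_thread, NULL);   /* now files->count == 2 */

        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int i = 0; i < 256; i++)
            socket(AF_INET, SOCK_STREAM, 0);   /* enough fds to force table growth */
        clock_gettime(CLOCK_MONOTONIC, &b);

        printf("elapsed: %fs\n",
               (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9);
        return 0;
    }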
I guess my question is why would synchronize_rcu take many milliseconds (20+) to run. I would expect that to be in the very low milliseconds or less.
I found this to be a key takeaway of reading the full thread: this is, in part, a benchmark of kernel memory allocation approaches, that surfaces an unforeseen difference in FD performance at a mere 256 x 2 allocs. Presumably we’re seeing a test case distilled down from a real world scenario where this slowdown was traced for some reason?
Context: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
Although:
1. back in application-mode code the language runtime libraries make things look like a POSIX API and maintain their own table mapping object handles to POSIX-like file descriptors, where there is the old contention over the lowest free entries; and
2. in practice the object handle table seems to mostly append, so multiple object-opening threads all contend over the end of the table.
The Rust standard library aborts the program in debug builds when it detects a double-close; at that point corruption may already have occurred, but it's better than nothing.
On EBADF? Neat.
Not by just a bit, but it was a difference between 10MB/s and 100MB/s.
/dev/random on Linux used to stall waiting for entropy from sources of randomness like network jitter, mouse movement, and keyboard typing. /dev/urandom has always been fast on Linux.
Today, Linux's /dev/random mainly uses a CSPRNG after initial seeding. The BSDs always did this. On my laptop, I get over 500MB/s (kernel 6.12).
IIRC, on modern Linux kernels, /dev/urandom is now just an alias for /dev/random, kept for backward compatibility.
Both Linux and BSD use a CSPRNG to satisfy /dev/{urandom,random} and getrandom, and, for future-secrecy/compromise-protection continually update their entropy pools with hashed high-entropy events (there's ~essentially no practical cryptographic reason a "seeded" CSPRNG ever needs to be rekeyed, but there are practical systems security reasons to do it).
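In practice the interface to that CSPRNG is getrandom(2)/getentropy(2) rather than the device nodes. Roughly (a sketch):

    #include <sys/random.h>   /* getrandom(2), glibc >= 2.25 */
    #include <sys/types.h>

    /* Pulls from the same kernel CSPRNG that backs /dev/urandom;
       blocks only until the pool has been seeded once at boot.
       (Short reads are possible for very large requests.) */
    int fill_random(void *buf, size_t len) {
        ssize_t n = getrandom(buf, len, 0);
        return n == (ssize_t)len ? 0 : -1;
    }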
Related, I stumbled down a rabbit hole of PRNGs last year when I discovered [0] that my Mac was way faster at generating UUIDs than my Linux server, even taking architecture and clock speed into account. Turns out glibc didn’t get arc4random until 2.36, and the version of Debian I had at the time didn’t have 2.36. In contrast, since MacOS is BSD-based, it’s had it for quite some time.
[0]: https://gist.github.com/stephanGarland/f6b7a13585c0caf9eb64b...
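For the curious, a UUIDv4 from arc4random_buf is just 16 random bytes plus two bit tweaks; a sketch (works on glibc >= 2.36 and on the BSDs/macOS):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>   /* arc4random_buf() */

    void uuid4(char out[37]) {
        uint8_t b[16];
        arc4random_buf(b, sizeof b);
        b[6] = (b[6] & 0x0f) | 0x40;   /* version 4 */
        b[8] = (b[8] & 0x3f) | 0x80;   /* RFC 4122 variant */
        snprintf(out, 37,
                 "%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-"
                 "%02x%02x%02x%02x%02x%02x",
                 b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7],
                 b[8], b[9], b[10], b[11], b[12], b[13], b[14], b[15]);
    }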
For all I know BSD could be doing 31*last or something similar.
The algorithm is also free to change.
But that's also why the rng stuff was so much faster. There was a long period of time where the Linux dev in charge of randomness believed a lot of voodoo instead of actual security practices, and chose nonsense slow systems instead of well-researched fast ones. Linux has finally moved into the modern era, but there was a long period where the randomness features were far inferior to systems built by people with a security background.
(I was involved, somewhat peripherally, in OpenBSD security during the era of the big OpenBSD Security Audit).
Ultimately, they suffer from a lack of developer resources.
Which is a shame because it's a wonderfully integrated system (as opposed to the tattered quilt that is every Linux distro). But I suspect it's the project leadership that keeps more people away.
/s
If you want me to read your site, and you want to put detailed technical information on it, please don't add things like this without an off switch.
Maybe the behavior changed.
Could it be
https://web.archive.org/web/20031020054211if_/http://bulk.fe...
or is there another one
linux: elapsed: 0.019895s
nanos (running on said linux): elapsed: 0.000886s
Triggered on Safari on a Mac.
(I don't even know whether you can actually start mainstream games on BSD or not.)
Also latency is frequently worse on Linux. I play a lot of quick twitch games on Linux and Windows and while fps and frame times are generally in the same ballpark, latency is far higher.
Another problem is that Proton compatibility is all over the place. Some of the games Valve said were certified don't actually work well, mods can be problematic, and you generally end up faffing with custom launch options to get things working well.
This goes to show that the experience with Proton across different hardware and system configurations is highly individual, but also that games can indeed run better under WINE or Proton than on the system they were made for.
Often for games that don't work with modern Windows there are fan patches/mods that fix these issues.
Modern games, on the other hand, frequently have weird framerate issues that rarely happen on Windows. When I'm playing a fast-twitch multiplayer game I don't want the framerate to randomly dip.
I was gaming exclusively on Linux from 2019 and gave up earlier this year. I wanted to play Red Alert 2, and trying to work out what to do with Wine and all the other stuff was a PITA. It was all easy on Windows.
Now time to read the actual linked discussion.
[1] Off-by-one. To be more precise, the state established by the dup2 is (667 > 256 * 2), or rather (667 > 3 + 256 * 2).
[2] Presumably what OpenBSD is using. I'd be surprised if they've already imported and adopted FreeBSD's approach mentioned in the linked discussion, notwithstanding that OpenBSD has been on an MP scalability tear the past few years.
FreeBSD's equivalent implementation lives here (it does all of this under an exclusive lock): http://fxr.watson.org/fxr/source/kern/kern_descrip.c#L1882
It doesn't use RCU, but does do a kind of bespoke refcounted retention of old versions of the fdtable on a free list.
Faster is all relative. What are you doing? Is it networking? Then BSD is probably faster than Linux. Is it something Linux is optimized for? Then probably Linux.
A general benchmark? Who knows, but does it really matter?
At the end of the day, you should benchmark your own workload, but also it's important to realize that in this day and age, it's almost never the OS that is the bottleneck. It's almost always a remote network call.
the author failed the first step.
everything that follows is then garbage.
the first thing that you learn in that area is that unless you're comparing a specific configuration of apples to apples, along with a specific configuration of the workloads, you're basically doing child's play.
i mean, it's fine to play and observe, but it's just play. it's not anything to take seriously.
Defaults in current are in etc.amd64/login.conf. https://cvsweb.openbsd.org/cgi-bin/cvsweb/~checkout~/src/etc...
(p.s.: the bubbles are cool, but highly distracting to me, hence I could not read the article in full.)
I don't get why people keep posting and upvoting articles from this user-hostile site.
flak.tedunangst.com##.bl.shooter
flak.tedunangst.com##.br.shooter
flak.tedunangst.com##div.bullet
https://news.ycombinator.com/newsguidelines.html
Because way more people have opinions about e.g. asteroid game scripts on web pages than have opinions on RCUs, these subthreads spread like kudzu.
You don't have to sacrifice usability while expressing personality.
In this case it was distracting though.
Not sure anyone lost anything here, or anyone cares.