On all modern processors, that instruction measures wallclock time with a counter that increments at the CPU's base frequency, unaffected by dynamic frequency scaling.
Apart from the great resolution, that method of measuring time has the upside of being very cheap: a couple of orders of magnitude faster than an OS kernel call.
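For the curious, a minimal sketch of what that looks like in C (not from the article; TSC_HZ is a made-up placeholder you'd have to calibrate or query from the OS yourself):

    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtsc() */

    /* Placeholder: the invariant TSC frequency must be calibrated or read
       from the OS; 3.0 GHz here is just an example value. */
    static const double TSC_HZ = 3.0e9;

    double elapsed_seconds(void (*work)(void)) {
        uint64_t start = __rdtsc();   /* cheap: no kernel call involved */
        work();
        uint64_t end = __rdtsc();
        return (double)(end - start) / TSC_HZ;   /* ticks -> wallclock seconds */
    }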
My understanding is that using tsc directly is tricky. The rate might not be constant, and the rate differs across cores. [1]
[1]: https://www.pingcap.com/blog/how-we-trace-a-kv-database-with...
You could CPU-pin the thread that's reading the TSC, except you can't pin threads in OpenBSD :p
OpenBSD actually only implemented this optimization relatively recently. Though most TSCs will be invariant, they still need to be synchronized across cores, and there are other minutiae (sleep states?) that made it a PITA to implement in a reliable way, and OpenBSD doesn't have as much manpower as Linux. Some of those non-obvious issues would be relevant to anyone trying to do this manually, unless they could rely on their specific hardware's behavior.
1. Apparently OpenBSD gave up on trying to fix desync'd TSCs. See https://github.com/openbsd/src/commit/78156938567f79506a923c...
2. Relevant OpenBSD kernel code: https://github.com/openbsd/src/blob/master/sys/arch/amd64/am...
3. Relevant Linux kernel code: https://github.com/torvalds/linux/blob/master/arch/x86/kerne..., https://github.com/torvalds/linux/blob/master/arch/x86/kerne...
4. Linux kernel doc (out-of-date?): https://www.kernel.org/doc/Documentation/virtual/kvm/timekee...
5. Detailed SUSE blog post with many links: https://www.suse.com/c/cpu-isolation-nohz_full-troubleshooti...
6. Linux patch (uncommitted?) to attempt to directly sync TSCs: https://lkml.rescloud.iu.edu/2208.1/00313.html
It would be weird, even for AIX, to support POSIX byte range locks and not the much simpler flock.
I’m using Linux rather than AIX.
fcntl(2) locks are supported (as long as they aren't OFD), but flock(2) locks don't work across nodes.
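For anyone unfamiliar with the distinction, roughly what the two locking APIs look like (a sketch, not my actual code):

    #include <fcntl.h>
    #include <sys/file.h>
    #include <unistd.h>

    /* POSIX byte-range lock via fcntl(2): locks only part of the file. */
    int lock_first_page(int fd) {
        struct flock fl = {
            .l_type   = F_WRLCK,    /* exclusive write lock */
            .l_whence = SEEK_SET,
            .l_start  = 0,
            .l_len    = 4096,       /* just the first 4 KiB */
        };
        return fcntl(fd, F_SETLKW, &fl);   /* blocks until the range is free */
    }

    /* BSD-style whole-file lock via flock(2). */
    int lock_whole_file(int fd) {
        return flock(fd, LOCK_EX);
    }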
The issue ended up being that my multi-threaded code, when running on a single core, pinned that core at 100% CPU usage, as expected, but when running across 4 cores it ran each of them at only 25% usage. This caused the frequency governor to turn the cores down from ~2GHz to 900MHz, making execution even slower than the expected lock contention alone would have. It was a fun mystery to dig into for a while.
I'm not sure of the details for when cores end up with different numbers.
If you want that on Windows, well, it’s possible, but you’re going to have to do it asynchronously from a different thread and also compute the offsets your own damn self[2].
Alternatively, on AMD processors only, starting with Zen 2, you can get the real cycle count with __aperf() or __rdpru(__RDPRU_APERF) or manual inline assembly depending on your compiler. (The official AMD docs will admonish you not to assign meaning to anything but the fraction APERF / MPERF in one place, but the conjunction of what they tell you in other places implies that MPERF must be the reference cycle count and APERF must be the real cycle count.) This is definitely less of a hassle, but in my experience the cap_user_rdpmc method on Linux is much less noisy.
[1] https://man7.org/linux/man-pages/man2/perf_event_open.2.html
[2] https://www.computerenhance.com/p/halloween-spooktacular-day...
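For reference, the perf_event_open route from [1] looks roughly like this on Linux (a sketch, error handling omitted; the exclude_kernel flag is what decides whether time in syscalls gets counted):

    #include <linux/perf_event.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    static int open_cycle_counter(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof attr);
        attr.size = sizeof attr;
        attr.type = PERF_TYPE_HARDWARE;
        attr.config = PERF_COUNT_HW_CPU_CYCLES;   /* real (unhalted) cycles */
        attr.exclude_kernel = 1;   /* set to 0 to also count cycles spent in syscalls */
        attr.exclude_hv = 1;
        /* this thread, any CPU, no group, no flags */
        return syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    }

    /* Usage: read(fd, &count, sizeof(uint64_t)) before and after the code under
       test; or mmap the fd and read the counter in userspace (cap_user_rdpmc). */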
Are you sure about that?
> time spent in syscalls (if you don’t want to count it)
The time spent in syscalls was the main objective the OP was measuring.
> cycle counter
While technically interesting, most of the time when I do a micro-benchmark I only care about wallclock time. Contrary to what you see in search engines and ChatGPT, the RDTSC instruction is not a cycle counter, it's a high-resolution wallclock timer. That instruction was counting CPU cycles some 20 years ago; it doesn't do that anymore.
> Are you sure about that?
> [...] RDTSC instruction is not a cycle counter, it’s a high resolution wallclock timer [...]
So we are in agreement here: with RDTSC you’re not counting cycles, you’re counting seconds. (That’s what I meant by “does not account for frequency scaling”.) I guess there are legitimate reasons to do that, but I’ve found organizing an experimental setup for wall-clock measurements to be excruciatingly difficult: getting 10–20% differences depending on whether your window is open or AC is on, or on how long the rebuild of the benchmark executable took, is not a good time. In a microbenchmark, I’d argue that makes RDTSC the wrong tool even if it’s technically usable with enough work. In other situations, it might be the only tool you have, and then sure, go ahead and use it.
> The time spent in syscalls was the main objective the OP was measuring.
I mean, of course I’m not covering TFA’s use case when I’m only speaking about Linux and Windows, but if you do want to include time in syscalls on Linux that’s also only a flag away. (With a caveat for shared resources—you’re still not counting time in kswapd or interrupt handlers, of course.)
Cycles are a fine thing to measure when trying to reason about pieces of an algorithm and estimate its cost (e.g., latency and throughput tables for assembly instructions are invaluable). They're also a fine thing to measure when frequency scaling is independent of the instructions being executed (since then you can perfectly predict which algorithm will be faster independent of the measurement noise).
That's not the world we live in though. Instructions cause frequency scaling -- some relatively directly (like a cost for switching into heavy avx512 paths on some architectures), some indirectly but predictably (physical limits on moving heat off the chip without cryo units), some indirectly but unpredictably (moving heat out of a laptop casing as you move between having it on your lap and somewhere else). If you just measure instruction counts, you ignore effects like the "faster" algorithm always throttling your CPU 2x because it's too hot.
One of the better use cases for something like RDTSC is when microbenchmarking a subcomponent of a larger algorithm. You take as your prior that no global state is going to affect performance (e.g., not overflowing the branch prediction cache), and then the purpose of the measurement is to compute the delta of your change in situ, measuring _only_ the bits that matter to increase the signal to noise.
In that world, I've never had the variance you describe be a problem. Computers are fast. Just bang a few billion things through your algorithm and compare the distributions. One might be faster on average. One might have better tail latency. Who knows which you'll prefer, but at least you know you actually measured the right thing.
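Concretely, the kind of in-situ measurement I mean looks something like this sketch (compare distributions, not a single mean):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <x86intrin.h>

    #define N 1000000
    static uint64_t samples[N];

    static int cmp_u64(const void *a, const void *b) {
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return (x > y) - (x < y);
    }

    void measure(void (*subcomponent)(void)) {
        for (size_t i = 0; i < N; i++) {
            uint64_t t0 = __rdtsc();
            subcomponent();               /* only the bit that matters */
            samples[i] = __rdtsc() - t0;
        }
        qsort(samples, N, sizeof samples[0], cmp_u64);
        printf("median: %llu ticks, p99: %llu ticks\n",
               (unsigned long long)samples[N / 2],
               (unsigned long long)samples[N / 100 * 99]);
    }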
For that matter, even a stddev of 80% isn't that bad. At $WORK we frequently benchmark the whole application even for changes that could be microbenchmarked. Why? It's easier. Variance doesn't matter if you just run the test longer.
You have a legitimate point in some cases. E.g., maybe a CLI tool does a heavy amount of work for O(1 second). Thermal throttling will never happen in the real world, but a sustained test would have throttling (and also different branch predictions and whatnot), so counting cycles is a reasonable proxy for the thing you actually care about.
I dunno; it's complicated.
1 - https://lore.kernel.org/all/da9e8bee-71c2-4a59-a865-3dd6c5c9...
What do you do on ARM?
[1]: https://developer.arm.com/documentation/102379/0104/The-proc...
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
On arm64 it directly uses the cntvct_el0 register under the hood, but with a standard, easy-to-use API instead of messing about with inline assembly. Also avoids a context switch because it's vDSO.
* https://jdebp.uk/Softwares/djbwares/guide/commands/clockspee...
* https://github.com/jdebp/djbwares/commit/8d2c20930c8700b1786...
Yes, 27 years later it now compiles on a non-Intel architecture. (-:
It's also not as accurate as the High Precision Event Timer (HPET). I'm not sure which platforms gate/expose which these days, but it's a grab bag.
This hasn't been true for about 10 years.
My point about it being disabled on some platforms has historically been true, however.
We do hear that some kernel system calls have moved out from behind the lock over time, but anything requiring core kernel functionality must wait until the kernel lock is released by all other CPUs.
The kernel may be faster in this exercise, but is potentially a constrained resource on a highly loaded system.
This approach is more secure, however.
> ... in the comments everyone ignores the content and argues about the headline.
Surely you must be new to Hacker News…
Lightweight? Yes.
Minimalist? Definitely.
Compact? Sure.
But fast? No.
Would I host a database or fileserver on OpenBSD? Hell no.
Boot times seem to take as long as they did 20 years ago. They are also advocates for every paranoid security mitigation they can dream up that sacrifices speed, and that's OK too.
But let's not pretend it's something it's not.
a C compiler
a web server
3 routing daemons (bgpd, ospfd, ripd)
a mail server (smtpd, spamd)
a sound server (sndiod)
a reverse proxy (relayd)
2 desktop environments (fvwm, cwm)
plus many, many more
OpenBSD is not some minimalist, highly focused operating system. I mean, what on earth is it actually for? Based on the included features, a desktop development system that is also the router and the office web and mail server? Personally I love it: after a fresh install I always feel like I could rebuild the internet from scratch using only what is found in front of me, if I needed to.
Compare cwm with xfwm: while cwm is brutally minimalist in appearance, it has enough extra functionality to be used as its own desktop environment, whereas xfwm requires several separate parts to be usable on the desktop.
Somewhere along the line maybe it will become fast enough, and certain applications may use it for different sets of reasons.
This is ASLR on steroids, and it does vastly increase kernel attack complexity, but it is a computational and I/O load that no version of Linux I know of imposes.
Relinking the C library is relatively quick in comparison.
Known as OpenBSD kernel address randomized link (KARL)[0][1]
Also, libc, and libcrypto are re-linked at boot [2].
And sshd [3].
[0] https://marc.info/?l=openbsd-tech&m=149732026405941
[1] https://news.ycombinator.com/item?id=14709256
I can't say I agree with your implication that the load is significant. My anno 2020 low-power mobile Ryzen relinks the kernel in exactly 9 seconds. Shrug.
It's entirely possible to disable at-boot kernel relinking if one prefers to, and settle for having the kernel relinked each time there's a patch for it.
Try a Raspberry Pi.
That's quite the statement; in reality KARL is a mild inconvenience at best (for a more sober look at it, see https://isopenbsdsecu.re/mitigations/karl/).
This is done by expand_fdtable() in the kernel. It contains the following code:
if (atomic_read(&files->count) > 1)
        synchronize_rcu();
The field files->count is a reference counter. As there are two threads, which share a set of open files between them, its value is 2, meaning that synchronize_rcu() is called here during fdtable expansion. This waits until a full RCU grace period has elapsed, causing a delay in acquiring a new fd for the socket currently being created.

If the fdtable is expanded prior to creating a new thread, as the test program optionally will do by calling dup2(0, 666) if supplied a command-line argument, the synchronize_rcu() call is avoided, because at that point files->count == 1. Therefore, if this is done, there will be no delay later on when creating all the sockets, as the fdtable will already have sufficient capacity.
By contrast, the OpenBSD kernel doesn't have anything like RCU and just uses a rwlock when the file descriptor table of the process is being modified, avoiding the long delay during expansion that may be observed in Linux.
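For context, a sketch of the kind of test program being discussed (not the actual one from the thread): spawn a second thread so that files->count > 1, then time socket creation, optionally pre-expanding the fdtable with dup2(0, 666) first:

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <time.h>
    #include <unistd.h>

    static void *idle_thread(void *arg) { pause(); return arg; }

    int main(int argc, char **argv) {
        if (argc > 1)
            dup2(0, 666);   /* pre-expand the fdtable while files->count == 1 */

        pthread_t t;
        pthread_create(&t, NULL, idle_thread, NULL);   /* now files->count == 2 */

        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int i = 0; i < 256; i++)
            socket(AF_INET, SOCK_STREAM, 0);   /* enough fds to force table growth */
        clock_gettime(CLOCK_MONOTONIC, &b);

        printf("elapsed: %fs\n",
               (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9);
        return 0;
    }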
I guess my question is why would synchronize_rcu take many milliseconds (20+) to run. I would expect that to be in the very low milliseconds or less.
I found this to be a key takeaway of reading the full thread: this is, in part, a benchmark of kernel memory allocation approaches, that surfaces an unforeseen difference in FD performance at a mere 256 x 2 allocs. Presumably we’re seeing a test case distilled down from a real world scenario where this slowdown was traced for some reason?
Context: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...
Although:
1. back in application-mode code the language runtime libraries make things look like a POSIX API and maintain their own table mapping object handles to POSIX-like file descriptors, where there is the old contention over the lowest free entries; and
2. in practice the object handle table seems to mostly append, so multiple object-opening threads all contend over the end of the table.
The Rust standard library aborts the program in debug builds when it detects a double-close; at that point corruption may already have occurred, but it's better than nothing.
On EBADF? Neat.
Not by just a bit, but it was a difference between 10MB/s and 100MB/s.
/dev/random on Linux used to stall waiting for entropy from sources of randomness like network jitter, mouse movement, and keyboard typing. /dev/urandom has always been fast on Linux.
Today, Linux's /dev/random mainly uses a CSPRNG after initial seeding. The BSDs always did this. On my laptop, I get over 500MB/s (kernel 6.12).
IIRC, on modern Linux kernels, /dev/urandom is now just an alias for /dev/random, kept for backward compatibility.
Both Linux and BSD use a CSPRNG to satisfy /dev/{urandom,random} and getrandom, and, for future-secrecy/compromise-protection continually update their entropy pools with hashed high-entropy events (there's ~essentially no practical cryptographic reason a "seeded" CSPRNG ever needs to be rekeyed, but there are practical systems security reasons to do it).
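In practice the interface to that CSPRNG is getrandom(2)/getentropy(2) rather than the device nodes. Roughly (a sketch):

    #include <sys/random.h>   /* getrandom(2), glibc >= 2.25 */
    #include <sys/types.h>

    /* Pulls from the same kernel CSPRNG that backs /dev/urandom;
       blocks only until the pool has been seeded once at boot.
       (Short reads are possible for very large requests.) */
    int fill_random(void *buf, size_t len) {
        ssize_t n = getrandom(buf, len, 0);
        return n == (ssize_t)len ? 0 : -1;
    }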
Related, I stumbled down a rabbit hole of PRNGs last year when I discovered [0] that my Mac was way faster at generating UUIDs than my Linux server, even taking architecture and clock speed into account. Turns out glibc didn’t get arc4random until 2.36, and the version of Debian I had at the time didn’t have 2.36. In contrast, since MacOS is BSD-based, it’s had it for quite some time.
[0]: https://gist.github.com/stephanGarland/f6b7a13585c0caf9eb64b...
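For the curious, a UUIDv4 from arc4random_buf is just 16 random bytes plus two bit tweaks; a sketch (works on glibc >= 2.36 and on the BSDs/macOS):

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>   /* arc4random_buf() */

    void uuid4(char out[37]) {
        uint8_t b[16];
        arc4random_buf(b, sizeof b);
        b[6] = (b[6] & 0x0f) | 0x40;   /* version 4 */
        b[8] = (b[8] & 0x3f) | 0x80;   /* RFC 4122 variant */
        snprintf(out, 37,
                 "%02x%02x%02x%02x-%02x%02x-%02x%02x-%02x%02x-"
                 "%02x%02x%02x%02x%02x%02x",
                 b[0], b[1], b[2], b[3], b[4], b[5], b[6], b[7],
                 b[8], b[9], b[10], b[11], b[12], b[13], b[14], b[15]);
    }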
For all I know BSD could be doing 31*last or something similar.
The algorithm is also free to change.
But that's also why the rng stuff was so much faster. There was a long period of time where the Linux dev in charge of randomness believed a lot of voodoo instead of actual security practices, and chose nonsense slow systems instead of well-researched fast ones. Linux has finally moved into the modern era, but there was a long period where the randomness features were far inferior to systems built by people with a security background.
(I was involved, somewhat peripherally, in OpenBSD security during the era of the big OpenBSD Security Audit).
Ultimately, they suffer from a lack of developer resources.
Which is a shame because it's a wonderfully integrated system (as opposed to the tattered quilt that is every Linux distro). But I suspect it's the project leadership that keeps more people away.
/s
If you want me to read your site, and you want to put detailed technical information on it, please don't add things like this without an off switch.
Maybe the behavior changed.
Could it be
https://web.archive.org/web/20031020054211if_/http://bulk.fe...
or is there another one
linux: elapsed: 0.019895s
nanos (running on said linux): elapsed: 0.000886s
Triggered on Safari on a Mac.
(I don't even know whether you can actually start mainstream games on BSD or not.)
Also latency is frequently worse on Linux. I play a lot of quick twitch games on Linux and Windows and while fps and frame times are generally in the same ballpark, latency is far higher.
Another problem is that Proton compatibility is all over the place. Some of the games Valve said were certified don't actually work well, mods can be problematic, and you generally end up faffing with custom launch options to get things working well.
This goes to show that the experience with Proton across different hardware and system configurations is highly individual, but also that games can indeed run better under WINE or Proton than on the system they were made for.
Often for games that don't work with modern Windows there are fan patches/mods that fix these issues.
Modern games, on the other hand, frequently have weird framerate issues that rarely happen on Windows. When I'm playing a fast-twitch multiplayer game I don't want the framerate to randomly dip.
I was gaming exclusively on Linux from 2019 and gave up earlier this year. I wanted to play Red Alert 2, and trying to work out what to do with Wine and all the other stuff was a PITA. It was all easy on Windows.
Now time to read the actual linked discussion.
[1] Off-by-one. To be more precise, the state established by the dup2 is (667 > 256 * 2), or rather (667 > 3 + 256 * 2).
[2] Presumably what OpenBSD is using. I'd be surprised if they've already imported and adopted FreeBSD's approach mentioned in the linked discussion, notwithstanding that OpenBSD has been on an MP scalability tear the past few years.
FreeBSD's equivalent implementation lives here (it does all of this under an exclusive lock): http://fxr.watson.org/fxr/source/kern/kern_descrip.c#L1882
It doesn't use RCU, but does do a kind of bespoke refcounted retention of old versions of the fdtable on a free list.
Faster is all relative. What are you doing? Is it networking? Then BSD is probably faster than Linux. Is it something Linux is optimized for? Then probably Linux.
A general benchmark? Who knows, but does it really matter?
At the end of the day, you should benchmark your own workload, but also it's important to realize that in this day and age, it's almost never the OS that is the bottleneck. It's almost always a remote network call.
the author failed the first step.
everything that follows is then garbage.
the first thing that you learn in that area is that unless you're comparing a specific configuration of apples to apples, along with a specific configuration of the workloads, you're basically doing child's play.
i mean, it's fine to play and observe, but it's just play. it's not anything to take seriously.
Defaults in current are in etc.amd64/login.conf. https://cvsweb.openbsd.org/cgi-bin/cvsweb/~checkout~/src/etc...
(p.s.: the bubbles are cool, but highly distracting to me, hence I could not read the article in full.)
I don't get why people keep posting and upvoting articles from this user-hostile site.
flak.tedunangst.com##.bl.shooter
flak.tedunangst.com##.br.shooter
flak.tedunangst.com##div.bullet
https://news.ycombinator.com/newsguidelines.html
Because way more people have opinions about e.g. asteroid game scripts on web pages than have opinions on RCUs, these subthreads spread like kudzu.
You don't have to sacrifice usability while expressing personality.
In this case it was distracting though.
Not sure anyone lost anything here, or anyone cares.