Old stackoverflow answers are a dangerous form of bit-rot. They get picked up by well-meaning developers and LLMs alike and recreated years after they are out of date.
https://github.com/dotnet/runtime/issues/114047#issuecomment...
Why do you have to call it more than 50 times before it gets fully optimized?? Is the decision-maker completely unaware of the execution time?
The OSR transition happens here, but between .NET 8 and .NET 9 some aspects of loop optimizations in OSR code regressed.
With BenchmarkDotNet it may not be obvious which scenario you intend to measure and which one you end up measuring. BDN runs the benchmark method enough times to exceed some overall "goal" time for measuring (250 ms I think). This may require many calls or may just require one.
There are also often multiple concrete types that can be passed in; optimising for one will not help if the method is also being called with other concrete types.
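If what you actually want to measure is only the fully optimized code, here's a minimal sketch of the knobs that take tiering out of the equation (the environment variables and the attribute are real .NET switches; the Sum method is just a made-up placeholder):

    // Process-wide, set before launching the benchmark:
    //   DOTNET_TieredCompilation=0   -- compile everything at full optimization from the start
    //   DOTNET_TieredPGO=0           -- disable dynamic PGO
    //
    // Or per method, ask the JIT to skip Tier 0 for just this method:
    using System.Runtime.CompilerServices;

    static class Hot
    {
        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public static long Sum(int[] values)
        {
            long total = 0;
            foreach (int v in values)
                total += v;
            return total;
        }
    }

Note that disabling tiering also changes what you measure: you lose Tier 0 startup behaviour and dynamic PGO, so it only makes sense when steady-state code quality is the question.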
I don't buy that logic.
It can use the length of the function to estimate how long it will take.
It can estimate the time savings by the total amount of time the function uses. Time used is a far better metric than call count. And the math to track it is not significantly more complicated than a counter.
> It can use the length of the function to estimate how long it will take.
Ah, yes, because a function that defines and then prints a 10,000-line string will take 1,000x longer to run than a 10-line function which does matrix multiplication over several billion elements. It is naive either way.
If you read the linked conversation, you'll notice that there are multiple factors at play.
Here's the document that roughly outlines the tiered compilation and DPGO flows: https://github.com/dotnet/runtime/blob/main/docs/design/feat... Note that it may be slightly dated, since the exact tuning is subject to change between releases.
Regardless of how competent a programmer you are, you don't necessarily possess the knowledge/answer to "How to find open ports on Linux" or "How to enumerate child pids of a parent pid" or "What is the most efficient way to compare 2 byte arrays in {insert language}" etc. A search engine or an LLM is a fine solution for those problems.
You know that the answer to that question is what you're after. I'd generally consider knowing the right question to ask to be all that matters; the answer itself is not interesting. It's most likely deeply nested knowledge about how the Linux networking stack works, or how process management works on a particular OS. If that were the central point of the software we're building (for example, if we were a Linux networking stack company), then by all means. It would be silly to find a lead engineer in such a company who is confused about how open ports work in Linux.
Copying code and breaking the license is a liability many companies don't want, which is why some of them block SO in the office.
I’ve seen upvoted answers floating around that purposefully have a backdoor in them (one character away from being a correct answer, so you are vulnerable only if you actually copied and pasted).
I think SO is great, and LLMs too, but any “lead” engineer would try to learn from and vet the content.
BTW: my favorite thing to do after an LLM gives a coding answer: now fix the bug.
The answers are hilarious. Oh, I see the security vulnerabilities. Or oh, this won’t work in an asynchronous environment. Etc., etc. Sometimes you have to be specific about the type of bug you spot (looking at you, Sonnet 3.7). It’s worth adding to your Cursor rules or similar.
Then… you could have a bot that watches for updates to the post in case it was wrong and someone points it out.
There's always dumb morons... sigh.
Even if you don't copy code from SO, it still makes sense to link to it if there is a decent explanation on whatever problem you were facing. When I write code and I hit some issue - particularly if it's some sort of weird ass edge case - I always leave a link to SO, and if it's something that's also known upstream but not fixed yet (common if you use niche stuff), I'll also leave a TODO comment linking to the upstream issue.
Code should not just be code, it should also be a document of knowledge and learning to your next fellow coder who touches the code.
(This also means: FFS do not just link stackoverflow in the git commit history. No one is looking there years later)
Then I learn that this particular syscall depends on a kernel build flag that Debian passes, but Alpine doesn't. You can get it in Alpine if you set that other flag. What are you, a caveman, not knowing that `pctxl: true` is the build flag to enable this feature?
Damn straight. Understand what you're doing or don't do it. Software is bad enough as it is. There's absolutely no room for the incompetent in this game. That science experiment has been done to death and we're certain of the results.
When pressed the developer said they thought their code was "too fast for the oauth server" and that's why it failed about 25% of the time.
The level of disappointment I had when I found the problem was enough to be memorable, but to find the post he flat out copied on Stack Overflow, along with a comment below it highlighting the bug AND the fix, nearly brought me to apoplexy.
In interviews, I’d often ask the interviewee “what is your background” and “do you know that .replace() in JS is unlike .replace() in Java or .Replace() in .NET?” (In JavaScript, .replace() with a string pattern replaces only the first occurrence; the Java and .NET methods replace every occurrence.) That question should make perfect sense to any developer who realizes the word “replace” is somewhat ambiguous. I would always argue that the behavior of Java and .NET is the right behavior, but it’s an ambiguous word nonetheless.
I wonder how it would compare if you passed actual pointers to "memcmp" instead of marshalled arrays. You'd use "fixed (byte *p = bytes) {" on each array first so that the pinning happens outside of the function call.
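For reference, a minimal sketch of what that would look like (my own code, not the article's; it assumes the usual msvcrt memcmp export and a project with AllowUnsafeBlocks enabled):

    using System;
    using System.Runtime.InteropServices;

    static unsafe class NativeCompare
    {
        [DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl)]
        private static extern int memcmp(byte* b1, byte* b2, nuint count);

        public static bool Equals(byte[] a, byte[] b)
        {
            if (a.Length != b.Length)
                return false;
            if (a.Length == 0)
                return true;

            // 'fixed' pins both arrays for the duration of the call,
            // so the P/Invoke doesn't have to marshal or pin them itself.
            fixed (byte* pa = a)
            fixed (byte* pb = b)
            {
                return memcmp(pa, pb, (nuint)a.Length) == 0;
            }
        }
    }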
I think the blog post is quite good at showing that seemingly similar things can have different performance tradeoffs. A follow-up topic might be digging deeper into the why. For example, if you look at the disassembly of the P/Invoke method, you can see the source of the overhead: setting up a P/Invoke frame so the stack is walkable while in native code, doing a GC poll after returning from the native function, and removing the frame.
https://gist.github.com/AustinWise/21d518fee314ad484eeec981a...
Edit: I've now tried it, and it reduced overhead a small amount. (e.g. Average 7.5 ns vs 8 ns for the 10-byte array )
Depending on context and optimization settings we might see that:
- the call is gone entirely,
- a memcmp call has been inlined and turned into a single instruction,
- it's turned into a short loop,
- or, going the other way, a loop has been turned into a memcmp call.
FWIW this is also one of the reasons why I think the VM-by-default / JIT approach holds dotnet back. I find it very hard to be confident about what the assembly actually looks like. Subtly I think it also encourages a "that'll do" mindset up the stack: you're working in an environment where you're not really incentivised to care, so some patterns just don't feel like they'd have happened in a more native language.
Godbolt is your friend as a DPGO-less baseline. Having a JIT is an advantage w.r.t. selecting the best SIMD instruction set.
> Subtly I think it also encourages a "that'll do" mindset up the stack.
What is the basis for this assumption?
On paper, yes, but does anyone really rely on it? Multiversioning is easy to do in an AOT model too, and even then most people don't bother. Obviously sometimes it's critical.
The more magic you put into the JIT, the slower it also gets, so even though there are _loads_ of things you can do with a good JIT, a lot of them don't actually happen in practice.
PGO is one of those things. I've never really encountered it in dotnet but it is basically magic in frontend-bound programs like compilers.
> What is the basis for this assumption?
It's not an assumption, it's my impression of the dotnet ecosystem.
I do also think that some choices somewhat related to JITed-ness (particularly around generics) mean that common patterns in the language can't actually be expressed statically, so one ends up with all kinds of quasi-dynamically typed runtime patterns, e.g. dependency injection. But this is more of a design decision that comes from the same place.
Zeroing and copying, all string operations, comparisons like the ones in the article (whether called or inlined), selecting atomics on ARM64, fusing FP conversions and narrow SSE/AVX operations into masked or vpternlog forms when AVX512VL is available, selecting text search algorithms inside SearchValues<T> (which is what .NET's Regex engine, itself faster than PCRE2-JIT, builds upon), and quite a few places in CoreLib which use the primitives directly: Base64 encoding/decoding, UTF-8 transcoding. The list goes on.
The criticism here is unsubstantiated.
> that mean that common patterns in the language can't actually be expressed statically so one ends up with all kinds of quasi-dynamically typed runtime patterns
This has zero relationship with the underlying compilation model. NativeAOT works just fine and is subject to the limitations you've come to expect when using a language based on LLVM (although .NET, save for the NativeAOT-LLVM WASM target, does not use LLVM, because it is not as good a fit for a language which takes advantage of top-to-bottom GC and type system integration).
I think it is worth understanding which limitations .NET is subject to and which it is not. This sounds a lot like the very standard misconceptions you hear from the C++ crowd.
Some of this is isel, some of this is fairly heavy autovec - does it actually do the latter on the fly? I would've thought that for memcpy and so on you'd drag around a hand-tuned implementation like everyone else (or do what Chrome does and JIT a hand-written IR implementation), since it's going to be so hot.
Does dotnet have loop autovec now? I can get it to unroll a loop, but it seems to stay a plain loop even past where N is in SIMD-heaven territory.
Yes, and it's not constrained by whatever lowest common denominator was chosen at the moment of publishing the application or library.
Anyway
https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
https://github.com/dotnet/runtime/blob/main/src/libraries/Co...
https://godbolt.org/z/MfnWd19n8 (sometimes you get AVX2 cores on Godbolt, sometimes AVX512 so I'm forcing it via NativeAOT for a better example)
Having the runtime pick the optimal instruction set for all the paths above requires exactly zero steps from the user, much like with using DynamicPGO (which is why forms of static PGO are not comparable for the common case).
> autovec
Most performance-critical paths which are not explicitly vectorized in C++ or Rust are either very fragile or not autovectorized at all. If you care about performance, it is way better to have good SIMD abstractions, which is what .NET heavily invests in rather than a (very expensive) loop autovectorization phase. At this point it does almost everything else, but there are way more impactful areas of investment. If you care about SIMD, use Vector128/256/512 and/or platform intrinsics instead for much better results.
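For what it's worth, here is a rough sketch of that explicit style (my own example, not CoreLib's code; the real SequenceEqual is considerably more tuned, using wider vectors and overlapping loads for the tail):

    using System;
    using System.Runtime.InteropServices;
    using System.Runtime.Intrinsics;

    static class SimdCompare
    {
        // Compare two byte spans 16 bytes at a time, with a scalar tail.
        public static bool BytesEqual(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
        {
            if (a.Length != b.Length)
                return false;

            int i = 0;
            if (Vector128.IsHardwareAccelerated)
            {
                ref byte ra = ref MemoryMarshal.GetReference(a);
                ref byte rb = ref MemoryMarshal.GetReference(b);
                for (; i + Vector128<byte>.Count <= a.Length; i += Vector128<byte>.Count)
                {
                    var va = Vector128.LoadUnsafe(ref ra, (nuint)i);
                    var vb = Vector128.LoadUnsafe(ref rb, (nuint)i);
                    if (!Vector128.EqualsAll(va, vb))
                        return false;
                }
            }

            // Scalar tail (also the fallback when SIMD isn't accelerated).
            for (; i < a.Length; i++)
                if (a[i] != b[i])
                    return false;

            return true;
        }
    }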
Although I can't shake off the impression that you are looking for gotchas here and details aren't of particular interest.
I’m pretty sure that this is not 100% correct, since one can also use other allocation methods and use a span to represent the memory. Only with stackalloc will the memory it points to be stack-allocated. What it basically means is that the Span<T> itself is always stack-allocated, but not necessarily the memory it points to.
https://learn.microsoft.com/en-us/dotnet/fundamentals/runtim...
I think a better description of what a Span does is later in the article:
> A Span<T> represents a contiguous region of arbitrary memory. A Span<T> instance is often used to hold the elements of an array or a portion of an array. Unlike an array, however, a Span<T> instance can point to managed memory, native memory, or memory managed on the stack.
The fact that you can only put the Span<T> itself on the stack is a limitation worth knowing (and it is enforced by the compiler), but it is not the most interesting thing about them.
Trying to improve my ability to explain things was part of my motivation for taking up blogging.
It is not uncommon to wrap unmanaged memory in spans. Another popular case, even if it's something most developers don't realize, is read-only spans wrapping constant data embedded in the application binary. For example, if you pass '[1, 2, 3, 4]' to an argument accepting 'ReadOnlySpan<int>', this will just pass a reference to constant data. It also works for new T[] { } as long as T is a primitive and the target of the expression is a read-only span. It's quite prevalent nowadays, and the language tries to get out of your way when doing so.
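To make the different kinds of backing memory concrete, a small self-contained sketch (my example, not from the article or the docs; the native-memory case assumes AllowUnsafeBlocks):

    using System;
    using System.Runtime.InteropServices;

    class SpanSources
    {
        static int Sum(ReadOnlySpan<int> values)
        {
            int total = 0;
            foreach (int v in values)
                total += v;
            return total;
        }

        static unsafe void Main()
        {
            // 1. A collection expression targeting ReadOnlySpan<int> can refer
            //    directly to constant data embedded in the binary - no array allocation.
            Console.WriteLine(Sum([1, 2, 3, 4]));

            // 2. A span over stack memory.
            Span<int> onStack = stackalloc int[4] { 1, 2, 3, 4 };
            Console.WriteLine(Sum(onStack));

            // 3. A span over unmanaged (native) memory.
            IntPtr native = Marshal.AllocHGlobal(4 * sizeof(int));
            try
            {
                var unmanaged = new Span<int>((void*)native, 4);
                unmanaged.Fill(1);
                Console.WriteLine(Sum(unmanaged));
            }
            finally
            {
                Marshal.FreeHGlobal(native);
            }
        }
    }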
Another one beyond all the span stuff (though related) that got added in .NET 9 is AlternateLookup for things like Dictionary and HashSet, where you create a stack-allocated lookup struct that lets you query the collection with span-based keys.
Simple example: if you are building a dictionary while parsing a JSON file, you can look up the span slices directly in the dictionary without having to allocate new strings until you know a key is a distinct value. (Yes, I know you can just use the inbuilt JSON library; this was just the simplest example of the idea I could think of to get the point across.)
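A minimal sketch of the .NET 9 API in question (my example; the input and the counting logic are made up, and it relies on the ordinal string comparer supporting ReadOnlySpan<char> lookups):

    using System;
    using System.Collections.Generic;

    class AlternateLookupExample
    {
        static void Main()
        {
            var counts = new Dictionary<string, int>(StringComparer.Ordinal);
            var lookup = counts.GetAlternateLookup<ReadOnlySpan<char>>();

            ReadOnlySpan<char> input = "red,green,red,blue,green,red";
            while (!input.IsEmpty)
            {
                int comma = input.IndexOf(',');
                ReadOnlySpan<char> token = comma >= 0 ? input[..comma] : input;
                input = comma >= 0 ? input[(comma + 1)..] : default;

                // No string is allocated here unless 'token' turns out to be a new key.
                lookup[token] = lookup.TryGetValue(token, out int n) ? n + 1 : 1;
            }

            foreach (var (color, count) in counts)
                Console.WriteLine($"{color}: {count}");
        }
    }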
All of this builds on top of very powerful portable SIMD primitives and platform intrinsics that ship with the standard library.
Linux in general provides the same speed for pure CPU workloads like generating JSON or HTML responses.
Some I/O operations run about 20% better, especially for small files.
One killer for us was that the Microsoft.Data.SqlClient is 7x slower on Linux and 10x slower on Linux with Docker compared to a plain Windows VM!
That has a net 2x slowdown effect for our applications which completely wipes out the licensing cost benefit when hosted in Azure.
Other database clients have different performance characteristics. Many users have reported that PostgreSQL is consistent across Windows and Linux.
It is probably worth reporting your findings and environment here: https://github.com/dotnet/SqlClient
Although I'm not sure how well-maintained SqlClient is w.r.t. such regressions, as I don't use it.
Also make sure to use the latest version of .NET, and note that if you give a container an anemic 256 MB and 1 core, then under high throughput it won't be able to perform as fast as an application that has an entire host to itself.
This issue has been reported years ago by multiple people and Microsoft has failed to fix it, despite at least two attempts at it.
Basically, only the original C++ clients work with decent efficiency, and the Windows client is just a wrapper around this. The portable “managed”, MARS, and async clients are all buggy (including data corruption) and slow as molasses. This isn’t because of the .NET CLR but because of O(n^2) algorithms in basic packet reassembly steps!
I’ve researched this quite a bit, and a fundamental issue I noticed was that the SQL Client dev team doesn’t test their code for performance with realistic network captures. They replay traces from disk, which is “cheating” because they never see a partial buffer like you would see on an Ethernet network where you get ~1500 bytes per packet instead of 64KB aligned(!) reads from a file.
That may be a bit of an assumption. I've been perpetually surprised by expectation-versus-reality, especially in the database world where very few people publish comparative benchmarks because of the "DeWitt clause": https://en.wikipedia.org/wiki/David_DeWitt
Additionally, a lot of modern DevOps abstractions are most decidedly not zero cost! Containers, Envoys, Ingress, API Management, etc... all add up rapidly, to the point where most applications can't utilise even 1/10th of one CPU core for a single user. The other 90% of the time is lost to networking overheads.
Similarly, the typical developers' concept of "fast" doesn't align with mine. My notion of "fast" is being able to pump nine billion bits per second through a 10 Gbps Ethernet link. I've had people argue until they're blue in the face that that is unrealistic.
It's sooooo good now. Fast, great DX, LINQ, Entity Framework, and more!
But I still come across a lot of folks that think it's still in the .NET Framework days and bound to Windows or requires paid tooling like Visual Studio.
I'm working on a large TypeScript codebase right now (Nest.js + Prisma) and it's actually really, really bad.
Primarily because Prisma generates a ton of intermediate models as output from the schema.
On the other hand, in EF you simply work with the domain model and anonymous types that you transform at the boundary.
Nest.js + Prisma ends up being far more complex than .NET web APIs + EF because of this lack of runtime types. Everything feels like a slog.
But if you knew what you were doing, then for certain kinds of math-heavy code, with aggressive use of low-level features (like raw pointers), you could get within 10% of C++ code, with the general case being that garden-variety, not-super-optimized code ran about half as fast as equivalent C++ code.
I think this ratio has remained pretty consistent over the years.
Obviously as with all such benchmarks the skill of the programmer doing the implementing matters a lot. You can write inefficient clunky code in any language.
https://stackoverflow.com/questions/75309389/which-processor...
SequenceEqual is SIMD accelerated. memcmp is not.
Does memcmp do all of these things? Is msvcrt.dll checking at runtime which extensions the CPU supports?
Because I don't think msvcrt.dll is recompiled per machine.
I think a better test would be to create a DLL in C, expose a custom version of memcmp, and compile that with all the vectorization enabled.
Can C wizards write faster code? I'm sure they can, but I bet it takes longer than writing a.SequenceEqual(b) and moving on to the next feature, safe in the knowledge that the standard library is taking care of business.
"Your standard library is more heavily optimised" isn't exactly a gotcha. Yes, the JIT nature of .NET means that it can leverage processor features at runtime, but that is a benefit to being compiled JIT.
It's possible for a C implementation to check the CPU at dynamic link time (when the DLL is loaded) and select which memcmp gets linked.
The most heavily used libc string functions also have a tendency to use SIMD when the data sizes and offsets align, and fall back to the slow path for any odd/unaligned bytes.
I don't know to what extent MSVCRT is using these techniques. Probably some.
Also, it's common for a compiler to recognize references to common string functions and not even emit a call to a shared library, but provide an inline implementation.
msvcrt.dll is the C runtime from the VC++ 6 days; a modern C app (as in, one compiled against a VC++ released in the last 10 years) would use the universal runtime, ucrt.dll. That said, stuff like memcpy or memcmp is normally a compiler intrinsic, and the library version is there only so that you can take a pointer to it and do other such things that require an actual function.
The logic which decides which path to use is here https://github.com/dotnet/runtime/blob/main/src/libraries/Sy... and here https://github.com/dotnet/runtime/blob/main/src/coreclr/tool... (this one is used by ILC for NativeAOT but the C++ impl. for the JIT is going to be similar)
The [Intrinsic] annotation is present because such comparisons on strings/arrays/spans are specially recognized in the compiler to be unrolled and inlined whenever one of the arguments has constant length or is a constant string or a span which points to constant data.
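A hedged illustration of the "constant" cases being described (these are my own examples, not runtime code, and whether a given call actually gets unrolled depends on the JIT's tuning):

    using System;

    static class ConstantCompares
    {
        // "GET "u8 is a ReadOnlySpan<byte> over constant data baked into the binary.
        public static bool IsGet(ReadOnlySpan<byte> requestLine)
            => requestLine.StartsWith("GET "u8);

        // One side has constant length and contents, so this comparison is a
        // candidate for the unrolling/inlining described above.
        public static bool IsPngMagic(ReadOnlySpan<byte> header)
        {
            ReadOnlySpan<byte> magic = [0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A];
            return header.SequenceEqual(magic);
        }
    }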
I suspect it's that the memcmp in the Visual C++ redistributable isn't as optimised for modern processor instructions as the .NET runtime is.
I'd be interested to see a comparison against a better more optimised runtime library.
Ultimately you're right that neither .NET nor C can magic out performance from a processor that isn't fundamentally there, but it's nice that doing the out-of-the-box approach performs well and doesn't require tricks.
Wonder how many over-eager corporate filters block it outright?
See for example the difference between std::string::operator== and just calling memcmp yourself: https://godbolt.org/z/qn1crox8c
bcmp() is identical to memcmp(3); use it instead.
Too many data points for a bar chart, the colours are far too close together, the colours are easily confused by red-green colourblind users, the colours rotate all the way back to the same yellow/orange/red causing duplicates, and neither the bars nor the colours are in any meaningful kind of order!
Then the table shows nanoseconds to 3 digits of fractional precision, which is insane because no modern CPU has a clock speed above 6 GHz, i.e. a clock period of at least 1/6th of a nanosecond. There is no point showing 1/1000th of a nanosecond!
This is just begging to be a pivot-table, but that's a rare sight outside of the finance department.
Better yet, show clocks-per-byte at different sizes, which is the meaningful number developers are interested in.
Even better yet, take measurements at many more sizes and compute a fit to estimate the fixed overhead (y-intercept) and the clocks-per-byte (slope) and show only those.
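A minimal sketch of that fit (the sample numbers are invented; ordinary least squares over (size, time) pairs gives the fixed overhead as the intercept and the per-byte cost as the slope, which you can divide by the clock period to get clocks-per-byte):

    using System;
    using System.Linq;

    class OverheadFit
    {
        static void Main()
        {
            // Made-up (size in bytes, time in ns) measurements.
            (double bytes, double ns)[] samples =
            {
                (10, 8.0), (100, 12.5), (1_000, 55.0), (10_000, 480.0), (100_000, 4_700.0)
            };

            double meanX = samples.Average(s => s.bytes);
            double meanY = samples.Average(s => s.ns);
            double slope = samples.Sum(s => (s.bytes - meanX) * (s.ns - meanY))
                         / samples.Sum(s => (s.bytes - meanX) * (s.bytes - meanX));
            double intercept = meanY - slope * meanX;

            Console.WriteLine($"fixed overhead ≈ {intercept:F1} ns, per-byte cost ≈ {slope:F4} ns/byte");
        }
    }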