Old stackoverflow answers are a dangerous form of bit-rot. They get picked up by well-meaning developers and LLMs alike and recreated years after they are out of date.
https://github.com/dotnet/runtime/issues/114047#issuecomment...
Why do you have to call it more than 50 times before it gets fully optimized?? Is the decision-maker completely unaware of the execution time?
The OSR transition happens here, but between .NET 8 and .NET 9 some aspects of loop optimizations in OSR code regressed.
With BenchmarkDotNet it may not be obvious which scenario you intend to measure and which one you end up measuring. BDN runs the benchmark method enough times to exceed some overall "goal" time for measuring (250 ms I think). This may require many calls or may just require one.
There are also often multiple concrete types that can be passed in; optimising for one will not help if the method is also being called with other concrete types.
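If what you actually want to measure is only the fully optimized code, here's a minimal sketch of the knobs that take tiering out of the equation (the environment variables and the attribute are real .NET switches; the Sum method is just a made-up placeholder):

    // Process-wide, set before launching the benchmark:
    //   DOTNET_TieredCompilation=0   -- compile everything at full optimization from the start
    //   DOTNET_TieredPGO=0           -- disable dynamic PGO
    //
    // Or per method, ask the JIT to skip Tier 0 for just this method:
    using System.Runtime.CompilerServices;

    static class Hot
    {
        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public static long Sum(int[] values)
        {
            long total = 0;
            foreach (int v in values)
                total += v;
            return total;
        }
    }

Note that disabling tiering also changes what you measure: you lose Tier 0 startup behaviour and dynamic PGO, so it only makes sense when steady-state code quality is the question.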
I don't buy that logic.
It can use the length of the function to estimate how long it will take.
It can estimate the time savings by the total amount of time the function uses. Time used is a far better metric than call count. And the math to track it is not significantly more complicated than a counter.
> It can use the length of the function to estimate how long it will take.
Ah, yes, because a function that defines and then prints a 10,000-line string will take 1,000x longer to run than a 10-line function which does matrix multiplication over several billion elements. It is naive either way.
If you read the linked conversation, you'll notice that there are multiple factors at play.
Here's the document that roughly outlines the tiered compilation and DPGO flows: https://github.com/dotnet/runtime/blob/main/docs/design/feat... Note that it may be slightly dated, since the exact tuning is subject to change between releases.
Regardless of how competent a programmer you are, you don't necessarily possess the knowledge/answer to "How to find open ports on Linux" or "How to enumerate child pids of a parent pid" or "What is the most efficient way to compare 2 byte arrays in {insert language}" etc. A search engine or an LLM is a fine solution for those problems.
You know that the answer to that question is what you're after. I'd generally consider knowing the right question to ask to be all that matters; the answer itself is not interesting. It's most likely deeply nested knowledge about how the Linux networking stack works, or how process management works on a particular OS. If that were the central point of the software we're building (for example, if we were a Linux networking stack company), then by all means. It would be silly to find a lead engineer in such a company who is confused about how open ports work in Linux.
Copying code and breaking the license is a liability many companies don't want, which is why some of them block SO in the office.
I’ve seen upvoted answers floating around that purposefully have a backdoor in them (one character away from being a correct answer, so you are vulnerable only if you actually copied and pasted).
I think SO is great, and LLMs too, but any “lead” engineer would try to learn from and vet the content.
BTW: my favorite thing to do after an LLM gives a coding answer: now fix the bug.
The answers are hilarious. Oh, I see the security vulnerabilities. Or oh, this won’t work in an asynchronous environment. Etc., etc. Sometimes you have to be specific about the type of bug you spot (looking at you, Sonnet 3.7). It’s worth adding to your Cursor rules or similar.
Then… you could have a bot that watches for updates to the post in case it was wrong and someone points it out.
There's always dumb morons... sigh.
Even if you don't copy code from SO, it still makes sense to link to it if there is a decent explanation on whatever problem you were facing. When I write code and I hit some issue - particularly if it's some sort of weird ass edge case - I always leave a link to SO, and if it's something that's also known upstream but not fixed yet (common if you use niche stuff), I'll also leave a TODO comment linking to the upstream issue.
Code should not just be code, it should also be a document of knowledge and learning to your next fellow coder who touches the code.
(This also means: FFS do not just link stackoverflow in the git commit history. No one is looking there years later)
Then I learn that this particular syscall depends on a kernel build flag that Debian passes, but Alpine doesn't. You can get it in Alpine if you set that other flag. What are you, a caveman, not knowing that `pctxl: true` is the build flag to enable this feature?
Damn straight. Understand what you're doing or don't do it. Software is bad enough as it is. There's absolutely no room for the incompetent in this game. That science experiment has been done to death and we're certain of the results.
When pressed the developer said they thought their code was "too fast for the oauth server" and that's why it failed about 25% of the time.
The level of disappointment I had when I found the problem was enough to be memorable, but to find the post he flat out copied on Stack Overflow, along with a comment below it highlighting the bug AND the fix, nearly brought me to apoplexy.
In interviews, I’d often ask the interviewee “what is your background” and “do you know that .replace() in JS is unlike .replace() in Java or .Replace() in .NET?” (In JavaScript, .replace() with a string pattern replaces only the first occurrence; the Java and .NET methods replace every occurrence.) That question should make perfect sense to any developer who realizes the word “replace” is somewhat ambiguous. I would always argue that the behavior of Java and .NET is the right behavior, but it’s an ambiguous word nonetheless.
I wonder how it would compare if you passed actual pointers to "memcmp" instead of marshalled arrays. You'd use "fixed (byte *p = bytes) {" on each array first so that the pinning happens outside of the function call.
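For reference, a minimal sketch of what that would look like (my own code, not the article's; it assumes the usual msvcrt memcmp export and a project with AllowUnsafeBlocks enabled):

    using System;
    using System.Runtime.InteropServices;

    static unsafe class NativeCompare
    {
        [DllImport("msvcrt.dll", CallingConvention = CallingConvention.Cdecl)]
        private static extern int memcmp(byte* b1, byte* b2, nuint count);

        public static bool Equals(byte[] a, byte[] b)
        {
            if (a.Length != b.Length)
                return false;
            if (a.Length == 0)
                return true;

            // 'fixed' pins both arrays for the duration of the call,
            // so the P/Invoke doesn't have to marshal or pin them itself.
            fixed (byte* pa = a)
            fixed (byte* pb = b)
            {
                return memcmp(pa, pb, (nuint)a.Length) == 0;
            }
        }
    }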
I think the blog post is quite good at showing that seemingly similar things can have different performance tradeoffs. A follow-up topic might be digging deeper into the why. For example, if you look at the disassembly of the P/Invoke method, you can see the source of the overhead: setting up a P/Invoke frame so the stack is walkable while in native code, doing a GC poll after returning from the native function, and removing the frame.
https://gist.github.com/AustinWise/21d518fee314ad484eeec981a...
Edit: I've now tried it, and it reduced overhead a small amount. (e.g. Average 7.5 ns vs 8 ns for the 10-byte array )
Depending on context and optimization settings we might see that:
- the call is gone entirely,
- a memcmp call has been inlined and turned into a single instruction,
- it's turned into a short loop,
- or, going the other way, a loop has been turned into a memcmp call.
FWIW this is also one of the reasons why I think the VM-by-default / JIT approach holds dotnet back. I find it very hard to be confident about what the assembly actually looks like. Subtly I think it also encourages a "that'll do" mindset up the stack: you're working in an environment where you're not really incentivised to care, so some patterns just don't feel like they'd have happened in a more native language.
Godbolt is your friend as a DPGO-less baseline. Having a JIT is an advantage w.r.t. selecting the best SIMD instruction set.
> Subtly I think it also encourages a "that'll do" mindset up the stack.
What is the basis for this assumption?
On paper, yes, but does anyone really rely on it? Multiversioning is easy to do in an AOT model too, and even then most people don't bother. Obviously sometimes it's critical.
The more magic you put into the JIT, the slower it also gets, so even though there are _loads_ of things you can do with a good JIT, a lot of them don't actually happen in practice.
PGO is one of those things. I've never really encountered it in dotnet but it is basically magic in frontend-bound programs like compilers.
> What is the basis for this assumption?
It's not an assumption, it's my impression of the dotnet ecosystem.
I do also think that some choices somewhat related to JITed-ness (particularly around generics) mean that common patterns in the language can't actually be expressed statically, so one ends up with all kinds of quasi-dynamically typed runtime patterns, e.g. dependency injection. But this is more of a design decision that comes from the same place.
Zeroing and copying, all string operations, comparisons like the ones in the article (whether called or inlined), selecting atomics on ARM64, fusing FP conversions and narrow SSE/AVX operations into masked or vpternlog forms when AVX512VL is available, selecting text search algorithms inside SearchValues<T> (which is what .NET's Regex engine, itself faster than PCRE2-JIT, builds upon), and quite a few places in CoreLib which use the primitives directly: Base64 encoding/decoding, UTF-8 transcoding. The list goes on.
The criticism here is unsubstantiated.
> that mean that common patterns in the language can't actually be expressed statically so one ends up with all kinds of quasi-dynamically typed runtime patterns
This has zero relationship with the underlying compilation model. NativeAOT works just fine and is subject to the limitations you've come to expect when using a language based on LLVM (although .NET, save for the NativeAOT-LLVM WASM target, does not use LLVM, because it is not as good a fit for a language which takes advantage of top-to-bottom GC and type system integration).
I think it is worth understanding which limitations .NET is subject to and which it is not. This sounds a lot like the very standard misconceptions you hear from the C++ crowd.
Some of this is isel, some of this is fairly heavy autovec - does it actually do the latter on the fly? I would've thought that for memcpy and so on you'd drag around a hand-tuned implementation like everyone else (or do what Chrome does and JIT a hand-written IR implementation), since it's going to be so hot.
Does dotnet have loop autovec now? I can get it to unroll a loop, but it seems to stay a plain loop even past where N is in SIMD-heaven territory.
Yes, and it's not constrained by whatever lowest common denominator was chosen at the moment of publishing the application or library.
Anyway
https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
https://github.com/dotnet/runtime/blob/main/src/libraries/Sy...
https://github.com/dotnet/runtime/blob/main/src/libraries/Co...
https://godbolt.org/z/MfnWd19n8 (sometimes you get AVX2 cores on Godbolt, sometimes AVX512 so I'm forcing it via NativeAOT for a better example)
Having the runtime pick the optimal instruction set for all the paths above requires exactly zero steps from the user, much like with using DynamicPGO (which is why forms of static PGO are not comparable for the common case).
> autovec
Most performance-critical paths which are not explicitly vectorized in C++ or Rust are either very fragile or not autovectorized at all. If you care about performance, it is way better to have good SIMD abstractions, which is what .NET heavily invests in rather than a (very expensive) loop autovectorization phase. At this point it does almost everything else, but there are way more impactful areas of investment. If you care about SIMD, use Vector128/256/512 and/or platform intrinsics instead for much better results.
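For what it's worth, here is a rough sketch of that explicit style (my own example, not CoreLib's code; the real SequenceEqual is considerably more tuned, using wider vectors and overlapping loads for the tail):

    using System;
    using System.Runtime.InteropServices;
    using System.Runtime.Intrinsics;

    static class SimdCompare
    {
        // Compare two byte spans 16 bytes at a time, with a scalar tail.
        public static bool BytesEqual(ReadOnlySpan<byte> a, ReadOnlySpan<byte> b)
        {
            if (a.Length != b.Length)
                return false;

            int i = 0;
            if (Vector128.IsHardwareAccelerated)
            {
                ref byte ra = ref MemoryMarshal.GetReference(a);
                ref byte rb = ref MemoryMarshal.GetReference(b);
                for (; i + Vector128<byte>.Count <= a.Length; i += Vector128<byte>.Count)
                {
                    var va = Vector128.LoadUnsafe(ref ra, (nuint)i);
                    var vb = Vector128.LoadUnsafe(ref rb, (nuint)i);
                    if (!Vector128.EqualsAll(va, vb))
                        return false;
                }
            }

            // Scalar tail (also the fallback when SIMD isn't accelerated).
            for (; i < a.Length; i++)
                if (a[i] != b[i])
                    return false;

            return true;
        }
    }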
Although I can't shake off the impression that you are looking for gotchas here and details aren't of particular interest.
I’m pretty sure that this is not 100% correct, since one can also use other allocation methods and use a span to represent the memory. Only with stackalloc will the memory it points to be stack-allocated. What it basically means is that the Span<T> itself is always stack-allocated, but not necessarily the memory it points to.
https://learn.microsoft.com/en-us/dotnet/fundamentals/runtim...
I think a better description of what a Span does is later in the article:
> A Span<T> represents a contiguous region of arbitrary memory. A Span<T> instance is often used to hold the elements of an array or a portion of an array. Unlike an array, however, a Span<T> instance can point to managed memory, native memory, or memory managed on the stack.
The fact that you can only put the Span<T> itself on the stack is a limitation worth knowing (and it is enforced by the compiler), but it is not the most interesting thing about them.
Trying to improve my ability to explain things was part of my motivation for taking up blogging.
It is not uncommon to wrap unmanaged memory in spans. Another popular case, even if it's something most developers don't realize, is read-only spans wrapping constant data embedded in the application binary. For example, if you pass '[1, 2, 3, 4]' to an argument accepting 'ReadOnlySpan<int>', this will just pass a reference to constant data. It also works for new T[] { } as long as T is a primitive and the target of the expression is a read-only span. It's quite prevalent nowadays, and the language tries to get out of your way when doing so.
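To make the different kinds of backing memory concrete, a small self-contained sketch (my example, not from the article or the docs; the native-memory case assumes AllowUnsafeBlocks):

    using System;
    using System.Runtime.InteropServices;

    class SpanSources
    {
        static int Sum(ReadOnlySpan<int> values)
        {
            int total = 0;
            foreach (int v in values)
                total += v;
            return total;
        }

        static unsafe void Main()
        {
            // 1. A collection expression targeting ReadOnlySpan<int> can refer
            //    directly to constant data embedded in the binary - no array allocation.
            Console.WriteLine(Sum([1, 2, 3, 4]));

            // 2. A span over stack memory.
            Span<int> onStack = stackalloc int[4] { 1, 2, 3, 4 };
            Console.WriteLine(Sum(onStack));

            // 3. A span over unmanaged (native) memory.
            IntPtr native = Marshal.AllocHGlobal(4 * sizeof(int));
            try
            {
                var unmanaged = new Span<int>((void*)native, 4);
                unmanaged.Fill(1);
                Console.WriteLine(Sum(unmanaged));
            }
            finally
            {
                Marshal.FreeHGlobal(native);
            }
        }
    }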
Another one beyond all the span stuff (though related) that got added in .NET 9 is AlternateLookup for things like Dictionary and HashSet, where you create a stack-allocated lookup struct that lets you query the collection with span-based keys.
Simple example: if you are building a dictionary while parsing a JSON file, you can look up the span slices directly in the dictionary without having to allocate new strings until you know a key is a distinct value. (Yes, I know you can just use the inbuilt JSON library; this was just the simplest example of the idea I could think of to get the point across.)
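A minimal sketch of the .NET 9 API in question (my example; the input and the counting logic are made up, and it relies on the ordinal string comparer supporting ReadOnlySpan<char> lookups):

    using System;
    using System.Collections.Generic;

    class AlternateLookupExample
    {
        static void Main()
        {
            var counts = new Dictionary<string, int>(StringComparer.Ordinal);
            var lookup = counts.GetAlternateLookup<ReadOnlySpan<char>>();

            ReadOnlySpan<char> input = "red,green,red,blue,green,red";
            while (!input.IsEmpty)
            {
                int comma = input.IndexOf(',');
                ReadOnlySpan<char> token = comma >= 0 ? input[..comma] : input;
                input = comma >= 0 ? input[(comma + 1)..] : default;

                // No string is allocated here unless 'token' turns out to be a new key.
                lookup[token] = lookup.TryGetValue(token, out int n) ? n + 1 : 1;
            }

            foreach (var (color, count) in counts)
                Console.WriteLine($"{color}: {count}");
        }
    }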
All of this builds on top of very powerful portable SIMD primitives and platform intrinsics that ship with the standard library.
Linux in general provides the same speed for pure CPU workloads like generating JSON or HTML responses.
Some I/O operations run about 20% better, especially for small files.
One killer for us was that the Microsoft.Data.SqlClient is 7x slower on Linux and 10x slower on Linux with Docker compared to a plain Windows VM!
That has a net 2x slowdown effect for our applications which completely wipes out the licensing cost benefit when hosted in Azure.
Other database clients have different performance characteristics. Many users have reported that PostgreSQL is consistent across Windows and Linux.
It is probably worth reporting your findings and environment here: https://github.com/dotnet/SqlClient
Although I'm not sure how well-maintained SqlClient is w.r.t. such regressions, as I don't use it.
Also make sure to use the latest version of .NET, and note that if you give a container an anemic 256 MB and 1 core, then under high throughput it won't be able to perform as fast as an application that has an entire host to itself.
This issue has been reported years ago by multiple people and Microsoft has failed to fix it, despite at least two attempts at it.
Basically, only the original C++ clients work with decent efficiency, and the Windows client is just a wrapper around this. The portable “managed”, MARS, and async clients are all buggy (including data corruption) and slow as molasses. This isn’t because of the .NET CLR but because of O(n^2) algorithms in basic packet reassembly steps!
I’ve researched this quite a bit, and a fundamental issue I noticed was that the SQL Client dev team doesn’t test their code for performance with realistic network captures. They replay traces from disk, which is “cheating” because they never see a partial buffer like you would see on an Ethernet network where you get ~1500 bytes per packet instead of 64KB aligned(!) reads from a file.
That may be a bit of an assumption. I've been perpetually surprised by expectation-versus-reality, especially in the database world where very few people publish comparative benchmarks because of the "DeWitt clause": https://en.wikipedia.org/wiki/David_DeWitt
Additionally, a lot of modern DevOps abstractions are most decidedly not zero cost! Containers, Envoys, Ingress, API Management, etc... all add up rapidly, to the point where most applications can't utilise even 1/10th of one CPU core for a single user. The other 90% of the time is lost to networking overheads.
Similarly, the typical developers' concept of "fast" doesn't align with mine. My notion of "fast" is being able to pump nine billion bits per second through a 10 Gbps Ethernet link. I've had people argue until they're blue in the face that that is unrealistic.
It's sooooo good now. Fast, great DX, LINQ, Entity Framework, and more!
But I still come across a lot of folks that think it's still in the .NET Framework days and bound to Windows or requires paid tooling like Visual Studio.
I'm working on a large TypeScript codebase right now (Nest.js + Prisma) and it's actually really, really bad.
Primarily because Prisma generates a ton of intermediate models as output from the schema.
On the other hand, in EF you simply work with the domain model and anonymous types that you transform at the boundary.
Nest.js + Prisma ends up being far more complex than .NET web APIs + EF because of this lack of runtime types. Everything feels like a slog.
But if you knew what you were doing, then for certain kinds of math-heavy code, with aggressive use of low-level features (like raw pointers), you could get within 10% of C++ code, with the general case being that garden-variety, not-super-optimized code ran about half as fast as equivalent C++ code.
I think this ratio has remained pretty consistent over the years.
Obviously as with all such benchmarks the skill of the programmer doing the implementing matters a lot. You can write inefficient clunky code in any language.
https://stackoverflow.com/questions/75309389/which-processor...
SequenceEqual is SIMD accelerated. memcmp is not.
Does memcmp do all of these things? Is msvcrt.dll checking at runtime which extensions the CPU supports?
Because I don't think msvcrt.dll is recompiled per machine.
I think a better test would be to create a DLL in C, expose a custom version of memcmp, and compile that with all the vectorization enabled.
Can C wizards write faster code? I'm sure they can, but I bet it takes longer than writing a.SequenceEqual(b) and moving on to the next feature, safe in the knowledge that the standard library is taking care of business.
"Your standard library is more heavily optimised" isn't exactly a gotcha. Yes, the JIT nature of .NET means that it can leverage processor features at runtime, but that is a benefit to being compiled JIT.
It's possible for a C implementation to check the CPU at dynamic link time (when the DLL is loaded) and select which memcmp gets linked.
The most heavily used libc string functions also have a tendency to use SIMD when the data sizes and offsets align, and fall back to the slow path for any odd/unaligned bytes.
I don't know to what extent MSVCRT is using these techniques. Probably some.
Also, it's common for a compiler to recognize references to common string functions and not even emit a call to a shared library, but provide an inline implementation.
msvcrt.dll is the C runtime from the VC++ 6 days; a modern C app (as in, one compiled against a VC++ released in the last 10 years) would use the universal runtime, ucrt.dll. That said, stuff like memcpy or memcmp is normally a compiler intrinsic, and the library version is there only so that you can take a pointer to it and do other such things that require an actual function.
The logic which decides which path to use is here https://github.com/dotnet/runtime/blob/main/src/libraries/Sy... and here https://github.com/dotnet/runtime/blob/main/src/coreclr/tool... (this one is used by ILC for NativeAOT but the C++ impl. for the JIT is going to be similar)
The [Intrinsic] annotation is present because such comparisons on strings/arrays/spans are specially recognized in the compiler to be unrolled and inlined whenever one of the arguments has constant length or is a constant string or a span which points to constant data.
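A hedged illustration of the "constant" cases being described (these are my own examples, not runtime code, and whether a given call actually gets unrolled depends on the JIT's tuning):

    using System;

    static class ConstantCompares
    {
        // "GET "u8 is a ReadOnlySpan<byte> over constant data baked into the binary.
        public static bool IsGet(ReadOnlySpan<byte> requestLine)
            => requestLine.StartsWith("GET "u8);

        // One side has constant length and contents, so this comparison is a
        // candidate for the unrolling/inlining described above.
        public static bool IsPngMagic(ReadOnlySpan<byte> header)
        {
            ReadOnlySpan<byte> magic = [0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A];
            return header.SequenceEqual(magic);
        }
    }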
I suspect it's that the memcmp in the Visual C++ redistributable isn't as optimised for modern processor instructions as the .NET runtime is.
I'd be interested to see a comparison against a better more optimised runtime library.
Ultimately you're right that neither .NET nor C can magic out performance from a processor that isn't fundamentally there, but it's nice that doing the out-of-the-box approach performs well and doesn't require tricks.
Wonder how many over-eager corporate filters block it outright?
See for example the difference between std::string::operator== and just calling memcmp yourself: https://godbolt.org/z/qn1crox8c
bcmp() is identical to memcmp(3); use it instead.
Too many data points for a bar chart, the colours are far too close together, the colours are easily confused by red-green colourblind users, the colours rotate all the way back to the same yellow/orange/red causing duplicates, and neither the bars nor the colours are in any meaningful kind of order!
Then the table shows nanoseconds to 3 digits of fractional precision, which is insane because no modern CPU has a clock speed above 6 GHz, i.e. a clock period of at least 1/6th of a nanosecond. There is no point showing 1/1000th of a nanosecond!
This is just begging to be a pivot-table, but that's a rare sight outside of the finance department.
Better yet, show clocks-per-byte at different sizes, which is the meaningful number developers are interested in.
Even better yet, take measurements at many more sizes and compute a fit to estimate the fixed overhead (y-intercept) and the clocks-per-byte (slope) and show only those.
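A minimal sketch of that fit (the sample numbers are invented; ordinary least squares over (size, time) pairs gives the fixed overhead as the intercept and the per-byte cost as the slope, which you can divide by the clock period to get clocks-per-byte):

    using System;
    using System.Linq;

    class OverheadFit
    {
        static void Main()
        {
            // Made-up (size in bytes, time in ns) measurements.
            (double bytes, double ns)[] samples =
            {
                (10, 8.0), (100, 12.5), (1_000, 55.0), (10_000, 480.0), (100_000, 4_700.0)
            };

            double meanX = samples.Average(s => s.bytes);
            double meanY = samples.Average(s => s.ns);
            double slope = samples.Sum(s => (s.bytes - meanX) * (s.ns - meanY))
                         / samples.Sum(s => (s.bytes - meanX) * (s.bytes - meanX));
            double intercept = meanY - slope * meanX;

            Console.WriteLine($"fixed overhead ≈ {intercept:F1} ns, per-byte cost ≈ {slope:F4} ns/byte");
        }
    }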