There was a particular game that was superslow when this tech was applied. Original game loading took around 15-20 seconds, whereas once the tech was applied it took easily 3-5 min, even with all data already downloaded.
When I started digging into it, I realized the reason was the game was using something like
fread(data, 1, 65536, fptr);
instead of fread(data, 65536, 1, fptr);
Which basically expanded back in the day to 65k reads of 1 byte for several MB file. Each fread translated to 65k reads of ReadFile Windows API. Since my code was hooking on ReadFile system call, and my call was heavier than ReadFile, the game loading felt really slow. Unusable. It would have not been fun for players.The easy fix was to swap arguments for certain calls. The long fix required to use an internal cache to account for these cases so that the hooked ReadFile was faster when data was already in disk.
Funny thing is that as we started rolling out the tech and applying it to more and more games we realized lots of games did this. We went for the cache fix and games ended up loading faster than before. Honestly, games could have load all the data in a couple of seconds by just swapping the args. I'm guessing developers did this on purpose so that games seemed like they were loading a lot of stuff, although you never know.
To be fair excel would erase places white that it wanted to write up to 9 times before it drew any black pixels, we made that very fast! we didn't tell them :-)
At the time 24-bit framebuffers were so slow that before we built graphics acceleration hardware people would switch back to 8-bit to get stuff done, making 24-bit/true colour your daily driver was a big step forward.
PS – I am looking through the NuBus cards that I have... did you work for SuperMac or RasterOps?
One of the other bugs (the Quark/ATM one) was also because of the programmers were worried about writing over stuff that hadn't been completely erased, the Quark guys wrote a string with 2 spaces at the end through a box that masked the end of the string, the ATM font renderer saw it couldn't fit the text so it split it in half and tried again so it drew N/2 N/4 N/8 ... strings. It spent all it's time in the 68k's multiply instructions figuring out how wide the strings (and substrings) were, our fancy 24-bit character rendering hardware was an afterthought
I feel like I'm having a stroke trying to read this, what does it mean??
What software did that that badly? If the code asks for (up to) 65,536 single byte items, why would you split that into 65,536 calls?
Also, that change changes behavior. The old call could read anything from zero to 65,536 bytes, the new one only can read zero or 65,536 bytes.
(Reading the source of a few implementations, I think most implementations will fill the output buffer with partial objects if the input doesn’t supply an integral number of them, but the return value of fread cannot signal that to the caller)
But I think the parent comment's point is that the issue is in the implementation of fread itself in the standard library. It's perfectly reasonable for an application to pass it 1, 65536 (i.e. one byte, up to 65536 times) and expect it not to issue 65536 separate OS calls.
No, I'm not saying that's why. I'm simply saying there is a difference between asking for 1 byte or 65k bytes of something. Even dd runs the same under Linux.
dd bs=10k count=1 is faster than bs=1 count=10k
I remember trying to recover some data from a spinning disk, and trying to slowly creep up on the data. So I wanted 1 byte per, I wanted it to nibble, until it hit whatever the errored part was. If I just grabbed the lot, it'd error out from the whole read.
The latter (as usual when comparing OpenBSD and Linux) is more complex, but both multiply count by size and then go their way.
Also, the API contract allows fread to read fewer bytes than requested. I would except any implementation to do that.
But maybe, somebody interpreted the contract differently than major OSes, in the sense that a call isn’t allowed to write partial size-sized chunks to user memory and/or advance the file position further than its return value advocates (that, I think, is something that the implementations above can do, and might be considered a bug)
I've not looked at the code (or even the man pages) and it is a long time since I touched anything that low level, so this might be completely wrong, but if there is an error before the next 64KiB (including just hitting EOF) then the semantics could be different. Asking for 1x64KiB I would expect to just error as there aren't the requested number of bytes. Asking for 64Ki lots of 1 byte might simple error just the same, or it might at least populate the buffer with what it can read, or if the meaning of 1,65536 is actually “up to 64Ki lots of 1B” then it would populate the buffer as far as possible and return the amount read rather than an error condition.
If the per-byte option is slow but still fast enough, and dealing with the semantics is less faf, then people will go for that because the tiny time loss is worth the larger effort reduction. Of course this assumes the underlying system doesn't change, as with the “making local code to run as on-demand networked code” example higher in the thread which changes the relative performance characteristics of the two calling methods significantly.
fread(data, 1, sizeof(buffer), f);
with the rationale that I'm interested in reading sizeof(buffer) individual bytes. The buffer size is incidental, not the size of the items I'm trying to read from the file; "read one item whose size is sizeof(buffer)" seems semantically wrong.Is this just the case of Windows having a bad stdlib fread implementation 15 years ago or is my thinking here actually wrong?
The C runtime authors did (presumably Microsoft, if it's MSVCRT).
He's hooking into ReadFile, a layer below the stdlib. By the time it reaches the hook, it's already split.
But in this case if the code was calling fread 65536 times in a loop and getting 64KiB each time it wouldn't be good either!
Sounds like the parent comment had to fix this with the internal cache thing to speed up the small freads. I think they meant the easy fix would have been swapping the args in the original / caller code.
Edit: mort96: So did you check the return value or not?
I really hope that was not the case and rather think incompetence or to deal with obscure legacy problems, but the gamer in me gets enraged at the thought someone would artificially increase loading times.
https://silentsblog.com/2025/04/23/gta-san-andreas-win11-24h...
The optimizer was allowed, but not obligated, to transform that into: x = a
However, in this case, b was sometimes 0. And if so, the unoptimized version computed: x = a / 0 * 0 = Inf * 0 = NaN
So badness ensued if the that particular path didn't get optimized, which could happen under various circumstances. We had to add some code to ensure that transformation always happened on that game.
- deciding to inform the game developer & wait for reply vs not waiting for reply vs just fixing it yourself without informing the developer; and
- if informed: developer actually fixing it vs only saying they would fix it vs no reply whatsoever (not counting automated "thank you for your inquiry" replies, in cases where you don't already have more direct channels to the dev than email)
I've always kind of wondered this because in a way, it's kind of weird that it's fixed for them, at least for new releases / games actively being developed.
(Full disclosure: I'm a game developer myself, with a very high interest in engine plumbing & dev [including graphics], though finding a job for the latter is easier said than done.)
I have very little understanding on how allocation works at OS level, but I'm surprised there are no wrappers like dgVoodoo or dxWrapper specifically for this kind of issues. There are quite a bunch of old Windows games (Need for Speed 1-4 for a start) that refuse to run on modern OSes due to rather...bold memory management strategies.
[1] - https://www.joelonsoftware.com/2000/05/24/strategy-letter-ii...
It's the life of a (game) developer...
I also remember a media player being called out by name in the code for doing invalid operations, needing a work around and code to detect it was running just to function.
An Nvidia employee once told me that one of the easiest ways to squeeze out a few extra frames on your old machine is to rename the game executable to hl2.exe.
And of course, browser engines also do the same things for certain websites:
https://github.com/WebKit/WebKit/blob/main/Source/WebCore/pa...
https://github.com/WebKit/WebKit/blob/main/Source/WebCore/pa...
What it should do is ensure some things not relevant to Half-Life 2 were not done, thus getting better performance for this game in particular, but there is no guarantee that same optimizations work for other applications or games, so one should not expect an overall improvement.
Unless they are doing some silly things like dropping quality, but that's the "everything else the same" point.
If not, why not have this enabled as default behavior instead?
This seems genuinely unbelievable. Does anyone have a technical explanation for this?
then driver "optimizes" behavior, sometimes dishonestly (reducing precision), sometimes honestly (working around game engine stupidity)
A lot of people use Nvidia profile inspector to enable reBar on all games and claim that Nvidia is purposely holding back performance, but doing this causes many games to crash.
Nvidia probably doesnt officially say anything about this and 99.9% of people do not rename process name
nvidia even has an official api for a game to identify itself so they dont need to look at executable name
Also there was one "that checked if you were printing a specific string used by a popular benchmark program. If so, then it only drew the string a quarter of the time and merely returned without doing anything the other three quarters of the time".
[1]: https://devblogs.microsoft.com/oldnewthing/20040305-00/?p=40...
Windows 95 patched a bug in SimCity just to get it to work.
I agree it would be stupid for a compiler to even support such a flag, but those were the 1980s/90s.
https://www.shlomifish.org/humour/by-others/funroll-loops/Ge...
Actually, the standard way of allocating 64 kB of memory on the stack is to just assume you can do it, subtract 64k from the stack pointer, and hope for the best.
Most stack allocations in the wild are not checked.
That is, until I checked the program I used for testing (which I didn't write), and found the following code:
dealloc(this)
return this->field
With the original allocator, this worked fine, since the deallocation didn't touch the memory.My allocator, however, overwrote the field during the deallocation with bookkeeping stuff, which meant the returned value was not what the programmer intended and after a short while the program crashed.
Unlike TFA, I had the luxury of just fixing the test program.
It means the fix was applied to run during the emulation loop execution, not that the fix was found and applied while the emulation loop was running.
Which would have made it an emulation code escape.
Agreed.
Ah, yes. Microsoft's!
solidity sweating profusely