https://www.reddit.com/r/Games/comments/v42611/dario_on_twit...
It probably needs generative-AI-based upscaling to high-resolution meshes, textures, and realistic materials to actually achieve a quality improvement.
right, because that's not the point - the point is to have N64-era graphics at 60fps and greater, with widescreen, motion blur, and other things that have high engagement amongst gamers (whether the self-proclaimed gamers like them or not)
I don't think he uses either one in serious code, but if he did, good luck emulating it.
I may sound way out of the loop here, but... How come this was never a problem for older dx9/dx11/GL games and emulators?
Most modern emulators implement a shader cache which stores those shaders as they are encountered, so this "compilation stutter" only happens once per shader - but modern titles can have hundreds or thousands of shaders, which means you're still hitting it pretty consistently across a playthrough. Breath of the Wild stands out as a game where you basically had to run it with a precompiled shader cache, as it was borderline unplayable without one.
Ubershaders act as fallback shaders - an off-the-shelf precompiled generic shader is used in place of the actual one, while the actual one is compiled for use next time - which prevents the stutter at a cost of visual fidelity. If you see an explosion in a game, you get a generic explosion shader instead of the one the game actually uses, until the real one is available in the shader cache.
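A minimal sketch of that caching idea, with invented names rather than any real emulator's code: key on a hash of the observed guest shader/pipeline state, compile synchronously on a miss (that synchronous compile is the one-time hitch), and reuse the result from then on. Real emulators also persist this map to disk between runs.

    #include <cstdint>
    #include <unordered_map>

    struct CompiledShader { int hostProgram = 0; };   // stand-in for a GL/Vulkan handle

    // Placeholder for the expensive host-GPU compile (tens to hundreds of ms).
    CompiledShader CompileForHostGpu(uint64_t guestStateHash) {
        return CompiledShader{ static_cast<int>(guestStateHash & 0xFFFF) };
    }

    class ShaderCache {
    public:
        const CompiledShader& Get(uint64_t guestStateHash) {
            auto it = cache_.find(guestStateHash);
            if (it == cache_.end()) {
                // Miss: the frame stalls here. This is the "compilation stutter",
                // paid once per unique shader and then never again.
                it = cache_.emplace(guestStateHash,
                                    CompileForHostGpu(guestStateHash)).first;
            }
            return it->second;
        }
    private:
        std::unordered_map<uint64_t, CompiledShader> cache_;
    };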
I believe the term was coined by the Dolphin team, who did a pretty good high-level writeup of the feature here: https://de.dolphin-emu.org/blog/2017/07/30/ubershaders/
The classic usage is a single source shader which is specialized using #defines and compiled down to hundreds of shaders. This is what Christer uses in that blog post above (and Aras does as well in his ubershader blog post).
Dolphin used it to mean a single source shader that used runtime branches to cover all the bases as a fallback while a specialized shader was compiled behind the scenes.
The even more modern usage now is a single source shader that only uses runtime branches to cover all the features, without any specialization behind the scenes, and that's what Dario means here.
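To make the distinction concrete, here's a toy contrast between the two flavors (hypothetical GLSL excerpts embedded in C++ strings; none of this is from Christer's, Aras's, or Dolphin's actual code):

    #include <string>

    // Classic usage: one source file, specialized at build time by prepending
    // #defines; every combination becomes its own compiled shader variant.
    const char* kClassicSource = R"(
        // excerpt of a fragment shader body
        #ifdef USE_FOG
            color = mix(color, fogColor, fogFactor);
        #endif
        #ifdef USE_NORMAL_MAP
            normal = sampleNormalMap(uv);
        #endif
    )";

    std::string SpecializeVariant(bool fog, bool normalMap) {
        std::string defines;
        if (fog)       defines += "#define USE_FOG\n";
        if (normalMap) defines += "#define USE_NORMAL_MAP\n";
        return defines + kClassicSource;   // compile each variant as a separate shader
    }

    // Dolphin-style / modern usage: one compiled shader for everything, with the
    // features selected by uniforms and real runtime branches; no variants at all.
    const char* kRuntimeBranchSource = R"(
        // excerpt of a fragment shader body
        uniform int useFog;
        uniform int useNormalMap;
        void shade() {
            if (useNormalMap != 0) normal = sampleNormalMap(uv);
            if (useFog != 0)       color = mix(color, fogColor, fogFactor);
        }
    )";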
1. A global, networked shader cache — where, whenever any instance of the emulator encounters a new shader, it compiles it, and then pushes the KV-pair (ROM hash, target platform, console shader object-code hash)=(target-platform shader object-code) into some KV server somewhere; and some async process comes along periodically to pack all so-far-submitted KV entries with a given (ROM hash, target platform) prefix into shader-cache packfiles. On first load of a game, the emulator fetches the packfile if it exists, and loads the KV pairs from it into the emulator's local KV cache. (In theory, the emulator could also offer the option to fetch global-shader-cache-KV-store "WAL segment" files — chunks of arbitrary global-shader-cache KV writes — as they're published every 15 minutes or so. Or KV entries for given (ROM hash, target) prefixes could be put into message-queue topics named after those prefixes, to which running instances of the emulator could subscribe. These optimizations might be helpful when e.g. many people are playing a just-released ROMhack, where no single person has yet run through the whole game to get it all into the cache. Though, mind you, the ROMhack's shaders could already be cached into the global store before release, if the ROMhacker used the emulator during development... or if they knew about this, and were considerate enough to use some tool created by the emulator dev to explicitly compile + submit their raw shader project files into the global KV store.)
2. Have the emulator (or some separate tool) "mine out" all the [statically-specified] shaders embedded in the ROM, as a one-time process. (Probably not just a binwalk, because arbitrary compression. Instead, think: a concolic execution of the ROM that looks for any call to the "load main-memory region into VRAM as shader" GPU instruction — with a symbolically-emulated memory whose regions have either concrete or abstract values. If the RAM region referenced in this "load as shader" instruction is statically determinable — and the memory in that region has a statically-determinable value on a given code-path — then capture that RAM region.) Precompile all shaders discovered this way to create a "perfect" KV cachefile for the game. Publish this into a DHT (or just a central database) under the ROM's hash. (Think: OpenSubtitles.org)
Mind you, I think the best strategy would actually combine the two approaches — solution #2 can virtually eliminate stutter with a single pre-processing step, but it doesn't allow for caching of dynamically-procedurally-generated shaders. Solution #1 still has stutter for at least one player, one time, for each encountered shader — but it handles the case of dynamic shaders.
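For concreteness, a sketch of what the key/value layout in approach #1 might look like (all names here are invented; this is a proposal, not an existing service):

    #include <array>
    #include <cstdint>
    #include <string>
    #include <vector>

    struct ShaderCacheKey {
        std::array<uint8_t, 32> romHash;          // identifies the game / ROMhack
        std::string             targetPlatform;   // e.g. "spirv" or "dxil", plus version
        std::array<uint8_t, 32> guestShaderHash;  // hash of the console-side object code
    };

    struct ShaderCacheValue {
        std::vector<uint8_t> hostShaderBlob;      // precompiled target-platform code
    };

    // On first load: fetch the packfile for (romHash, targetPlatform) if one exists
    // and import its entries into the local cache. On a local miss during play:
    // compile locally, then asynchronously push the new (key, value) pair upstream.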
The best experience so far is downloading an additional shader cache alongside the ROM - some ROM formats can even bundle the cache with the ROM itself, where it acts like a dictionary and loads straight into the emulator instead of having to be added as a "mod" for that specific game. Adding this to a DHT-type network for "shader sharing" would be great but might open the door to some abuse (shaders run at the hardware level and there are some examples of malicious shaders out there) - plus you'd be exposing the games you're playing to the DHT network.
Anyway - Just a succinct example of the level of effort that goes into making an emulator "just work".
I don't want to be snippy, but — I don't think you understood the rest of the paragraph you're attempting to rebut here, since this is exactly (part of) what I said myself. (I wouldn't blame you if you didn't understand it; the concept of "concolic execution" is probably familiar to maybe ~50000 people worldwide, most of them people doing capital-S Serious static-analysis for work in cryptanalysis, automated code verification, etc.)
To re-explain without the jargon: you wouldn't be "mining" the shaders as data-at-rest; rather, you'd be "running" the ROM under a semi-symbolic (symbolic+concrete — concolic!) interpreter, one that traverses all possible code-paths "just enough" times to see all "categories of states" (think: a loop's base-case vs its inductive case vs its breakout case.) You'd do this so that, for each "path of states" that reaches an instruction that tells the console's GPU "this here memory, this is a shader now", the interpreter could:
1. look back at the path that reached the instruction;
2. reconstruct a sample (i.e. with all irrelevant non-branch-determinant values fixed to placeholders) concrete execution trace; and then
3. concretely "replay" that execution trace, using the emulator itself (but with no IO peripherals hooked up, and at maximum speed, and with no need for cycle-accurate timed waits since inter-core scheduling is pre-determined in the trace);
4. which would, as a side-effect, "construct" each piece of shader object-code into memory — at a place where the interpreter is expecting it, given the symbolic "formula" node that the interpreter saw passed into the instruction ("formula node": an AST subtree built out of SSA-instruction branch-nodes and static-value leaf-nodes, referenced by a versioned Single-Static-Information cell, aliasable into slices within CPU-register ADTs, or into a layered+sparse memory-cell interval-tree ADT);
5. so that the interpreter can then pause concrete emulation at the same "load this as a shader" instruction; reach into the emulator's memory where the "formula node" said to look; and grab the shader object-code out.
If you know how the AFL fuzzer works, you could think of this as combining "smart fuzzing" (i.e. looking at the binary and using it to efficiently discover the "constraint path" of branch-comparison value-ranges that reaches each possible state); with a graph-path-search query that "directs" the fuzzer down only paths that reach states we're interested in (i.e. states that reach a GPU shader load instruction); and with an observing time-travelling debugger/tracer connected, to then actually execute the discovered "interesting" paths up to the "interesting" point, to snapshot the execution state at that point and extract "interesting" data from it.
---
Or, at least, that's how it works in the ideal case.
(In the non-ideal case, it's something you can't resolve because the "formula" contains nodes that reference things the interpreter can't concretely emulate without combinatoric state-space explosion — e.g. "what was loaded from this save file created by an earlier run of the game process"; or maybe "what could possibly be in RAM here" when the game uses multiple threads and IPC, and relies on the console OS to pre-emptively schedule those threads, so that "when a message arrives to a thread's IPC inbox" becomes non-deterministic. So this wouldn't work for every game. But it could work for some. And perhaps more, if you can have your concolic interpreter present a more-stable-than-reality world by e.g. "coercing processors into a fake linear clock that always pulses across the multiple CPU cores in a strict order each cycle"; or "presenting a version of the console's OS/BIOS that does pre-emptive thread scheduling deterministically"; etc.)
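A deliberately tiny sketch of that ideal-case vs. non-ideal-case distinction, using an invented toy representation (no real console ISA; the actual path exploration and constraint solving are elided). Memory cells are either concrete or unknown ("symbolic"); when a path reaches the "load this region as a shader" operation, the shader is captured only if its address, length, and bytes are all concretely determined on that path.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <optional>
    #include <vector>

    struct Cell { bool known = false; uint8_t value = 0; };

    struct LoadShaderOp { uint32_t addr; uint32_t len; bool addrIsConcrete; };

    std::optional<std::vector<uint8_t>> TryCaptureShader(
            const std::vector<Cell>& mem, const LoadShaderOp& op) {
        if (!op.addrIsConcrete)                 // address depends on e.g. save data / IPC
            return std::nullopt;                // -> can't resolve statically
        std::vector<uint8_t> blob;
        for (uint32_t i = 0; i < op.len; ++i) {
            const Cell& c = mem.at(op.addr + i);
            if (!c.known) return std::nullopt;  // bytes not statically determinable
            blob.push_back(c.value);
        }
        return blob;                            // precompile + publish under the ROM hash
    }

    int main() {
        std::vector<Cell> mem(64);
        for (uint32_t i = 16; i < 24; ++i) mem[i] = {true, uint8_t(i)};  // "shader" bytes
        auto ok   = TryCaptureShader(mem, {16, 8, true});   // ideal case: fully concrete
        auto fail = TryCaptureShader(mem, {32, 8, true});   // bytes unknown -> give up
        std::printf("captured %zu shader bytes; second case resolved: %s\n",
                    ok ? ok->size() : std::size_t{0}, fail ? "yes" : "no");
    }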
Compilation stutter was perhaps less noticeable in the DX9/OpenGL 3 era because shaders were less capable, and games relied more on fixed functionality which was implemented directly in the driver. Nowadays, a lot of the legacy API surface is actually implemented by dynamically written and compiled shaders, so you can get shader compilation hitches even when you aren’t using shaders at all.
In the N64 era of consoles, games would write ISA (“microcode”) directly into the GPU’s shared memory, usually via a library. In Nintendo’s case, SGI provided two families of libraries called “Fast3D” and “Turbo3D”. You’d call functions to build a “display list”, which was just a buffer full of instructions that did the math you wanted the GPU to do.
I think a big part of the user-visible difference in stutter is simply the expected complexity of shaders and number of different shaders in an "average" scene - they're 100s of times larger, and CPUs aren't 100s of times faster (and many of the optimization algorithms used are more-than-linear in terms of time vs the input too)
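To put rough, made-up numbers on that: if the optimizer is roughly quadratic in shader size, a shader 100x larger costs on the order of 10,000x more compile time, while a CPU that is, say, 50x faster only buys back a factor of 50; each compile still ends up a couple of hundred times slower in wall-clock terms.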
Modern DXIL and SPIR-V are at a similar level of abstraction to DXBC, and certainly don't "solve" stutter.
In DirectX on PC, shaders have been compiled into an intermediate form going back to Direct3D 8. All of these intermediate forms are lowered into an ISA-specific instruction set by the drivers.
This final compilation step is triggered lazily when a draw happens, so if you are working on a "modern" engine that uses thousands of different material types your choices to handle this are to a) endure a hiccup as these shaders are compiled the first time they are used, b) force compilation at a load stage (usually by doing like a 1x1 pixel draw), or c) restructure the shader infrastructure by going to a megashader or similar.
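A rough sketch of option (b), using made-up engine helpers rather than any real graphics API; the point is just that a bind plus a throwaway draw at load time is what triggers the driver's deferred compile:

    #include <vector>

    struct Material { int vsHandle = 0; int psHandle = 0; };

    // Hypothetical wrappers; a real engine would call D3D/Vulkan/Metal here.
    void BindOffscreen1x1Target()     { /* stub */ }
    void BindShaders(const Material&) { /* stub */ }
    void DrawOnePixel()               { /* stub: any trivial draw will do; it's the
                                           bind + draw that triggers compilation */ }

    void PrewarmPipelines(const std::vector<Material>& allMaterials) {
        BindOffscreen1x1Target();
        for (const Material& m : allMaterials) {
            BindShaders(m);   // first time the driver sees this shader combination
            DrawOnePixel();   // forces the ISA compile now instead of mid-gameplay
        }
    }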
When targeting Apple platforms, you can use the metal-tt tool to precompile your shaders to ISA. You give it a list of target triples and a JSON file that describes your PSO. metal-tt comes with Xcode, and is also available for Windows as part of the Game Porting Toolkit.
Unfortunately, most people don’t do that. They’re spoiled by the Steam monoculture, in which Steam harvests the compiled ISA from gamers’ machines and makes it available on Valve’s CDN.
No, that's not correct. In fact, it's mostly the other way around. Consoles have known hardware and thus games can ship with precompiled shaders. I know this has been done since at least the PS2 era, since I enjoy taking apart game assets.
While on PC, you can't know what GPU is in the consumer device.
For example, Steam has this whole concept of precompiled shader downloads in order to mitigate the effect for the end user.
That's what I said. Consoles ship GPU machine code, PCs ship textual shaders (in the case of OpenGL) or some intermediate representation (DXIL, DXBC, SPIRV, ...)
https://github.com/KhronosGroup/Vulkan-Docs/blob/main/propos...
The gist of it is that graphics APIs like DX11 were designed around pipelines being compiled in pieces, each piece representing a different stage of the pipeline. These pieces are then linked together at runtime just before the draw call. However, the pieces are rarely a perfect fit, requiring the driver to patch them or do further compilation, which can introduce stuttering.
In an attempt to further reduce stuttering and to reduce complexity for the driver, Vulkan did away with these piece-meal pipelines and opted for monolithic pipeline objects. This allowed the application to pre-compile the full pipeline ahead of time, relieving the driver of having to piece the pipeline together at the last moment.
If implemented correctly you can make a game with virtually no stuttering. DOOM (2016) is a good example where the number of pipeline variants was kept low so it could all be pre-compiled and its gameplay greatly benefits from the stutter-free experience.
This works great for a highly specialized engine with a manageable number of pipeline variants, but for more versatile game engines and for most emulators, pre-compiling all pipelines is untenable: the number of permutations between the different variations of each pipeline stage is simply too great. For these applications there was no option but to compile the full pipeline on demand and cache the result, making the stutter worse than before, since there is no ability to do piece-meal compilation of the pipeline ahead of time.
This gets even worse for emulators that attempt to emulate systems where the pipeline is implemented in fixed-function hardware rather than programmable shaders. On those systems the games don't compile any piece of the pipeline; the game simply writes to a few registers to set the pipeline state right before the draw call. Even piece-meal compilation won't help much here, so ubershaders were used instead to emulate a great number of hardware states in a single pipeline.
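One way to picture that last point (invented register names, not any real console's layout): on a fixed-function guest GPU, the "identity" of a pipeline is just the register state snapshotted at draw time, which the emulator hashes into a key for its compile-and-cache step.

    #include <cstddef>
    #include <cstdint>

    struct GuestPipelineRegs {   // illustrative subset only, not a real console's layout
        uint32_t blendMode;
        uint32_t texCombineStages;
        uint32_t alphaTest;
        uint32_t depthCompare;
    };

    // FNV-1a over the raw register block: each distinct register state becomes a
    // distinct key, and each distinct key is a pipeline to generate, compile and cache.
    uint64_t PipelineKey(const GuestPipelineRegs& regs) {
        const uint8_t* bytes = reinterpret_cast<const uint8_t*>(&regs);
        uint64_t hash = 0xcbf29ce484222325ull;
        for (size_t i = 0; i < sizeof(regs); ++i) {
            hash ^= bytes[i];
            hash *= 0x100000001b3ull;
        }
        return hash;   // look this up in the cache right before the emulated draw call
    }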
Driver caches mean that after everything gets "prewarmed", it won't happen again.
That's not how any modern GPU works though. Instead, you have to emulate this semi-fixed-function pipeline with shaders. Emulators try to generate shader code for the current GPU configuration and compile it, but that takes time and can only be done after the configuration was observed for the first time.

This is where "Ubershaders" enter the scene: they are a single huge shader which implements the complete configurable semi-fixed-function pipeline, so you pass in the configuration registers to the shader and it acts accordingly. Unfortunately, such shaders are huge and slow, so you don't want to use them unless it's necessary. The idea is then to prepare "ubershaders" as fallback, use them whenever you see a new configuration, compile the real shader and cache it, and use the compiled shader once it's available instead of the ubershader, to improve performance again.

A few years ago, the developers of the Dolphin emulator (GameCube/Wii) wrote an extensive blog post about how this works: https://de.dolphin-emu.org/blog/2017/07/30/ubershaders/
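A rough sketch of that fallback-and-swap logic (invented names, far simpler than what Dolphin actually does): draw with the ubershader while the specialized pipeline compiles on a worker thread, then switch over once it's ready.

    #include <chrono>
    #include <cstdint>
    #include <future>
    #include <unordered_map>

    struct Pipeline { int handle = 0; };   // stand-in for a host pipeline/program object

    // Placeholder for the slow specialized compile; runs on a worker thread below.
    Pipeline CompileSpecialized(uint64_t configHash) {
        return Pipeline{ static_cast<int>(configHash & 0xFFFF) };
    }

    class PipelinePicker {
    public:
        explicit PipelinePicker(Pipeline ubershader) : ubershader_(ubershader) {}

        // Called per draw with a hash of the guest's current pipeline configuration.
        Pipeline PickForDraw(uint64_t configHash) {
            if (auto it = ready_.find(configHash); it != ready_.end())
                return it->second;                       // specialized shader: fast path
            auto pending = pending_.find(configHash);
            if (pending == pending_.end()) {
                // New configuration: start compiling in the background.
                pending_.emplace(configHash,
                    std::async(std::launch::async, CompileSpecialized, configHash));
            } else if (pending->second.wait_for(std::chrono::seconds(0)) ==
                       std::future_status::ready) {
                ready_.emplace(configHash, pending->second.get());
                pending_.erase(pending);
                return ready_[configHash];
            }
            return ubershader_;   // slower but correct, and no compilation stall
        }

    private:
        Pipeline ubershader_;
        std::unordered_map<uint64_t, Pipeline> ready_;
        std::unordered_map<uint64_t, std::future<Pipeline>> pending_;
    };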
Only starting with the 3DS/Wii U did Nintendo consoles finally get "real" programmable shaders, in which case you "just" have to translate them to whatever you need for your host system. You still won't know which shaders you'll see until you observe the transfer of the compiled shader code to the emulated GPU. After all, the shader code is compiled ahead of time to GPU instructions, usually during the build process of the game itself; at least for Nintendo consoles, there are SDK tools to do this. This, of course, means there is no compilation happening on the console itself, so there is no stutter caused by shader compilation either - unlike in an emulation of such a console, which has to translate and recompile those shaders on the fly.
> How come this was never a problem for older [...] emulators?
Older emulators had highly inaccurate and/or slow GPU emulation, so this was not really a problem for a long time. Only once GPU emulation became accurate enough, using dynamically generated shaders for high performance, did shader compilation stutter become a real problem.
The N64 did in fact have a fully programmable pipeline. [1] At boot, the game initialized the RSP (the N64’s GPU) with “microcode”, which was a program that implemented the RSP’s graphics pipeline. During gameplay, the game uploaded “display lists” of opcodes to the GPU which the microcode interpreted. (I misspoke earlier by referring to these opcodes as “microcode”.) For most of the console’s lifespan, game developers chose between two families of vendor-authored microcode: Fast3D and Turbo3D. Toward the end, some developers (notably Factor5) wrote their own microcode.
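For flavor, a toy illustration of what "building a display list" amounts to (invented opcode names, not the real Fast3D/Turbo3D command set): the CPU appends fixed-size command words into a buffer, and the RSP microcode later walks that buffer and does the math.

    #include <cstdint>
    #include <vector>

    enum class Op : uint32_t { LoadVertices, DrawTriangle, EndList };

    struct Cmd { Op op; uint32_t a, b, c; };

    struct DisplayList {
        std::vector<Cmd> cmds;
        void LoadVertices(uint32_t ramAddr, uint32_t count) {
            cmds.push_back({Op::LoadVertices, ramAddr, count, 0});
        }
        void Triangle(uint32_t v0, uint32_t v1, uint32_t v2) {
            cmds.push_back({Op::DrawTriangle, v0, v1, v2});
        }
        void End() { cmds.push_back({Op::EndList, 0, 0, 0}); }
    };

    int main() {
        DisplayList dl;
        dl.LoadVertices(/*RAM addr*/ 0x80200000u, /*count*/ 3);
        dl.Triangle(0, 1, 2);
        dl.End();
        // The game would now hand dl.cmds to the RSP, whose microcode interprets it.
    }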
https://www.unrealengine.com/en-US/tech-blog/game-engines-an...