Greedy algorithms are about making locally optimal choices.
They are not "brute force" algorithms or inefficient ones.
In fact, greedy algorithms are almost always faster, because they consider only local information instead of the entire data set. In exchange, a greedy algorithm may produce non-optimal answers.
The change-making algorithm from working retail is what comes to mind. Greedy means you always use the largest coin you can, then the next largest, and so on. It works with sensible coin denominations, but there are sets of coins for which the greedy algorithm is not optimal.
Looking at cases where greedy isn't optimal, I see two patterns. Either:
* there are two coins close in value (ratio of less than 2), e.g. both 20¢ and 25¢; or
* there is a "missing" coin at the GCD of two larger coins.
I'm pretty sure you can precompute extra "virtual coins" (e.g. 40¢) to make greedy optimal again, but you have to place restrictions on how many of them you're allowed to use.
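To make the failure mode concrete, here's a minimal sketch comparing greedy against an exact dynamic-programming count. The denominations are made up just to include both a 20¢ and a 25¢ coin, so treat it as illustrative only:

#include <algorithm>
#include <cstdio>
#include <vector>

// Greedy: repeatedly take the largest coin that still fits.
int greedyCoinCount(int amount, std::vector<int> coins) {
    std::sort(coins.rbegin(), coins.rend()); // largest first
    int count = 0;
    for (int c : coins) {
        count += amount / c;
        amount %= c;
    }
    return count;
}

// Dynamic programming: the true minimum number of coins.
int optimalCoinCount(int amount, const std::vector<int>& coins) {
    const int INF = 1 << 30;
    std::vector<int> best(amount + 1, INF);
    best[0] = 0;
    for (int a = 1; a <= amount; a++)
        for (int c : coins)
            if (c <= a && best[a - c] + 1 < best[a])
                best[a] = best[a - c] + 1;
    return best[amount];
}

int main() {
    std::vector<int> coins = {1, 5, 20, 25}; // hypothetical set with both 20¢ and 25¢
    int amount = 40;
    std::printf("greedy:  %d coins\n", greedyCoinCount(amount, coins));  // 25+5+5+5 -> 4
    std::printf("optimal: %d coins\n", optimalCoinCount(amount, coins)); // 20+20   -> 2
}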
It's really fascinating to think about how all of this would work.
I think that’s one of those things Jai and Zig agree on: compile-time functions have a place in preventing magic numbers that cannot be debugged.
Square roots are implemented in hardware: https://www.felixcloutier.com/x86/sqrtsd
> In software, you'd normally iterate Newton's Method
Software normally computes trigonometric functions (and other complicated ones, like exponentials and std::erf) with a high-degree polynomial approximation.
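For illustration, a minimal sketch of the idea using the truncated Taylor series for sin on a reduced range, evaluated with Horner's scheme. Real math libraries use minimax-fitted coefficients and argument reduction, so the numbers here are illustrative only:

#include <cmath>
#include <cstdio>

// sin(x) ~ x - x^3/6 + x^5/120 - x^7/5040 for |x| <= pi/4, evaluated with
// Horner's scheme. Production implementations fit better coefficients and
// first reduce any argument into a small range like this one.
double sinPoly(double x) {
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6.0 + x2 * (1.0 / 120.0 + x2 * (-1.0 / 5040.0))));
}

int main() {
    double x = 0.5;
    std::printf("poly: %.12f  libm: %.12f\n", sinPoly(x), std::sin(x));
}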
But how does that hardware implementation work internally?
The point I'm trying to make is that it is probably an (in-hardware) loop that uses Newton's Method.
ETA: The point being that, although in the source code it looks like all loops have been eliminated, they really haven't been if you dig deeper.
I don’t know, but based on the performance difference between FP32 and FP64 square root instructions, the implementation probably produces 4-5 bits of mantissa per cycle.
https://en.wikipedia.org/wiki/Methods_of_computing_square_ro...
Something like Heron's method is a special case of Newton's method.
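For anyone who wants to see the shape of it, here's Heron's method in a few lines. The exponent-halving starting guess and the fixed iteration count are choices I made for this sketch; hardware and libm implementations pick their starting estimates differently:

#include <cmath>
#include <cstdio>

// Heron's method, i.e. Newton's method applied to f(x) = x^2 - a:
//   x_{n+1} = (x_n + a / x_n) / 2
// Each step roughly doubles the number of correct bits, so with a
// reasonable starting guess only a handful of iterations are needed
// for a double.
double heronSqrt(double a) {                 // assumes a > 0
    int e;
    double m = std::frexp(a, &e);            // a = m * 2^e, m in [0.5, 1)
    double x = std::ldexp(m, (e + 1) / 2);   // rough guess: halve the exponent
    for (int i = 0; i < 6; i++)
        x = 0.5 * (x + a / x);
    return x;
}

int main() {
    std::printf("%.15f vs %.15f\n", heronSqrt(2.0), std::sqrt(2.0));
}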
For example: Pipelined RISC DSP chips have fat (many parallel streams) "one FFT result per clock cycle" pipelines that are rock solid (no cache misses or jitter).
The setup takes a few cycles but once primed it's
acquired data -> ( pipeline ) -> processed data
every clock cycle (with a pipeline delay, of course).
In that domain hardware implementations are chosen to work well with vector calculations and with consistent capped timings.
( To be clear, I haven't looked into that specific linked algo, I'm just pointing out that it's not an N.R.-only world )
This is also a good case study on the difference between throughput sensitive vs latency sensitive applications. The rejection sampling is fast on average, but the analytical solution likely has a much tighter upper bound on how long it can take. In a throughput application you care about EV. But if you need to minimize latency, then you care about the upper bound.
It looked like an interesting problem, so I spent some time this morning exploring whether there would be any performance improvement from pregenerating an array of X items (where X is around 1M to 16M) and then randomly returning one of them at a time. I explored the project and copied the functions to be as faithful to the original implementation as possible.
Generating 10M unit-sphere vectors (best of 3 runs, g++ 13, Linux, Intel i7-8565U, one core for the tests):
- naive/rejection: ~574ms
- analytical: ~1122ms
- pregen 1M elements: ~96ms
That's almost 6x faster than the rejection method. Setup of the 1M elements is done once and is not counted in the measurements. These numbers use the double type; using float instead yields around a 4x improvement.

After looking at those results I decided to try it on the project itself, so I downloaded it, compiled it, and applied similar optimizations, only updating the circle and sphere random generators (with 16M unit vectors that are created only once for the app lifetime), but got almost no noticeable benefit (marginal at most). It's hard to tell because of the random nature of the raytracing implementation. On the bright side, the image quality was on par. Honestly, I was afraid this method would generate poor visuals.
Just for the record, I'm talking about something as simple as:
std::vector<Vec3> g_unitSpherePregen;  // table of precomputed unit-sphere samples
uint32_t g_unitSpherePregenIndex = 0;  // cursor into the table

// Fill the table once at startup; each entry comes from the original generator.
void setupUnitSpherePregen(uint32_t nElements) {
    g_unitSpherePregen.resize(nElements);
    for (uint32_t i = 0; i < nElements; ++i) {
        g_unitSpherePregen[i] = unitSphereNaive(); // call the original naive or analytical method
    }
}

// Return the next precomputed sample, wrapping around at the end of the table.
Vec3 unitSpherePregen() {
    g_unitSpherePregenIndex = (g_unitSpherePregenIndex + 1) % g_unitSpherePregen.size();
    return g_unitSpherePregen[g_unitSpherePregenIndex];
}
I also tried using a PRNG (std::mt19937 and xorshf96) in unitSpherePregen instead of the incremented index, but the increment was faster and still yielded good visual results. The next step would be profiling, but I don't think I will invest more time in this.
Edit: fix formatting
The idea is based on the Ziggurat Method. You overlap the circle with n boxes that each encapsulate the same amount of area of the underlying circle, select a random box, and then do rejection.
With 128 boxes, this reduces the average number of additional iterations from 27% to 0.7%, which should massively reduce the number of branch mispredictions.
It ended up about 2x faster than the simple rejection method.
I haven't applied this to spheres yet, but that should also be possible.
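I don't have your code, but here's a minimal sketch of the box-overlay idea for the disc. Instead of equal-area boxes chosen uniformly, it uses equal-width strips chosen with probability proportional to their bounding-box area (via std::discrete_distribution), which keeps the setup trivial while preserving the same accept/reject structure; the names and the strip count of 128 are arbitrary:

#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Cover the unit disc with N vertical strips; strip i spans x in [x_i, x_i + w]
// and gets a bounding box tall enough to contain its whole slice of the disc.
// Pick a box with probability proportional to its area, pick a uniform point
// inside it, and reject the few points that fall outside the disc. Accepted
// points are uniform on the disc because the boxes are disjoint and cover it.
struct BoxDiscSampler {
    int n;
    double w;                        // strip width
    std::vector<double> halfHeight;  // box half-height per strip
    std::discrete_distribution<int> pickBox;
    std::mt19937 rng{12345};
    std::uniform_real_distribution<double> u{0.0, 1.0};

    explicit BoxDiscSampler(int strips) : n(strips), w(2.0 / strips) {
        std::vector<double> area(n);
        halfHeight.resize(n);
        for (int i = 0; i < n; i++) {
            double x0 = -1.0 + i * w, x1 = x0 + w;
            // Edge of the strip closest to x = 0 fixes the tallest point of the slice.
            double xNear = (x0 <= 0.0 && x1 >= 0.0) ? 0.0
                         : (std::fabs(x0) < std::fabs(x1) ? x0 : x1);
            halfHeight[i] = std::sqrt(1.0 - xNear * xNear);
            area[i] = 2.0 * halfHeight[i] * w;
        }
        pickBox = std::discrete_distribution<int>(area.begin(), area.end());
    }

    void sample(double& x, double& y) {
        for (;;) {
            int i = pickBox(rng);
            x = -1.0 + i * w + u(rng) * w;
            y = (2.0 * u(rng) - 1.0) * halfHeight[i];
            if (x * x + y * y <= 1.0) return; // rare rejection near the arc
        }
    }
};

int main() {
    BoxDiscSampler s(128);
    double x, y;
    s.sample(x, y);
    std::printf("%f %f\n", x, y);
}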
I'm wondering what would happen if a 2D or 3D array were used instead, so that instead of working with the unit circle / unit sphere, you worked with a 256x circle/sphere.
Assuming the center of the circle/sphere is at position (127, 127) or (127, 127, 127), you could precompute which elements of the array are part of that 256 circle/sphere, and only the elements on the boundary would need to be marked as special. You would only need 3 values (2 bits per item):
0 = not in the circle/sphere
1 = in the circle/sphere
2 = might be in or out
Then you would only need to randomly pick a point and do a single lookup in the 2D/3D array to evaluate it. Most of the time only simple math would be involved, and a simple accept/reject would return a value. I guess it would also reduce the number of additional retries to 0.7% on a circle (one boundary cell for every 128 items = 1/128 = 0.78%).

From my limited understanding, what I'm describing looks like a simpler implementation, but it would require more memory and in the end would have the same runtime performance as yours (assuming memory and processor caches were the same, which they probably aren't). Uhm... I guess the implementation you present is actually doing something similar but with a quarter of the circle, so you need less memory.
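For what it's worth, a minimal sketch of that lookup idea in 2D. I've put the grid over [-1,1] x [-1,1] with floating-point coordinates rather than an integer grid centered at (127, 127), but the 0/1/2 classification and the accept/reject flow are as described; it is still plain rejection sampling, the grid just decides most candidates with a table lookup:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// 256x256 grid over the square [-1,1] x [-1,1]. Each cell stores:
//   0 = entirely outside the unit circle (reject immediately)
//   1 = entirely inside (accept immediately)
//   2 = crossed by the boundary (fall back to the exact x^2+y^2 test)
constexpr int N = 256;
std::vector<unsigned char> g_grid(N * N);

void setupGrid() {
    const double cell = 2.0 / N;
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < N; i++) {
            double x0 = -1.0 + i * cell, y0 = -1.0 + j * cell;
            // Nearest and farthest |coordinate| of the cell from the center (0,0).
            double nx = std::max(0.0, std::max(x0, -(x0 + cell)));
            double ny = std::max(0.0, std::max(y0, -(y0 + cell)));
            double fx = std::max(std::fabs(x0), std::fabs(x0 + cell));
            double fy = std::max(std::fabs(y0), std::fabs(y0 + cell));
            if (fx * fx + fy * fy <= 1.0)      g_grid[j * N + i] = 1; // fully inside
            else if (nx * nx + ny * ny > 1.0)  g_grid[j * N + i] = 0; // fully outside
            else                               g_grid[j * N + i] = 2; // boundary cell
        }
    }
}

void sampleDisc(std::mt19937& rng, double& x, double& y) {
    std::uniform_real_distribution<double> u(-1.0, 1.0);
    for (;;) {
        x = u(rng);
        y = u(rng);
        int i = static_cast<int>((x + 1.0) * 0.5 * N);
        int j = static_cast<int>((y + 1.0) * 0.5 * N);
        unsigned char c = g_grid[j * N + i];
        if (c == 1) return;                         // certainly inside
        if (c == 2 && x * x + y * y <= 1.0) return; // exact check only on the edge
        // otherwise reject and retry
    }
}

int main() {
    setupGrid();
    std::mt19937 rng(42);
    double x, y;
    sampleDisc(rng, x, y);
    std::printf("%f %f\n", x, y);
}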
Interesting, thanks for sharing.
I would like to think the rejection algorithm after -O3 is benefiting from branch prediction and all sorts of modern speculation optimizations. But I imagine the real test of that would be running these benchmarks on a 5-10ish year old uarch.
The rejector is going to be faster because more iterations can be kept in the reorder buffer at once. The CPU is going to have to keep all the instructions post-branch in the reorder buffer until the branch is retired. If it's waiting on an instruction with an especially long latency like a square root it's probably speculatively already finished the work anyway, rejected or not.
If the branch rejects after the work is done the reorder buffer can retire each instruction of the random generation algorithm as it comes and then wait on however many branches at the end which can also be run in parallel because those branches aren't dependent on each other. All those branches which are doing a square root will also be pipelined properly instead of putting bubbles everywhere.
If you talked about running on a 30-year-old architecture, then sure, that would tell you something about the effect of modern speculation on the code.
Maybe this:
1. Generate random X [-1 to 1]
2. Based on X you can calculate the range of legal Y that will still fall within the circle. Try to use a lookup table + interpolation with the desired resolution to make this really fast if there aren't sufficiently fast hardware instructions we can abuse.
3. Generate random Y in that range.
Yup, that calculation is called sqrt(1 - x²).
> Try to use a lookup table+interpolation with desired resolution to make this really fast
On modern hardware (like, most architectures from 1995-ish onwards), LUTs are mostly slower than polynomial-based solutions for the same level of accuracy.
> 1. Generate random X [-1 to 1]
As others have pointed out, this is already wrong unless you intend to reject some of the solutions later.
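To spell out why (this is just standard density bookkeeping, nothing from the article): x is drawn with density 1/2 on [-1, 1], and given x, y is drawn with density 1/(2·sqrt(1 - x²)) on its allowed range, so the joint density inside the circle is

p(x, y) = 1 / (4·sqrt(1 - x²)),

which is not the constant 1/π that uniform sampling requires, so points pile up near x = ±1. You can repair it either by accepting each candidate only with probability sqrt(1 - x²), or by drawing x from a density proportional to sqrt(1 - x²) in the first place.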