Greedy algorithms are about making locally optimal choices.
They are not "brute force" algorithms or inefficient ones.
In fact, greedy algorithms are almost always faster, because they consider only local information instead of the entire data set. In exchange, a greedy algorithm may produce non-optimal answers.
The change-making algorithm from working retail is what comes to mind. Greedy means you always use the largest coin you can, then the next largest, and so on. It works with sensible coin denominations, but there are sets of coins for which the greedy algorithm is not optimal.
Looking at cases where greedy isn't optimal, I see two patterns. Either:
* there are two coins close in value (ratio of less than 2), e.g. both 20¢ and 25¢; or
* there is a "missing" coin at the GCD of two larger coins.
I'm pretty sure you can precompute extra "virtual coins" (e.g. 40¢) to make greedy optimal again, but you have to place restrictions on how many of them you're allowed to use.
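To make the failure mode concrete, here's a minimal sketch comparing greedy against an exact dynamic-programming count. The denominations are made up just to include both a 20¢ and a 25¢ coin, so treat it as illustrative only:

#include <algorithm>
#include <cstdio>
#include <vector>

// Greedy: repeatedly take the largest coin that still fits.
int greedyCoinCount(int amount, std::vector<int> coins) {
    std::sort(coins.rbegin(), coins.rend()); // largest first
    int count = 0;
    for (int c : coins) {
        count += amount / c;
        amount %= c;
    }
    return count;
}

// Dynamic programming: the true minimum number of coins.
int optimalCoinCount(int amount, const std::vector<int>& coins) {
    const int INF = 1 << 30;
    std::vector<int> best(amount + 1, INF);
    best[0] = 0;
    for (int a = 1; a <= amount; a++)
        for (int c : coins)
            if (c <= a && best[a - c] + 1 < best[a])
                best[a] = best[a - c] + 1;
    return best[amount];
}

int main() {
    std::vector<int> coins = {1, 5, 20, 25}; // hypothetical set with both 20¢ and 25¢
    int amount = 40;
    std::printf("greedy:  %d coins\n", greedyCoinCount(amount, coins));  // 25+5+5+5 -> 4
    std::printf("optimal: %d coins\n", optimalCoinCount(amount, coins)); // 20+20   -> 2
}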
It's really fascinating to think about how all of this would work.
I think that’s one of those things Jai and Zig agree on: compile-time functions have a place in preventing magic numbers that cannot be debugged.
Square roots are implemented in hardware: https://www.felixcloutier.com/x86/sqrtsd
> In software, you'd normally iterate Newton's Method
Software normally computes trigonometric functions (and other complicated ones, like exponentials and std::erf) with a high-degree polynomial approximation.
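For illustration, a minimal sketch of the idea using the truncated Taylor series for sin on a reduced range, evaluated with Horner's scheme. Real math libraries use minimax-fitted coefficients and argument reduction, so the numbers here are illustrative only:

#include <cmath>
#include <cstdio>

// sin(x) ~ x - x^3/6 + x^5/120 - x^7/5040 for |x| <= pi/4, evaluated with
// Horner's scheme. Production implementations fit better coefficients and
// first reduce any argument into a small range like this one.
double sinPoly(double x) {
    double x2 = x * x;
    return x * (1.0 + x2 * (-1.0 / 6.0 + x2 * (1.0 / 120.0 + x2 * (-1.0 / 5040.0))));
}

int main() {
    double x = 0.5;
    std::printf("poly: %.12f  libm: %.12f\n", sinPoly(x), std::sin(x));
}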
But how does that hardware implementation work internally?
The point I'm trying to make is that it is probably an (in-hardware) loop that uses Newton's Method.
ETA: The point being that, although in the source code it looks like all loops have been eliminated, they really haven't been if you dig deeper.
I don’t know, but based on the performance difference between FP32 and FP64 square root instructions, the implementation probably produces 4-5 bits of mantissa per cycle.
https://en.wikipedia.org/wiki/Methods_of_computing_square_ro...
Something like Heron's method is a special case of Newton's method.
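For anyone who wants to see the shape of it, here's Heron's method in a few lines. The exponent-halving starting guess and the fixed iteration count are choices I made for this sketch; hardware and libm implementations pick their starting estimates differently:

#include <cmath>
#include <cstdio>

// Heron's method, i.e. Newton's method applied to f(x) = x^2 - a:
//   x_{n+1} = (x_n + a / x_n) / 2
// Each step roughly doubles the number of correct bits, so with a
// reasonable starting guess only a handful of iterations are needed
// for a double.
double heronSqrt(double a) {                 // assumes a > 0
    int e;
    double m = std::frexp(a, &e);            // a = m * 2^e, m in [0.5, 1)
    double x = std::ldexp(m, (e + 1) / 2);   // rough guess: halve the exponent
    for (int i = 0; i < 6; i++)
        x = 0.5 * (x + a / x);
    return x;
}

int main() {
    std::printf("%.15f vs %.15f\n", heronSqrt(2.0), std::sqrt(2.0));
}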
For example: Pipelined RISC DSP chips have fat (many parallel streams) "one FFT result per clock cycle" pipelines that are rock solid (no cache misses or jitter).
The setup takes a few cycles but once primed it's
acquired data -> ( pipeline ) -> processed data
every clock cycle (with a pipeline delay, of course).
In that domain hardware implementations are chosen to work well with vector calculations and with consistent capped timings.
( To be clear, I haven't looked into that specific linked algo, I'm just pointing out that it's not an N.R.-only world )
This is also a good case study on the difference between throughput sensitive vs latency sensitive applications. The rejection sampling is fast on average, but the analytical solution likely has a much tighter upper bound on how long it can take. In a throughput application you care about EV. But if you need to minimize latency, then you care about the upper bound.
It looked like an interesting problem, so I spent some time this morning exploring whether there would be any performance improvement from pregenerating an array of X items (where X is around 1M to 16M) and then randomly returning one of them at a time. I explored the project and copied the functions to be as faithful to the original implementation as possible.
Generating 10M unit-sphere vectors (best of 3 runs, g++ 13, Linux, Intel i7-8565U, one core for the tests):
- naive/rejection: ~574ms
- analytical: ~1122ms
- pregen 1M elements: ~96ms
That's almost 6x faster than the rejection method. Setup of the 1M elements is done once and is not counted in the measurements. These numbers use the double type; using float instead yields around a 4x improvement.

After looking at those results I decided to try it on the project itself, so I downloaded it, compiled it, and applied similar optimizations, only updating the circle and sphere random generators (with 16M unit vectors that are created only once for the app lifetime), but got almost no noticeable benefit (marginal at most). It's hard to tell because of the random nature of the raytracing implementation. On the bright side, the image quality was on par. Honestly, I was afraid this method would generate poor visuals.
Just for the record, I'm talking about something as simple as:
std::vector<Vec3> g_unitSpherePregen;  // table of precomputed unit-sphere samples
uint32_t g_unitSpherePregenIndex = 0;  // cursor into the table

// Fill the table once at startup; each entry comes from the original generator.
void setupUnitSpherePregen(uint32_t nElements) {
    g_unitSpherePregen.resize(nElements);
    for (uint32_t i = 0; i < nElements; ++i) {
        g_unitSpherePregen[i] = unitSphereNaive(); // call the original naive or analytical method
    }
}

// Return the next precomputed sample, wrapping around at the end of the table.
Vec3 unitSpherePregen() {
    g_unitSpherePregenIndex = (g_unitSpherePregenIndex + 1) % g_unitSpherePregen.size();
    return g_unitSpherePregen[g_unitSpherePregenIndex];
}
I also tried using a PRNG (std::mt19937 and xorshf96) in unitSpherePregen instead of the incremented index, but the increment was faster and still yielded good visual results. The next step would be profiling, but I don't think I will invest more time in this.
Edit: fix formatting
The idea is based on the Ziggurat Method. You overlap the circle with n boxes that each encapsulate the same amount of area of the underlying circle, select a random box, and then do rejection.
With 128 boxes, this reduces the average number of additional iterations from 27% to 0.7%, which should massively reduce the number of branch mispredictions.
It ended up about 2x faster than the simple rejection method.
I haven't applied this to spheres yet, but that should also be possible.
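I don't have your code, but here's a minimal sketch of the box-overlay idea for the disc. Instead of equal-area boxes chosen uniformly, it uses equal-width strips chosen with probability proportional to their bounding-box area (via std::discrete_distribution), which keeps the setup trivial while preserving the same accept/reject structure; the names and the strip count of 128 are arbitrary:

#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Cover the unit disc with N vertical strips; strip i spans x in [x_i, x_i + w]
// and gets a bounding box tall enough to contain its whole slice of the disc.
// Pick a box with probability proportional to its area, pick a uniform point
// inside it, and reject the few points that fall outside the disc. Accepted
// points are uniform on the disc because the boxes are disjoint and cover it.
struct BoxDiscSampler {
    int n;
    double w;                        // strip width
    std::vector<double> halfHeight;  // box half-height per strip
    std::discrete_distribution<int> pickBox;
    std::mt19937 rng{12345};
    std::uniform_real_distribution<double> u{0.0, 1.0};

    explicit BoxDiscSampler(int strips) : n(strips), w(2.0 / strips) {
        std::vector<double> area(n);
        halfHeight.resize(n);
        for (int i = 0; i < n; i++) {
            double x0 = -1.0 + i * w, x1 = x0 + w;
            // Edge of the strip closest to x = 0 fixes the tallest point of the slice.
            double xNear = (x0 <= 0.0 && x1 >= 0.0) ? 0.0
                         : (std::fabs(x0) < std::fabs(x1) ? x0 : x1);
            halfHeight[i] = std::sqrt(1.0 - xNear * xNear);
            area[i] = 2.0 * halfHeight[i] * w;
        }
        pickBox = std::discrete_distribution<int>(area.begin(), area.end());
    }

    void sample(double& x, double& y) {
        for (;;) {
            int i = pickBox(rng);
            x = -1.0 + i * w + u(rng) * w;
            y = (2.0 * u(rng) - 1.0) * halfHeight[i];
            if (x * x + y * y <= 1.0) return; // rare rejection near the arc
        }
    }
};

int main() {
    BoxDiscSampler s(128);
    double x, y;
    s.sample(x, y);
    std::printf("%f %f\n", x, y);
}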
I'm wondering what would happen if a 2D or 3D array were used instead, so that instead of working with the unit circle / unit sphere, you worked with a 256x circle/sphere.
Assuming the center of the circle/sphere is at position (127, 127) or (127, 127, 127), you could precompute which elements of the array are part of that 256 circle/sphere, and only the elements on the boundary would need to be marked as special. You would only need 3 values (2 bits per item):
0 = not in the circle/sphere
1 = in the circle/sphere
2 = might be in or out
Then you would only need to randomly pick a point and do a single lookup in the 2D/3D array to evaluate it. Most of the time only simple math would be involved, and a simple accept/reject would return a value. I guess it would also reduce the number of additional retries to 0.7% on a circle (one boundary cell for every 128 items = 1/128 = 0.78%).

From my limited understanding, what I'm describing looks like a simpler implementation, but it would require more memory and in the end would have the same runtime performance as yours (assuming memory and processor caches were the same, which they probably aren't). Uhm... I guess the implementation you present is actually doing something similar but with a quarter of the circle, so you need less memory.
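For what it's worth, a minimal sketch of that lookup idea in 2D. I've put the grid over [-1,1] x [-1,1] with floating-point coordinates rather than an integer grid centered at (127, 127), but the 0/1/2 classification and the accept/reject flow are as described; it is still plain rejection sampling, the grid just decides most candidates with a table lookup:

#include <algorithm>
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// 256x256 grid over the square [-1,1] x [-1,1]. Each cell stores:
//   0 = entirely outside the unit circle (reject immediately)
//   1 = entirely inside (accept immediately)
//   2 = crossed by the boundary (fall back to the exact x^2+y^2 test)
constexpr int N = 256;
std::vector<unsigned char> g_grid(N * N);

void setupGrid() {
    const double cell = 2.0 / N;
    for (int j = 0; j < N; j++) {
        for (int i = 0; i < N; i++) {
            double x0 = -1.0 + i * cell, y0 = -1.0 + j * cell;
            // Nearest and farthest |coordinate| of the cell from the center (0,0).
            double nx = std::max(0.0, std::max(x0, -(x0 + cell)));
            double ny = std::max(0.0, std::max(y0, -(y0 + cell)));
            double fx = std::max(std::fabs(x0), std::fabs(x0 + cell));
            double fy = std::max(std::fabs(y0), std::fabs(y0 + cell));
            if (fx * fx + fy * fy <= 1.0)      g_grid[j * N + i] = 1; // fully inside
            else if (nx * nx + ny * ny > 1.0)  g_grid[j * N + i] = 0; // fully outside
            else                               g_grid[j * N + i] = 2; // boundary cell
        }
    }
}

void sampleDisc(std::mt19937& rng, double& x, double& y) {
    std::uniform_real_distribution<double> u(-1.0, 1.0);
    for (;;) {
        x = u(rng);
        y = u(rng);
        int i = static_cast<int>((x + 1.0) * 0.5 * N);
        int j = static_cast<int>((y + 1.0) * 0.5 * N);
        unsigned char c = g_grid[j * N + i];
        if (c == 1) return;                         // certainly inside
        if (c == 2 && x * x + y * y <= 1.0) return; // exact check only on the edge
        // otherwise reject and retry
    }
}

int main() {
    setupGrid();
    std::mt19937 rng(42);
    double x, y;
    sampleDisc(rng, x, y);
    std::printf("%f %f\n", x, y);
}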
Interesting, thanks for sharing.
I would like to think the rejection algorithm after -O3 is benefiting from branch prediction and all sorts of modern speculation optimizations. But I imagine the real test of that would be running these benchmarks on a 5-10ish year old uarch.
The rejector is going to be faster because more iterations can be kept in the reorder buffer at once. The CPU is going to have to keep all the instructions post-branch in the reorder buffer until the branch is retired. If it's waiting on an instruction with an especially long latency like a square root it's probably speculatively already finished the work anyway, rejected or not.
If the branch rejects after the work is done the reorder buffer can retire each instruction of the random generation algorithm as it comes and then wait on however many branches at the end which can also be run in parallel because those branches aren't dependent on each other. All those branches which are doing a square root will also be pipelined properly instead of putting bubbles everywhere.
If you talked about running on a 30-year-old architecture, then sure, that would tell you something about the effect of modern speculation on the code.
Maybe this:
1. Generate random X [-1 to 1]
2. Based on X you can calculate the range of legal Y that will still fall within the circle. Try to use a lookup table + interpolation with the desired resolution to make this really fast if there aren't sufficiently fast hardware instructions we can abuse.
3. Generate random Y in that range.
Yup, that calculation is called sqrt(1 - x²).
> Try to use a lookup table+interpolation with desired resolution to make this really fast
On modern hardware (like, most architectures from 1995-ish onwards), LUTs are mostly slower than polynomial-based solutions for the same level of accuracy.
> 1. Generate random X [-1 to 1]
As others have pointed out, this is already wrong unless you intend to reject some of the solutions later.
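To spell out why (this is just standard density bookkeeping, nothing from the article): x is drawn with density 1/2 on [-1, 1], and given x, y is drawn with density 1/(2·sqrt(1 - x²)) on its allowed range, so the joint density inside the circle is

p(x, y) = 1 / (4·sqrt(1 - x²)),

which is not the constant 1/π that uniform sampling requires, so points pile up near x = ±1. You can repair it either by accepting each candidate only with probability sqrt(1 - x²), or by drawing x from a density proportional to sqrt(1 - x²) in the first place.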