I particularly like his JSON library [0], option parsing libraries [1], branchless UTF-8 decoder [2], lock-free stack [3], and trie library [4].
I also like his taste in licensing (all of the above is released under The Unlicense).
[0] https://github.com/skeeto/pdjson
[1] https://github.com/skeeto/optparse and https://github.com/skeeto/getopt
[2] https://github.com/skeeto/branchless-utf8
Branchless UTF-8 is a famous one, for example.
That said, I feel like the avalanche + bias idea misses quite a bit.
Example: take this function, the last one listed in the article.
// exact bias: 0.020888578919738908
uint32_t
triple32(uint32_t x)
{
    x ^= x >> 17;
    x *= 0xed5ad4bbU;
    x ^= x >> 11;
    x *= 0xac4c1b51U;
    x ^= x >> 15;
    x *= 0x31848babU;
    x ^= x >> 14;
    return x;
}
Produces this image when implemented by FabriceNeyret2 on ShaderToy: https://www.shadertoy.com/view/WttXWX
or
https://i.imgur.com/qU2P5rx.png
However, a simple normal-map slope differentiation shows quite a few obvious "crystal" lines. There's probably a technical term for those kinds of ridges.
https://i.imgur.com/IHWT1GM.png
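To make "slope differentiation" concrete outside a shader, here is a rough C sketch of the kind of neighbor-difference test I mean (my own approximation, not FabriceNeyret2's actual ShaderToy code): treat the hash of a grid index as a height and print the difference between horizontal neighbors; the "crystal" ridges show up as stretches where those differences stay correlated.

#include <stdint.h>
#include <stdio.h>

/* triple32 exactly as quoted above, repeated so this compiles standalone. */
static uint32_t triple32(uint32_t x)
{
    x ^= x >> 17; x *= 0xed5ad4bbU;
    x ^= x >> 11; x *= 0xac4c1b51U;
    x ^= x >> 15; x *= 0x31848babU;
    x ^= x >> 14;
    return x;
}

int main(void)
{
    enum { W = 32, H = 8 };
    for (uint32_t y = 0; y < H; y++) {
        for (uint32_t x = 0; x + 1 < W; x++) {
            /* crude d/dx of the hash treated as a height in [0,1) */
            double a = triple32(y * W + x)     / 4294967296.0;
            double b = triple32(y * W + x + 1) / 4294967296.0;
            printf("%+.2f ", b - a);
        }
        putchar('\n');
    }
    return 0;
}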
Edit: Also, this entire idea is half a decade old? https://nullprogram.com/blog/2018/07/31/
Is that a bad thing?
In hash function design, cryptography, or information warfare, five years is an enormous length of time.
That means "somebody" in humanity has been brute force algorithm generating improved integer hash functions for the last half decade.
For all I know, Google, NVidia, Microsoft, NSA, Spetssvyaz, 3PLA, and Unit 8200 may have been in an integer hash function generation arms race for the last half decade.
Long Reuters article today on drones, and how the US is floundering (they're just not spending as much money as people want). [1] However, if a drone can dodge better than the enemy can shoot it down, then the integer hash function is suddenly not "civilian". Missiles have known this forever. Talk to the MIRV industry.
[1] https://www.reuters.com/business/aerospace-defense/sea-drone...
But fast string search changed a lot in the last decade. The previous winners were all overtaken by a new two-way algorithm as implemented in musl. This was a surprise. Search is definitely more important than integer hashing. Maybe glibc will catch up in the next decade.
I didn't get an answer at 800-I-AM-MIRV, and I don't know another way to contact this industry. Can you perhaps instead hint at why you think that industry might imply non-cryptographic hashes are not "civilian"?
Cool to see this! I think it would be cool to hook it up with SMHasher3 (Frank J. T. Wojcik's much improved and faster variant of the venerable hash test suite) to automatically evaluate its output. You could use a subset of tests and fail fast for speed.
It would also be cool to expand it to 64- and 128-bit hashes (though obviously the search space is larger).
Somewhat related: I created some Node.js code to measure avalanche under multiplication for 64-bit primes, in order to select the values for Rain.
[Rain]: https://github.com/dosyago/rain
[SMHasher3]: https://gitlab.com/fwojcik/smhasher3
Carry-less multiply would also extend the set of reversible operations, and it is fast on some existing hardware. (CRC too, which is available on a somewhat wider set of hardware, but the functions it yields should be a strict subset of what CLMUL could find.)
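For readers unfamiliar with it, carry-less multiply is long multiplication with XOR instead of addition (x86 exposes it as PCLMULQDQ). A minimal software sketch, purely my own illustration and not part of the prospector, along with why it stays reversible for odd constants:

#include <stdint.h>

/* Carry-less (polynomial) multiply of two 32-bit values, truncated to
 * 32 bits.  Just as an ordinary multiply by an odd constant is a
 * bijection mod 2^32, clmul by a constant with its low bit set is a
 * bijection on 32-bit words, so it could slot into the same
 * reversible-operation framework. */
static uint32_t clmul32(uint32_t a, uint32_t b)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++)
        if (b & (1u << i))
            r ^= a << i;   /* XOR instead of add: no carries propagate */
    return r;
}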
Many applications of hashes only care about the LSBs or MSBs of the hash, so evaluating the bias on MSB or LSB windows would be interesting (or mod various numbers), and some functions that are unbiased overall will look better or worse under metrics that don't gauge the entire output (or for non-uniform inputs, for that matter, e.g. ASCII text).
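As a rough sketch of the kind of windowed measurement I mean (sampling-based, so the estimate is noisy; triple32 from the article is just the function under test here): estimate avalanche bias restricted to the low 8 output bits instead of the whole word.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

static uint32_t triple32(uint32_t x)
{
    x ^= x >> 17; x *= 0xed5ad4bbU;
    x ^= x >> 11; x *= 0xac4c1b51U;
    x ^= x >> 15; x *= 0x31848babU;
    x ^= x >> 14;
    return x;
}

int main(void)
{
    enum { TRIALS = 1 << 20, WINDOW = 8 };   /* only judge the low 8 bits */
    static long flips[32][WINDOW];
    for (long t = 0; t < TRIALS; t++) {
        uint32_t x = (uint32_t)rand() << 16 ^ (uint32_t)rand();
        uint32_t h = triple32(x);
        for (int i = 0; i < 32; i++) {           /* flip each input bit */
            uint32_t d = h ^ triple32(x ^ (1u << i));
            for (int j = 0; j < WINDOW; j++)     /* which low bits flipped? */
                flips[i][j] += (d >> j) & 1;
        }
    }
    double worst = 0;
    for (int i = 0; i < 32; i++)
        for (int j = 0; j < WINDOW; j++) {
            /* an unbiased hash flips each output bit with probability 1/2 */
            double p = (double)flips[i][j] / TRIALS;
            if (p < 0.5) p = 1.0 - p;
            if (p - 0.5 > worst) worst = p - 0.5;
        }
    printf("worst avalanche bias over the low %d bits: %.5f\n", WINDOW, worst);
    return 0;
}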
It outputs C code for the best hash function it generated.
So it's useful if you want a hash function, and don't think any of the existing ones are good enough. Or if you're researching hash functions, and want new ideas for structures.
Cool? Well, generating code is cool. Doing it randomly is a first step towards genetic programming, which would be even cooler. ... And making computers burn CPU cycles computing (mostly unused) hashes seems to be something we humans have wanted to do for the last 15 years or so.
Keep in mind that there's a big difference between cryptographic hash functions and the kind of hash functions investigated here.
A hash table is a remarkable data structure that can be used to implement many algorithms simply and efficiently.
This efficiency depends on being able to create small (say 32- or 64-bit) and nearly-unique “hashes” for your data. For example, to hash user names, it wouldn’t work very well to just use the ASCII code for the first letter of the names, because then a lot of usernames would map to the same number. That’s called a collision. If there are a lot of collisions, hash tables become very inefficient.
A better approach would be to grab bits from the entire username and smash them together somehow, so that "throwaway_1237" and "throwaway_12373" still map to different numbers.
Hash functions do this mapping. The property of "avalanche" describes how thoroughly a small change in the input gets spread across the output bits, which is what keeps collisions rare.
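For instance, here is roughly the difference between the two approaches. FNV-1a is used purely as a familiar example of "grab every byte and mix"; any decent string hash would illustrate the same point.

#include <stdint.h>
#include <stdio.h>

/* Bad: every username starting with 't' collides. */
static uint32_t hash_first_letter(const char *s)
{
    return (uint32_t)(unsigned char)s[0];
}

/* Better: mix every byte in, so near-identical strings diverge.
 * (32-bit FNV-1a, shown only as an example; not from the article.) */
static uint32_t hash_fnv1a(const char *s)
{
    uint32_t h = 2166136261u;
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 16777619u;
    }
    return h;
}

int main(void)
{
    const char *a = "throwaway_1237", *b = "throwaway_12373";
    printf("first-letter: %08x vs %08x  (collision)\n",
           hash_first_letter(a), hash_first_letter(b));
    printf("fnv1a:        %08x vs %08x\n", hash_fnv1a(a), hash_fnv1a(b));
    return 0;
}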
There’s generally a tradeoff between (1) how fast the actual hash function is, and (2) how good of a job it does of avoiding collisions.
World-class hash functions look remarkably weird, multiplying and xoring and shifting by strange amounts. It’s very hard for humans to look at these cryptic functions and guess how well they will do.
So this code is trying out a bunch of hash functions at random and baking them off against each other. This is cool because success here could improve real-world performance of a core data structure used across many languages and libraries.
I made a test harness to check random constants (start values, multiplication constants, shift/rotate amounts, etc.) and print out the best constants found so far, based on the number of colliding buckets and how many collisions there were. I think I got it down to one bucket with just two colliding values at a ~40% fill rate. Interestingly, I found that the best-performing constants included similar values for the number of positions to shift regardless of the other constants, so I ended up hard-coding them.
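A stripped-down version of that kind of harness, as I'd reconstruct it (this is my own sketch, not the parent's actual code: a random multiplier and shift, scored by how many keys collide in a power-of-two table, with sequential keys standing in for whatever key set you actually have):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NKEYS   1600           /* ~40% fill of a 4096-bucket table */
#define NBUCKET 4096

/* Candidate hash: one multiply and one xorshift, both randomized. */
static uint32_t candidate(uint32_t x, uint32_t mul, int shift)
{
    x *= mul;
    x ^= x >> shift;
    return x;
}

/* Score = number of keys that land in an already-occupied bucket. */
static int score(uint32_t mul, int shift)
{
    static uint8_t used[NBUCKET];
    int collisions = 0;
    for (int i = 0; i < NBUCKET; i++) used[i] = 0;
    for (uint32_t k = 0; k < NKEYS; k++) {
        uint32_t b = candidate(k, mul, shift) % NBUCKET;
        collisions += used[b]++ ? 1 : 0;
    }
    return collisions;
}

int main(void)
{
    srand((unsigned)time(0));
    int best = NKEYS;
    for (long trial = 0; trial < 100000; trial++) {
        uint32_t mul = ((uint32_t)rand() << 16 ^ (uint32_t)rand()) | 1u;
        int shift = 1 + rand() % 31;
        int s = score(mul, shift);
        if (s < best) {        /* keep and report the best constants so far */
            best = s;
            printf("mul=0x%08x shift=%d collisions=%d\n", mul, shift, s);
        }
    }
    return 0;
}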
When I did something similar, I was thinking about perfect hashing, where the set of inputs is known ahead of time. The usual approach uses a constant array, but I wanted to see, particularly when the inputs are already small integers, whether I could compact them further (obviously this is possible: hash -= hash >> gap_index). I sat down and wrote out a list of ... probably 100? ... primitive operations (some of which are admittedly redundant with each other, but which are useful to think of separately). And then I got bored and never did anything with the project.
What are those niceties, and why is a reversible operation desirable in this context?
> This is actually a lot easier than it sounds. As long as every n-bit integer operation used in the hash function is reversible, then the hash function has this property. An operation is reversible if, given its output, you can unambiguously compute its input.
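To make the quoted point concrete: every step in triple32 is individually invertible, which is why distinct inputs can never collide. A sketch of how you'd undo the two kinds of steps (xorshift and multiply-by-odd-constant) on 32-bit words; my own illustration, not code from the article:

#include <stdint.h>

/* Undo x ^= x >> 17.  For shifts of 16 or more on a 32-bit word, one
 * extra xorshift is enough, because the shifted-in bits come from the
 * upper half of the word, which the operation left untouched. */
static uint32_t un_xorshift17(uint32_t y)
{
    return y ^ (y >> 17);
}

/* Every odd 32-bit constant has a multiplicative inverse mod 2^32.
 * Newton's iteration finds it: the seed is correct to 3 bits (any odd a
 * satisfies a*a == 1 mod 8) and each round doubles the correct bits,
 * so four rounds cover all 32. */
static uint32_t mul_inverse(uint32_t a)
{
    uint32_t x = a;
    for (int i = 0; i < 4; i++)
        x *= 2u - a * x;
    return x;               /* a * mul_inverse(a) == 1 (mod 2^32) */
}

/* So x *= 0xed5ad4bbU is undone by x *= mul_inverse(0xed5ad4bbU), and
 * chaining the inverse steps in reverse order inverts the whole hash. */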
2018: https://news.ycombinator.com/item?id=17659672 (14 comments)
2021: https://news.ycombinator.com/item?id=29117783 (1 comment)
Also, does anyone have pointers to a hash function discovery mechanism that finds a good hash function for integer values within a specific range (i.e. I know that my integer values are between, say, 10,000 and 200,000), where I want to hash them into some optimal number of hash buckets?
If you're just looking for "good", the best approach is almost always to use a normal hash function. If your numbers are extremely large and the range extremely small, you can offset so the minimum becomes 0 again and use a smaller/faster hash. If you're more looking for a "perfect choice for the exact range", I think the closest you'll get is something like this randomized approach, but with the tests modified to run over your interval.
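A concrete sketch of the "offset to zero, then hash" suggestion (my example; the multiply-then-take-high-bits step is the standard Fibonacci-hashing trick, nothing specific to your range):

#include <stdint.h>
#include <stdio.h>

#define KEY_MIN 10000u
#define KEY_MAX 200000u

/* Map a key from [KEY_MIN, KEY_MAX] into one of 2^bucket_bits buckets.
 * Offsetting first keeps the effective input range small; multiplying
 * by an odd "random-looking" constant and keeping the top bits spreads
 * nearby keys apart. */
static uint32_t bucket(uint32_t key, int bucket_bits)
{
    uint32_t x = key - KEY_MIN;        /* offset so the minimum is 0 */
    x *= 2654435769u;                  /* ~2^32 / golden ratio, odd */
    return x >> (32 - bucket_bits);    /* top bits are the best mixed */
}

int main(void)
{
    /* Adjacent keys should land in well-spread buckets. */
    for (uint32_t k = KEY_MIN; k < KEY_MIN + 8; k++)
        printf("%u -> bucket %u\n", k, bucket(k, 12));   /* 4096 buckets */
    return 0;
}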
I have updated my StackOverflow answer at https://stackoverflow.com/questions/664014/what-integer-hash...
That being said, this sort of quick-and-dirty testing is useful to filter out obviously bad candidates at the early design phase of, say, an ARX primitive.
Very, very unlikely.
But I found nothing better than the old favorites.
I assume this means "indistinguishable through naive statistical tests", because any more literal meaning would surprise me
> (e.g. a random permutation of all 32-bit integers)
Since they only use bijective primitives, it's trivially "perfect" in the "perfect hash function" sense.
To elaborate further on the definition I was going off (which may well not be what the author meant):
> A PRF is considered to be good if its behavior is indistinguishable from a truly random function. Therefore, given an output from either the truly random function or a PRF, there should be no efficient method to correctly determine whether the output was produced by the truly random function or the PRF.
I saw "Skeeto is a legend" at the top of the comments; that is the best compliment.
My sincere gratitude to the author.