The full scope of things hash functions are commonly used for requires at least four algorithms if you care about performance and optimality. It is disconcertingly common to see developers using hash algorithms in contexts where they are not fit for purpose. Gotta pick the right tool for the job.
For example, when you know that your input is uniformly randomly distributed, then truncation is a perfectly good hash function. (And a special case of truncation is the identity function.)
The above condition might sound too strong to be practical, but it is satisfied when you are, e.g., dealing with randomly generated UUIDs.
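A minimal sketch of that idea (the helper name and the choice of keeping 64 bits are mine, purely for illustration):

import uuid

def uuid_hash(u: uuid.UUID) -> int:
    # Keep the low 64 bits. For a version-4 UUID these are almost entirely
    # uniformly random already, so truncation alone is a perfectly fine hash.
    return u.int & ((1 << 64) - 1)

bucket = uuid_hash(uuid.uuid4()) % 1024  # e.g. pick a slot in a 1024-entry table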
Another interesting hash function: length. See https://news.ycombinator.com/item?id=6919216 for a bad example. For a good example: consider rmlint and other file system deduplicators.
These deduplicators scan your filesystem for duplicates (amongst other things). You don't want to compare every file against every other file. So as a first optimisation, you compare files only by some hash. But conventional hashes like SHA-256 or CRC take O(n) to compute. So you compute cheaper hashes first, even if they are weaker. Truncation, i.e. only looking at the first few bytes, is very cheap. Determining the length of a file is even cheaper.
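Roughly, the staged filtering looks like this (a sketch in Python; the function names and the 16-byte prefix length are my choices, not rmlint's):

import hashlib
import os
from collections import defaultdict

PREFIX_LEN = 16  # arbitrary length for the cheap "truncation" hash

def group_by(paths, key):
    groups = defaultdict(list)
    for p in paths:
        groups[key(p)].append(p)
    # Only groups with more than one member can still contain duplicates.
    return [g for g in groups.values() if len(g) > 1]

def file_prefix(path):
    with open(path, "rb") as f:
        return f.read(PREFIX_LEN)

def sha256_of(path):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.digest()

def candidate_duplicates(paths):
    # Stage 1: file length, a single stat call per file.
    for same_len in group_by(paths, os.path.getsize):
        # Stage 2: first few bytes, one tiny read per file.
        for same_prefix in group_by(same_len, file_prefix):
            # Stage 3: full-content hash, O(n), only for the survivors.
            yield from group_by(same_prefix, sha256_of)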
Hashes are _functions_, so they provide the same output given the same input.
If you don't reseed the RNG after every hash computation, then you break this vital property of hashes.
And if you do reseed, then your claim boils down to "every hash function is just an XOR against a constant", which certainly is not true either.
That might be one very particular way to write a hash function, but it's far from the only one.
Believe it or not, for some purposes taking the first few bytes of a string or even just the length of the string are good hash functions.
What are those purposes?
# djb2 (Bernstein's string hash): start at 5381, then for every byte multiply by 33 and add the byte.
def djb2(data: bytes) -> int:
    h = 5381
    for b in data:
        h = (h * 33 + b) & 0xFFFFFFFF  # mask mimics C's unsigned 32-bit overflow
    return h
PS of course if you think the multiplication is overkill, consider that it is nothing more than a shift and an addition: h * 33 is just (h << 5) + h.

For instance, if the last operation during computing a hash function was the kind of integer multiplication where the result has the same length as the operands, then the top bits are the best (because they depend on more of the input bits than the bottom result bits).
On the other hand, if the last operation during computing a hash function was the kind of integer multiplication where the result's length is the sum of the lengths of the operands (i.e. where the high bits of the result are not discarded), then the best bits of the result are neither the top bits nor the bottom bits but the middle bits, for the same reason as before: they are the bits that depend on more of the input bits than either the bottom or the top bits of the result.
When the last operation is an addition, the top bits are the best, because the carry bits propagate from the bottom bits towards the top bits: the higher a bit is in the result, the more input bits it depends on.
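To make the "top bits after a same-width multiply" point concrete, here is a hedged sketch of classic multiplicative (Fibonacci) hashing; the constant and the helper are illustrative, not taken from any of the code above:

GOLDEN = 0x9E3779B97F4A7C15  # roughly 2^64 / golden ratio, an odd 64-bit constant
MASK64 = (1 << 64) - 1

def bucket_of(h: int, k: int) -> int:
    mixed = (h * GOLDEN) & MASK64  # same-width multiply: bits above 64 are discarded
    return mixed >> (64 - k)       # so take the *top* k bits as the bucket index

print(bucket_of(12345, 10))  # a bucket in a 2**10 = 1024-entry table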
Nowadays SwissTable and other similar hash tables use the top bits together with SIMD/SWAR techniques to quickly filter out collisions after determining the starting bucket.
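As a hedged, generic sketch of that tag-filtering idea (the 7-bit split and the packing are illustrative; real implementations differ in the details): a few bits of the hash become a one-byte tag stored per slot, and a SWAR trick compares eight stored tags against the probe's tag at once.

def split_hash(h: int):
    h &= (1 << 64) - 1
    tag = h >> 57                # top 7 bits become the per-slot tag byte
    group = h & ((1 << 57) - 1)  # the remaining bits choose the starting group
    return group, tag

def match_mask(packed_tags: int, tag: int) -> int:
    # packed_tags holds 8 one-byte tags in a 64-bit word. XOR against a
    # broadcast of the probe tag, then apply the classic SWAR zero-byte test:
    # any byte whose high bit is set in the result was an exact match.
    x = packed_tags ^ (tag * 0x0101010101010101)
    return (x - 0x0101010101010101) & ~x & 0x8080808080808080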
I'm perplexed by the claim that addition is cheaper than XOR, especially since addition is built on top of XOR; am I missing something? Is it JavaScript-specific?
def hash(str):
len(str)
O(1), baby!

In any case, for some applications this is indeed a great hash function. Programs like rmlint use it as part of their checks for duplicate files: if files have different lengths, they can't have the same content after all.
Actually, in Python, None is a valid key...
(I'm so sorry. JavaScript has ruined me.)
In something like C with its generic strings[1], it would surely have to be O(n) since you have to scan the entire string to calculate its length?
(I have always been terrible at big-O, mind.)
[0] There's probably more of them by now.
[1] i.e. not a string type that stores its length.
I get why technically it is a hash function, but still, no.
I guess I've never actually had this problem because I was always hashing things that were static, or special cases like password hashes where the salt obviously guarantees uniqueness.
Whenever you map into a smaller space, you get collisions. The bigger space doesn't have to be infinite.
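A tiny demonstration of the pigeonhole point (the one-byte truncation here is just an example hash):

seen = {}
for i in range(257):
    key = i.to_bytes(2, "big")  # 257 distinct two-byte inputs
    h = key[-1]                 # "hash" into a 256-value space by truncation
    if h in seen:
        print(f"collision: {seen[h]!r} and {key!r} both hash to {h}")
        break
    seen[h] = key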
def hash(data):
    return data