Some notes about using a prime modulus in a hash function

10 Jun 2016

Modular hashing is a common technique to convert a numeric key into an array index in hash tables. Given a hash function in the form

H(n) = n mod m

it’s stated in many textbooks and articles that using a prime modulus m tends to disperse hash values evenly. One explanation is that “If m is not prime, it may be the case that not all of the bits of the key play a role, which amounts to missing an opportunity to disperse the values evenly” (Sedgewick and Wayne, p. 460). It’s not obvious why this statement holds, since n can be represented as a m-base number (n=∑a_i · mⁱ) and only the last digit plays a role.

Others say that (any) n and m should be co-prime (and a prime m has a better chance). This condition gives evenly distributed hash values, but it’s too strong. For example, if all possible n are evenly distributed, hash values follow an even distribution no matter what m is.

It turns out that the distribution of hash values is related to the greatest common divisor (GCD) of all possible n and m. To demonstrate this, we rewrite the above hash function into

n = a · m + H(n)

where a is a none negative integer. Assuming the GCD of all possible n and m is d, then we have

n = i · d = a · j · d + H(n)

and therefore

H(n) = d · (i - a · j)

Since H(n) has d as a factor, hash values are evenly distributed in range [0, m-1] iff d = 1.

If all possible n and m are co-prime, then d = 1.
If n follows a even distribution, then d = 1.
If the distribution of n is unknown, a prime m increase the probability to minimize d, and thus hash values disperse more evenly.

Madeleines

Some notes about using a prime modulus in a hash function

Related Posts

MongoDB authentication work flow at TCP level 10 Feb 2016

Rational for Church encoding 02 Feb 2016

Y combinator from λ to Javascript 28 Jan 2016