Hashing a string between two integers with a good distribution (Uniform Hash)

I'm trying to hash some strings between 0 and a very low n in order to give one color per user.

Here is my (working) code:

    function nameToColor(name) {
        var colors = ['red', 'blue', 'green', 'purple', 'orange', 'darkred', 'darkblue', 'darkgreen', 'cadetblue', 'darkpurple'];
        var hash = hashStr(name);
        var index = hash % colors.length;
        return colors[index];
    }

    // djb2 hash
    function hashStr(str) {
        var hash = 5381;
        for (var i = 0; i < str.length; i++) {
            var charCode = str.charCodeAt(i);
            hash = ((hash << 5) + hash) + charCode; /* hash * 33 + c */
        }
        return hash;
    }

Unfortunately the low numbers are massively over-represented.

Question:

How can I write a deterministic JavaScript function that takes any string as its argument and returns a number between 0 and n, with a distribution that is as uniform as possible?

asked Jun 13 '13 09:06 by L. Sanna


1 Answer

Hogan linked in a comment to several hash implementations in JavaScript. It turns out that the simplest one is the most appropriate:

    function nameToColor(name) {
        var colors = ['red', 'blue', 'green', 'purple', 'orange', 'darkred', 'darkblue', 'darkgreen', 'cadetblue', 'darkpurple'];
        var hash = hashStr(name);
        var index = hash % colors.length;
        return colors[index];
    }

    // very simple hash: sum of the character codes
    function hashStr(str) {
        var hash = 0;
        for (var i = 0; i < str.length; i++) {
            var charCode = str.charCodeAt(i);
            hash += charCode;
        }
        return hash;
    }

I think it works well because it uses only addition (no shifts or multiplications), which leaves the modulo unchanged, so the initial quality of the distribution is preserved.
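One way to check a claim like this is to count how many names land in each bucket. The following is a minimal sketch, not part of the original answer: `bucketCounts` and `sampleNames` are hypothetical names introduced here, and the sample data is made up and should be replaced with real user names.

```javascript
// Additive hash from the answer: sum of the UTF-16 code units.
function hashStr(str) {
    var hash = 0;
    for (var i = 0; i < str.length; i++) {
        hash += str.charCodeAt(i);
    }
    return hash;
}

// Hypothetical helper: count how many names fall into each of n buckets.
function bucketCounts(names, n) {
    var counts = new Array(n).fill(0);
    names.forEach(function (name) {
        counts[hashStr(name) % n]++;
    });
    return counts;
}

// Made-up sample data, for illustration only.
var sampleNames = ['alice', 'bob', 'carol', 'dave', 'erin', 'frank', 'grace', 'heidi'];
console.log(bucketCounts(sampleNames, 10));
```

Note that a sum of character codes ignores character order, so anagrams such as 'listen' and 'silent' always get the same color. Whether that matters depends on the set of names being hashed.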

I also found this on Wikipedia, but did not have to use it:

In many applications, the range of hash values may be different for each run of the program, or may change along the same run (for instance, when a hash table needs to be expanded). In those situations, one needs a hash function which takes two parameters—the input data z, and the number n of allowed hash values.

A common solution is to compute a fixed hash function with a very large range (say, 0 to 2^32 − 1), divide the result by n, and use the division's remainder. If n is itself a power of 2, this can be done by bit masking and bit shifting. When this approach is used, the hash function must be chosen so that the result has fairly uniform distribution between 0 and n − 1, for any value of n that may occur in the application. Depending on the function, the remainder may be uniform only for certain values of n, e.g. odd or prime numbers.
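The "fixed hash with a large range, then remainder" approach could look roughly like the sketch below. It reuses djb2 from the question, with one change that is an assumption of this sketch rather than something from the original posts: `>>> 0` keeps the value in the unsigned range 0 to 2^32 − 1 on every step. In JavaScript, `<<` works on signed 32-bit integers, so without that step the hash can become negative and `%` would then return a negative index. The names `hashStr32` and `hashToRange` are hypothetical.

```javascript
// djb2, kept in the unsigned 32-bit range [0, 2^32 - 1].
function hashStr32(str) {
    var hash = 5381;
    for (var i = 0; i < str.length; i++) {
        // hash * 33 + c, reduced modulo 2^32; >>> 0 makes the result non-negative.
        hash = (((hash << 5) + hash) + str.charCodeAt(i)) >>> 0;
    }
    return hash;
}

// Reduce the 32-bit hash to an index in [0, n - 1] with the remainder.
function hashToRange(str, n) {
    return hashStr32(str) % n;
}

var colors = ['red', 'blue', 'green', 'purple', 'orange'];
console.log(colors[hashToRange('some user name', colors.length)]);
```

As the quoted passage says, how uniform the remainder is depends on both the hash function and n, so it is worth measuring on real data rather than assuming.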

We can allow the table size n to not be a power of 2 and still not have to perform any remainder or division operation, as these computations are sometimes costly. For example, let n be significantly less than 2^b. Consider a pseudo random number generator (PRNG) function P(key) that is uniform on the interval [0, 2^b − 1]. A hash function uniform on the interval [0, n − 1] is n P(key) / 2^b. We can replace the division by a (possibly faster) right bit shift: (n P(key)) >> b.
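A rough sketch of the multiply-and-shift reduction described in the last quoted paragraph, under several assumptions that are not in the original: b is fixed at 16 so that n · P(key) stays below 2^32 (which requires n < 2^16), P(key) is approximated by XOR-folding an unsigned djb2 value down to 16 bits, and all function names are hypothetical. djb2 is not guaranteed to be uniform over the 16-bit range, so this illustrates the arithmetic rather than promising a uniform result.

```javascript
// djb2, truncated to an unsigned 32-bit value on every step.
function hashStr32(str) {
    var hash = 5381;
    for (var i = 0; i < str.length; i++) {
        hash = (((hash << 5) + hash) + str.charCodeAt(i)) >>> 0;
    }
    return hash;
}

// Stand-in for P(key) with b = 16: XOR the high and low 16-bit halves.
function hash16(str) {
    var h = hashStr32(str);
    return (h >>> 16) ^ (h & 0xffff);
}

// (n * P(key)) >> b with b = 16. Valid while n < 2^16, so that n * p < 2^32.
function reduceToRange(p, n) {
    return (n * p) >>> 16;
}

function nameToIndex(name, n) {
    return reduceToRange(hash16(name), n);
}

console.log(nameToIndex('some user name', 10));
```

Unlike the remainder, this reduction depends mostly on the high bits of P(key), which is why the stand-in folds both halves of the 32-bit value together first.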

answered Oct 28 '22 16:10 by L. Sanna