Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

hash functions family generator in python

I am looking for a hash functions family generator that could generate a family of hash functions given a set of parameters. I haven't found any such generator so far. Is there a way to do that with the hashlib package ?

For example I'd like to do something like :

h1 = hash_function(1)
h2 = hash_function(2)
...

and h1 and h2 would be different hash functions.

For those of you who might know about it, I am trying to implement a min-hashing algorithm on a very large dataset.

Basically, I have a very large set of features (100 millions to 1 billion) for a given document, and I need to create 1000 to 10000 different random permutations for this set of features.

I do NOT want to build the random permutations explicitly so the technique I would like to use in the following :

  1. generate a hash function h and consider that for two indices r and s
  2. r appears before s in the permutation if h(r) < h(s) and do that for 100 to 1000 different hash functions.

Are there any known libraries that I might have missed ? Or any standard way of generating families of hash functions with python that you might be aware of ?

like image 460
Nicolas M. Avatar asked Feb 12 '10 22:02

Nicolas M.


People also ask

What is a family of hash functions?

In computer science, a family of hash functions is said to be k-independent, k-wise independent or k-universal if selecting a function at random from the family guarantees that the hash codes of any designated k keys are independent random variables (see precise mathematical definitions below).

What are hash functions in Python?

Python hash() function is a built-in function and returns the hash value of an object if it has one. The hash value is an integer which is used to quickly compare dictionary keys while looking at a dictionary.

What is __ hash __ for in Python?

hash() function is used to fetch the hash values of the python objects. hash() function returns the integer value if the hashable object passed in the function as a parameter. hash() function calls inbuilt __hash__() function which can be overriden for hashing the custom objects.


2 Answers

I'd just do something like (if you don't need thread-safety -- not hard to alter if you DO need thread safety -- and assuming a 32-bit Python version):

import random

_memomask = {}

def hash_function(n):
  mask = _memomask.get(n)
  if mask is None:
    random.seed(n)
    mask = _memomask[n] = random.getrandbits(32)
  def myhash(x):
    return hash(x) ^ mask
  return myhash
like image 173
Alex Martelli Avatar answered Oct 15 '22 17:10

Alex Martelli


As mentioned above, you can use universal hashing for minhash. For example:

import random



def minhash():
    d1 = set(random.randint(0, 2000) for _ in range(1000))
    d2 = set(random.randint(0, 2000) for _ in range(1000))
    jacc_sim = len(d1.intersection(d2)) / len(d1.union(d2))
    print("jaccard similarity: {}".format(jacc_sim))

    N_HASHES = 200
    hash_funcs = []
    for i in range(N_HASHES):
        hash_funcs.append(universal_hashing())

    m1 = [min([h(e) for e in d1]) for h in hash_funcs]
    m2 = [min([h(e) for e in d2]) for h in hash_funcs]
    minhash_sim = sum(int(m1[i] == m2[i]) for i in range(N_HASHES)) / N_HASHES
    print("min-hash similarity: {}".format(minhash_sim))



def universal_hashing():
    def rand_prime():
        while True:
            p = random.randrange(2 ** 32, 2 ** 34, 2)
            if all(p % n != 0 for n in range(3, int((p ** 0.5) + 1), 2)):
                return p
    m = 2 ** 32 - 1
    p = rand_prime()
    a = random.randint(0, p)
    if a % 2 == 0:
        a += 1
    b = random.randint(0, p)
    def h(x):
        return ((a * x + b) % p) % m
    return h

Reference

like image 37
k3f9f2kf2 Avatar answered Oct 15 '22 18:10

k3f9f2kf2