Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Short Python alphanumeric hash with minimal collisions

Tags:

python

hash

I'd like to set non-integer primary keys for a table using some kind of hash function. md5() seems to be kind of long (32-characters).

What are some alternative hash functions that perhaps use every letter in the alphabet as well as integers that are perhaps shorter in string length and have low collision rates?

Thanks!

like image 986
ensnare Avatar asked Mar 24 '10 19:03

ensnare


People also ask

How do you avoid a hash collision in Python?

Chain hashing avoids collision. The idea is to make each cell of hash table point to a linked list of records that have same hash function value. Note: In Linear Probing, whenever a collision occurs, we probe to the next empty slot.

How likely is a hash collision in Python?

The mathematical expectation of the number of collisions for 64 bit hash is 4e-6 .

What is Hashlib SHA256?

Using Python hashlib to Implement SHA256. Python has a built-in library, hashlib , that is designed to provide a common interface to different secure hashing algorithms. The module provides constructor methods for each type of hash. For example, the . sha256() constructor is used to create a SHA256 hash.

What does __ hash __ do in Python?

Call __hash__ on the key to compute the hash of the key. If the key is not hashable raise a TypeError. Store (hash_value, key, value) in an array at location hash_value % len(array) .


2 Answers

Why don't you just truncate SHA1 or MD5? You'll have more collisions then if you didn't truncate, but it's still better than designing your own. Note that you can base64-encode the truncated hash, rather than using hexadecimal. E.g.

import base64 import hashlib hasher = hashlib.sha1("The quick brown fox") base64.urlsafe_b64encode(hasher.digest()[:10]) 

You can truncate as little (including not at all) or as much as you want, as long as you understand the tradeoffs.

EDIT: Since you mentioned URL-safe, you can use urlsafe_b64encode and urlsafe_b64decode, which uses - and _ rather than + and /.

like image 117
Matthew Flaschen Avatar answered Sep 19 '22 20:09

Matthew Flaschen


The smallest builtin hash I am aware of is md5

>>> import hashlib, base64 >>> d=hashlib.md5(b"hello worlds").digest(); d=base64.b64encode(d);  >>> print(d)  b'S27ylES0wiLdFAGdUpFgCQ==' 

Low collision and short are somewhat mutually exclusive due to the birthday paradox

To make it urlsafe you need to use the function from the base64 module

>>> import base64 >>> base64.urlsafe_b64encode(hashlib.md5("hello world").digest()) 'XrY7u-Ae7tCTyyK7j1rNww==' 

However there should be no problem storing the 16 byte md5 digest in the database in binary form.

>>> md5bytes=hashlib.md5("hello world").digest() >>> len(md5bytes) 16 >>> urllib.quote_plus(md5bytes) '%5E%B6%3B%BB%E0%1E%EE%D0%93%CB%22%BB%8FZ%CD%C3' 

Python 2

>>> base64.urlsafe_b64encode(md5bytes) 'XrY7u-Ae7tCTyyK7j1rNww==' 

Python 3

>>> base64.urlsafe_b64encode(md5bytes).decode('ascii') 'XrY7u-Ae7tCTyyK7j1rNww==' 

You can choose either the quote_plus or the urlsafe_b64encode for your url, then decode with the corresponding function unquote_plus or urlsafe_b64decode before you look them up in the database.

like image 35
John La Rooy Avatar answered Sep 19 '22 20:09

John La Rooy