Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Shortest hash in python to name cache files

Tags:

What is the shortest hash (in filename-usable form, like a hexdigest) available in python? My application wants to save cache files for some objects. The objects must have unique repr() so they are used to 'seed' the filename. I want to produce a possibly unique filename for each object (not that many). They should not collide, but if they do my app will simply lack cache for that object (and will have to reindex that object's data, a minor cost for the application).

So, if there is one collision we lose one cache file, but it is the collected savings of caching all objects makes the application startup much faster, so it does not matter much.

Right now I'm actually using abs(hash(repr(obj))); that's right, the string hash! Haven't found any collisions yet, but I would like to have a better hash function. hashlib.md5 is available in the python library, but the hexdigest is really long if put in a filename. Alternatives, with reasonable collision resistance?

Edit: Use case is like this: The data loader gets a new instance of a data-carrying object. Unique types have unique repr. so if a cache file for hash(repr(obj)) exists, I unpickle that cache file and replace obj with the unpickled object. If there was a collision and the cache was a false match I notice. So if we don't have cache or have a false match, I instead init obj (reloading its data).

Conclusions (?)

The str hash in python may be good enough, I was only worried about its collision resistance. But if I can hash 2**16 objects with it, it's going to be more than good enough.

I found out how to take a hex hash (from any hash source) and store it compactly with base64:

# 'h' is a string of hex digits 
bytes = "".join(chr(int(h[i:i+2], 16)) for i in xrange(0, len(h), 2))
hashstr = base64.urlsafe_b64encode(bytes).rstrip("=")
like image 907
u0b34a0f6ae Avatar asked Aug 19 '09 22:08

u0b34a0f6ae


People also ask

Does Python cache hash?

Sure, it's fine to cache the hash value. In fact, Python does so for strings itself. The trade-off is between the speed of the hash calculation and the space it takes to save the hash value. That trade-off is for example why tuples don't cache their hash value, but strings do (see request for enhancement #1462796).

What does @cache mean in Python?

Caching is an optimization technique that you can use in your applications to keep recent or often-used data in memory locations that are faster or computationally cheaper to access than their source. Imagine you're building a newsreader application that fetches the latest news from different sources.

What is hash collision in Python?

Hash tables are used to store data using a pair of keys and values. A hash function uses a mathematical algorithm to calculate the hash value. A collision occurs when the same hash value is generated for more than one value. Chaining solves collision by creating linked lists.


2 Answers

The birthday paradox applies: given a good hash function, the expected number of hashes before a collision occurs is about sqrt(N), where N is the number of different values that the hash function can take. (The wikipedia entry I've pointed to gives the exact formula). So, for example, if you want to use no more than 32 bits, your collision worries are serious for around 64K objects (i.e., 2**16 objects -- the square root of the 2**32 different values your hash function can take). How many objects do you expect to have, as an order of magnitude?

Since you mention that a collision is a minor annoyance, I recommend you aim for a hash length that's roughly the square of the number of objects you'll have, or a bit less but not MUCH less than that.

You want to make a filename - is that on a case-sensitive filesystem, as typical on Unix, or do you have to cater for case-insensitive systems too? This matters because you aim for short filenames, but the number of bits per character you can use to represent your hash as a filename changes dramatically on case-sensive vs insensitive systems.

On a case-sensitive system, you can use the standard library's base64 module (I recommend the "urlsafe" version of the encoding, i.e. this function, as avoiding '/' characters that could be present in plain base64 is important in Unix filenames). This gives you 6 usable bits per character, much better than the 4 bits/char in hex.

Even on a case-insensitive system, you can still do better than hex -- use base64.b32encode and get 5 bits per character.

These functions take and return strings; use the struct module to turn numbers into strings if your chosen hash function generates numbers.

If you do have a few tens of thousands of objects I think you'll be fine with builtin hash (32 bits, so 6-7 characters depending on your chosen encoding). For a million objects you'd want 40 bits or so (7 or 8 characters) -- you can fold (xor, don't truncate;-) a sha256 down to a long with a reasonable number of bits, say 128 or so, and use the % operator to cut it further to your desired length before encoding.

like image 155
Alex Martelli Avatar answered Sep 27 '22 19:09

Alex Martelli


The builtin hash function of strings is fairly collision free, and also fairly short. It has 2**32 values, so it is fairly unlikely that you encounter collisions (if you use its abs value, it will have only 2**31 values).

You have been asking for the shortest hash function. That would certainly be

def hash(s):
  return 0

but I guess you didn't really mean it that way...

like image 34
Martin v. Löwis Avatar answered Sep 27 '22 17:09

Martin v. Löwis