Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

The most effective way to assign unique integer id to a string?

Tags:

python

hash

The program that I write processes a large number of objects, each with its own unique id, which itself is a string of complicated structure (dozen of unique fields of the object joined by some separator) and big length.

Since I have to process a lot of these objects fast and I need to reffer to them by id while processing and I have no power to change their format (I retrieve them externally, by network), I want to map their complicated string id to my own internal integer id and further use it for comparison, for transfering them further to other processes, etc.

What I'm going to do is to use a simple dict with keys as string id of the object and integer values as my internal integer id of it.

My question is: is there a better way in Python to do this? May be there is a way to calculate some hash manually, whatever? May be the dict is not the best solution?

As for numbers: there are about 100K of such unique objects in the system at a time, so the integer capacity is more than enough.

like image 980
lithuak Avatar asked Mar 08 '12 14:03

lithuak


2 Answers

For comparison purposes, you can intern the strings and then compare them with is instead of ==, which does a simple pointer comparison and should be as fast as (or faster than) comparing two integers:

>>> 'foo' * 100 is 'foo' * 100
False
>>> intern('foo' * 100) is intern('foo' * 100)
True

intern guarantees that id(intern(A)) == id(intern(B)) iff A == B. Be sure to intern any string as soon as it is input. Note that intern is called sys.intern in Python 3.x.

But when you have to pass these strings to other processes, your dict solution seems best. What I usually do in such situations is

str_to_id = {}
for s in strings:
    str_to_id.setdefault(s, len(str_to_id))

so the integer capacity is more than enough

Python integers are bigints, so that should never be a problem.

like image 129
Fred Foo Avatar answered Oct 16 '22 11:10

Fred Foo


How about the hash function?

In [130]: hash
Out[130]: <function hash>

In [131]: hash('foo')
Out[131]: -740391237

There is no need to store hashes (unless you want to): the point is that they are equal for objects that are value-equal (although the reverse may not be true - there are no doubt unequal strings or other objects that hash to the same value; such is the nature of hashing).

If you know the range of your keys (and you probably do), you could also use a perfect hash function generator. This is apparently one for python: http://ilan.schnell-web.net/prog/perfect-hash/

Perfect hashes guarantee that keys within the specified range have a bijective relationship with their hash value.

like image 21
Marcin Avatar answered Oct 16 '22 11:10

Marcin