The program that I write processes a large number of objects, each with its own unique id, which itself is a string of complicated structure (dozen of unique fields of the object joined by some separator) and big length.
Since I have to process a lot of these objects fast and I need to reffer to them by id while processing and I have no power to change their format (I retrieve them externally, by network), I want to map their complicated string id to my own internal integer id and further use it for comparison, for transfering them further to other processes, etc.
What I'm going to do is to use a simple dict with keys as string id of the object and integer values as my internal integer id of it.
My question is: is there a better way in Python to do this? May be there is a way to calculate some hash manually, whatever? May be the dict is not the best solution?
As for numbers: there are about 100K of such unique objects in the system at a time, so the integer capacity is more than enough.
For comparison purposes, you can intern
the strings and then compare them with is
instead of ==
, which does a simple pointer comparison and should be as fast as (or faster than) comparing two integers:
>>> 'foo' * 100 is 'foo' * 100
False
>>> intern('foo' * 100) is intern('foo' * 100)
True
intern
guarantees that id(intern(A)) == id(intern(B))
iff A == B
. Be sure to intern
any string as soon as it is input. Note that intern
is called sys.intern
in Python 3.x.
But when you have to pass these strings to other processes, your dict
solution seems best. What I usually do in such situations is
str_to_id = {}
for s in strings:
str_to_id.setdefault(s, len(str_to_id))
so the integer capacity is more than enough
Python integers are bigints, so that should never be a problem.
How about the hash
function?
In [130]: hash
Out[130]: <function hash>
In [131]: hash('foo')
Out[131]: -740391237
There is no need to store hashes (unless you want to): the point is that they are equal for objects that are value-equal (although the reverse may not be true - there are no doubt unequal strings or other objects that hash to the same value; such is the nature of hashing).
If you know the range of your keys (and you probably do), you could also use a perfect hash function generator. This is apparently one for python: http://ilan.schnell-web.net/prog/perfect-hash/
Perfect hashes guarantee that keys within the specified range have a bijective relationship with their hash value.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With