
Optimizing Python Dictionary Lookup Speeds by Shortening Key Size?

I'm not clear on what goes on behind the scenes of a dictionary lookup. Does key size factor into the speed of lookup for that key?

The current dictionary keys are alphanumeric, between 10 and 20 characters long.

I need to do hundreds of lookups a minute.

If I replace those with smaller key IDs of between 1 and 4 digits, will I get faster lookup times? This would mean adding another value to each item the dictionary holds, so overall the dictionary would be larger.

Also, I'd need to change the program to look up the ID and then get the URL associated with it.

Am I likely just adding complexity to the program with little benefit?

asked Oct 21 '14 by Excelsior



2 Answers

Dictionaries are hash tables, so looking up a key consists of:

  • Hash the key.
  • Reduce the hash to the table size.
  • Index the table with the result.
  • Compare the looked-up key with the input key.

Normally, this is amortized constant time, and you don't care about anything more than that. There are two potential issues, but they don't come up often.
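
To make that concrete, here's a toy sketch of those four steps in plain Python. It's not CPython's actual implementation (the real dict is written in C and uses a smarter probe sequence), just the shape of the algorithm:

```python
# Toy open-addressing lookup, illustrating the steps above.
# `table` is a list of (key, value) pairs or None for empty slots;
# CPython's real dict is considerably more sophisticated.
def lookup(table, key):
    h = hash(key)                    # 1. hash the key
    i = h % len(table)               # 2. reduce the hash to the table size
    while table[i] is not None:      # 3. index the table with the result
        stored_key, value = table[i]
        if stored_key == key:        # 4. compare the stored key with the input key
            return value
        i = (i + 1) % len(table)     # collision: linearly probe the next slot
    raise KeyError(key)
```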


Hashing the key takes linear time in the length of the key. For, e.g., huge strings, this could be a problem. However, if you look at the source code for most of the important types, including [str/unicode](https://hg.python.org/cpython/file/default/Objects/unicodeobject.c), you'll see that they cache the hash the first time. So, unless you're inputting (or randomly creating, or whatever) a bunch of strings to look up once and then throw away, this is unlikely to be an issue in most real-life programs.

On top of that, 20 characters is really pretty short; you can probably do millions of such hashes per second, not hundreds.

From a quick test on my computer, hashing 20 random letters takes 973ns, hashing a 4-digit number takes 94ns, and hashing a value I've already hashed takes 77ns. Yes, that's nanoseconds.
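
The exact numbers will differ on your machine, but if you want to reproduce that kind of measurement yourself, a rough timeit sketch might look like this (the per-key figures include a little loop overhead):

```python
import random
import string
import timeit

# Fresh 20-character strings, so the first hash() call really has to compute the hash.
keys = [''.join(random.choice(string.ascii_letters) for _ in range(20))
        for _ in range(100_000)]

t_first = timeit.timeit(lambda: [hash(k) for k in keys], number=1)   # first hash of each key
t_cached = timeit.timeit(lambda: [hash(k) for k in keys], number=1)  # str caches its hash, so this is cheaper

# Short 4-digit keys, hashed for the first time.
short_keys = [str(random.randint(1000, 9999)) for _ in range(100_000)]
t_short = timeit.timeit(lambda: [hash(k) for k in short_keys], number=1)

print(t_first / len(keys) * 1e9, "ns per 20-char first hash")
print(t_cached / len(keys) * 1e9, "ns per cached hash")
print(t_short / len(short_keys) * 1e9, "ns per 4-digit first hash")
```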


Meanwhile, "Index the table with the result" is a bit of a cheat. What happens if two different keys hash to the same index? Then "compare the looked-up key" will fail, and… what happens next? CPython's implementation uses probing for this. The exact algorithm is explained pretty nicely in the source. But you'll notice that given really pathological data, you could end up doing a linear search for every single element. This is never going to come up—unless someone can attack your program by explicitly crafting pathological data, in which case it will definitely come up.

Switching from 20-character strings to 4-digit numbers wouldn't help here either. If I'm crafting keys to DoS your system via dictionary collisions, I don't care what your actual keys look like, just what they hash to.
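
You can see the collision effect with a deliberately broken user-defined hash (a stand-in for crafted colliding keys, not a real attack):

```python
import timeit

class BadKey:
    """Every instance hashes to the same bucket, so lookups degrade to a linear probe."""
    def __init__(self, value):
        self.value = value
    def __hash__(self):
        return 42                    # all keys collide
    def __eq__(self, other):
        return isinstance(other, BadKey) and self.value == other.value

good = {i: None for i in range(2_000)}
bad = {BadKey(i): None for i in range(2_000)}   # even building this is quadratic

t_good = timeit.timeit(lambda: 1_999 in good, number=1_000)
t_bad = timeit.timeit(lambda: BadKey(1_999) in bad, number=1_000)
print(t_good, t_bad)   # the colliding lookups are orders of magnitude slower
```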


More generally, premature optimization is the root of all evil. This is sometimes misquoted to overstate the point; Knuth was arguing that the most important thing to do is find the 3% of the cases where optimization is important, not that optimization is always a waste of time. But either way, the point is: if you don't know in advance where your program is too slow (and if you think you know in advance, you're usually wrong…), profile it, and then find the part where you get the most bang for your buck. Optimizing one arbitrary piece of your code is likely to have no measurable effect at all.
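
If you're not sure where your program actually spends its time, the standard library's profiler is the usual first stop. A minimal sketch, where process_urls() is a hypothetical stand-in for the code you suspect is slow:

```python
import cProfile
import pstats

# process_urls is a placeholder for whatever your program really does.
cProfile.run('process_urls()', 'profile.out')
pstats.Stats('profile.out').sort_stats('cumulative').print_stats(10)   # top 10 by cumulative time
```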

answered Sep 29 '22 by abarnert


Python dictionaries are implemented as hash maps behind the scenes. The key length might have some impact on performance if, for example, the hash function's complexity depends on the key length, but in general the performance impact will be negligible.

So I'd say the added complexity would bring little to no benefit.
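
If you want to check this against something like your own data, a quick (and rough) comparison of long string keys versus short ID keys might look like the following; the key lengths and URLs are just placeholders:

```python
import random
import string
import timeit

long_keys = [''.join(random.choice(string.ascii_lowercase + string.digits) for _ in range(15))
             for _ in range(10_000)]
short_keys = [str(i) for i in range(10_000)]

d_long = {k: 'http://example.com/' + k for k in long_keys}
d_short = {k: 'http://example.com/' + k for k in short_keys}

# Both timings include the cost of random.choice(), which dominates;
# the point is that the two come out nearly identical.
t_long = timeit.timeit(lambda: d_long[random.choice(long_keys)], number=100_000)
t_short = timeit.timeit(lambda: d_short[random.choice(short_keys)], number=100_000)
print(t_long, t_short)
```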

answered Sep 29 '22 by torpedro