Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimize the performance of dictionary membership for a list of Keys

I am trying to write a code which should return true if any element of list is present in a dictionary. Performance of this piece is really important. I know I can just loop over list and break if I find the first search hit. Is there any faster or more Pythonic way for this than given below?

for x in someList:
     if x in someDict:
           return True
return False

EDIT: I am using Python 2.7. My first preference would be a faster method.

like image 448
Naman Avatar asked Apr 24 '15 05:04

Naman


2 Answers

Use of builtin any can have some performance edge over two loops

any(x in someDict for x in someList)

but you might need to measure your mileage. If your list and dict remains pretty static and you have to perform the comparison multiple times, you may consider using set

someSet = set(someList) 
someDict.viewkeys() & someSet 

Note Python 3.X, by default returns views rather than a sequence, so it would be straight forward when using Python 3.X

someSet = set(someList) 
someDict.keys() & someSet 

In both the above cases you can wrap the result with a bool to get a boolean result

bool(someDict.keys() & set(someSet ))

Heretic Note

My curiosity got the better of me and I timed all the proposed solutions. It seems that your original solution is better performance wise. Here is the result

Sample Randomly generated Input

def test_data_gen():
    from random import sample
    for i in range(1,5):
        n = 10**i
        population = set(range(1,100000))
        some_list = sample(list(population),n)
        population.difference_update(some_list)
        some_dict = dict(zip(sample(population,n),
                             sample(range(1,100000),n)))
        yield "Population Size of {}".format(n), (some_list, some_dict), {}

The Test Engine

I rewrote the Test Part of the answer as it was messy and the answer was receiving quite a decent attention. I created a timeit compare python module and moved it onto github

The Test Result

Timeit repeated for 10 times
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
======================================
Test Run for Population Size of 10
======================================
|Rank  |FunctionName         |Result    |Description
+------+---------------------+----------+-----------------------------------------------
|     1|foo_nested           |0.000011  |Original OPs Code
+------+---------------------+----------+-----------------------------------------------
|     2|foo_ifilter_any      |0.000014  |any(ifilter(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     3|foo_ifilter_not_not  |0.000015  |not not next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     4|foo_imap_any         |0.000018  |any(imap(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     5|foo_any              |0.000019  |any(x in some_dict for x in some_list)
+------+---------------------+----------+-----------------------------------------------
|     6|foo_ifilter_next     |0.000022  |bool(next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     7|foo_set_ashwin       |0.000024  |not set(some_dct).isdisjoint(some_lst)
+------+---------------------+----------+-----------------------------------------------
|     8|foo_set              |0.000047  |some_dict.viewkeys() & set(some_list )
======================================
Test Run for Population Size of 100
======================================
|Rank  |FunctionName         |Result    |Description
+------+---------------------+----------+-----------------------------------------------
|     1|foo_ifilter_any      |0.000071  |any(ifilter(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     2|foo_nested           |0.000072  |Original OPs Code
+------+---------------------+----------+-----------------------------------------------
|     3|foo_ifilter_not_not  |0.000073  |not not next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     4|foo_ifilter_next     |0.000076  |bool(next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     5|foo_imap_any         |0.000082  |any(imap(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     6|foo_any              |0.000092  |any(x in some_dict for x in some_list)
+------+---------------------+----------+-----------------------------------------------
|     7|foo_set_ashwin       |0.000170  |not set(some_dct).isdisjoint(some_lst)
+------+---------------------+----------+-----------------------------------------------
|     8|foo_set              |0.000638  |some_dict.viewkeys() & set(some_list )
======================================
Test Run for Population Size of 1000
======================================
|Rank  |FunctionName         |Result    |Description
+------+---------------------+----------+-----------------------------------------------
|     1|foo_ifilter_not_not  |0.000746  |not not next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     2|foo_ifilter_any      |0.000746  |any(ifilter(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     3|foo_ifilter_next     |0.000752  |bool(next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     4|foo_nested           |0.000771  |Original OPs Code
+------+---------------------+----------+-----------------------------------------------
|     5|foo_set_ashwin       |0.000838  |not set(some_dct).isdisjoint(some_lst)
+------+---------------------+----------+-----------------------------------------------
|     6|foo_imap_any         |0.000842  |any(imap(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     7|foo_any              |0.000933  |any(x in some_dict for x in some_list)
+------+---------------------+----------+-----------------------------------------------
|     8|foo_set              |0.001702  |some_dict.viewkeys() & set(some_list )
======================================
Test Run for Population Size of 10000
======================================
|Rank  |FunctionName         |Result    |Description
+------+---------------------+----------+-----------------------------------------------
|     1|foo_nested           |0.007195  |Original OPs Code
+------+---------------------+----------+-----------------------------------------------
|     2|foo_ifilter_next     |0.007410  |bool(next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     3|foo_ifilter_any      |0.007491  |any(ifilter(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     4|foo_ifilter_not_not  |0.007671  |not not next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     5|foo_set_ashwin       |0.008385  |not set(some_dct).isdisjoint(some_lst)
+------+---------------------+----------+-----------------------------------------------
|     6|foo_imap_any         |0.011327  |any(imap(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     7|foo_any              |0.011533  |any(x in some_dict for x in some_list)
+------+---------------------+----------+-----------------------------------------------
|     8|foo_set              |0.018313  |some_dict.viewkeys() & set(some_list )
Timeit repeated for 100 times
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
======================================
Test Run for Population Size of 10
======================================
|Rank  |FunctionName         |Result    |Description
+------+---------------------+----------+-----------------------------------------------
|     1|foo_nested           |0.000098  |Original OPs Code
+------+---------------------+----------+-----------------------------------------------
|     2|foo_ifilter_any      |0.000124  |any(ifilter(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     3|foo_ifilter_not_not  |0.000131  |not not next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     4|foo_imap_any         |0.000142  |any(imap(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     5|foo_ifilter_next     |0.000151  |bool(next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     6|foo_any              |0.000158  |any(x in some_dict for x in some_list)
+------+---------------------+----------+-----------------------------------------------
|     7|foo_set_ashwin       |0.000186  |not set(some_dct).isdisjoint(some_lst)
+------+---------------------+----------+-----------------------------------------------
|     8|foo_set              |0.000496  |some_dict.viewkeys() & set(some_list )
======================================
Test Run for Population Size of 100
======================================
|Rank  |FunctionName         |Result    |Description
+------+---------------------+----------+-----------------------------------------------
|     1|foo_ifilter_any      |0.000661  |any(ifilter(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     2|foo_ifilter_not_not  |0.000677  |not not next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     3|foo_nested           |0.000683  |Original OPs Code
+------+---------------------+----------+-----------------------------------------------
|     4|foo_ifilter_next     |0.000684  |bool(next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     5|foo_imap_any         |0.000762  |any(imap(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     6|foo_any              |0.000854  |any(x in some_dict for x in some_list)
+------+---------------------+----------+-----------------------------------------------
|     7|foo_set_ashwin       |0.001291  |not set(some_dct).isdisjoint(some_lst)
+------+---------------------+----------+-----------------------------------------------
|     8|foo_set              |0.005018  |some_dict.viewkeys() & set(some_list )
======================================
Test Run for Population Size of 1000
======================================
|Rank  |FunctionName         |Result    |Description
+------+---------------------+----------+-----------------------------------------------
|     1|foo_ifilter_any      |0.007585  |any(ifilter(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     2|foo_nested           |0.007713  |Original OPs Code
+------+---------------------+----------+-----------------------------------------------
|     3|foo_set_ashwin       |0.008256  |not set(some_dct).isdisjoint(some_lst)
+------+---------------------+----------+-----------------------------------------------
|     4|foo_ifilter_not_not  |0.008526  |not not next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     5|foo_any              |0.009422  |any(x in some_dict for x in some_list)
+------+---------------------+----------+-----------------------------------------------
|     6|foo_ifilter_next     |0.010259  |bool(next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     7|foo_imap_any         |0.011414  |any(imap(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     8|foo_set              |0.019862  |some_dict.viewkeys() & set(some_list )
======================================
Test Run for Population Size of 10000
======================================
|Rank  |FunctionName         |Result    |Description
+------+---------------------+----------+-----------------------------------------------
|     1|foo_imap_any         |0.082221  |any(imap(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     2|foo_ifilter_any      |0.083573  |any(ifilter(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     3|foo_nested           |0.095736  |Original OPs Code
+------+---------------------+----------+-----------------------------------------------
|     4|foo_set_ashwin       |0.103427  |not set(some_dct).isdisjoint(some_lst)
+------+---------------------+----------+-----------------------------------------------
|     5|foo_ifilter_next     |0.104589  |bool(next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     6|foo_ifilter_not_not  |0.117974  |not not next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     7|foo_any              |0.127739  |any(x in some_dict for x in some_list)
+------+---------------------+----------+-----------------------------------------------
|     8|foo_set              |0.208228  |some_dict.viewkeys() & set(some_list )
Timeit repeated for 1000 times
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
======================================
Test Run for Population Size of 10
======================================
|Rank  |FunctionName         |Result    |Description
+------+---------------------+----------+-----------------------------------------------
|     1|foo_nested           |0.000953  |Original OPs Code
+------+---------------------+----------+-----------------------------------------------
|     2|foo_ifilter_any      |0.001134  |any(ifilter(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     3|foo_ifilter_not_not  |0.001213  |not not next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     4|foo_ifilter_next     |0.001340  |bool(next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     5|foo_imap_any         |0.001407  |any(imap(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     6|foo_any              |0.001535  |any(x in some_dict for x in some_list)
+------+---------------------+----------+-----------------------------------------------
|     7|foo_set_ashwin       |0.002252  |not set(some_dct).isdisjoint(some_lst)
+------+---------------------+----------+-----------------------------------------------
|     8|foo_set              |0.004701  |some_dict.viewkeys() & set(some_list )
======================================
Test Run for Population Size of 100
======================================
|Rank  |FunctionName         |Result    |Description
+------+---------------------+----------+-----------------------------------------------
|     1|foo_ifilter_any      |0.006209  |any(ifilter(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     2|foo_ifilter_next     |0.006411  |bool(next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     3|foo_ifilter_not_not  |0.006657  |not not next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     4|foo_nested           |0.006727  |Original OPs Code
+------+---------------------+----------+-----------------------------------------------
|     5|foo_imap_any         |0.007562  |any(imap(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     6|foo_any              |0.008262  |any(x in some_dict for x in some_list)
+------+---------------------+----------+-----------------------------------------------
|     7|foo_set_ashwin       |0.012260  |not set(some_dct).isdisjoint(some_lst)
+------+---------------------+----------+-----------------------------------------------
|     8|foo_set              |0.046773  |some_dict.viewkeys() & set(some_list )
======================================
Test Run for Population Size of 1000
======================================
|Rank  |FunctionName         |Result    |Description
+------+---------------------+----------+-----------------------------------------------
|     1|foo_ifilter_not_not  |0.071888  |not not next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     2|foo_ifilter_next     |0.072150  |bool(next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     3|foo_nested           |0.073382  |Original OPs Code
+------+---------------------+----------+-----------------------------------------------
|     4|foo_ifilter_any      |0.075698  |any(ifilter(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     5|foo_set_ashwin       |0.077367  |not set(some_dct).isdisjoint(some_lst)
+------+---------------------+----------+-----------------------------------------------
|     6|foo_imap_any         |0.090623  |any(imap(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     7|foo_any              |0.093301  |any(x in some_dict for x in some_list)
+------+---------------------+----------+-----------------------------------------------
|     8|foo_set              |0.177051  |some_dict.viewkeys() & set(some_list )
======================================
Test Run for Population Size of 10000
======================================
|Rank  |FunctionName         |Result    |Description
+------+---------------------+----------+-----------------------------------------------
|     1|foo_nested           |0.701317  |Original OPs Code
+------+---------------------+----------+-----------------------------------------------
|     2|foo_ifilter_next     |0.706156  |bool(next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     3|foo_ifilter_any      |0.723368  |any(ifilter(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     4|foo_ifilter_not_not  |0.746650  |not not next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     5|foo_set_ashwin       |0.776704  |not set(some_dct).isdisjoint(some_lst)
+------+---------------------+----------+-----------------------------------------------
|     6|foo_imap_any         |0.832117  |any(imap(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     7|foo_any              |0.881777  |any(x in some_dict for x in some_list)
+------+---------------------+----------+-----------------------------------------------
|     8|foo_set              |1.665962  |some_dict.viewkeys() & set(some_list )
Timeit repeated for 10000 times
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
======================================
Test Run for Population Size of 10
======================================
|Rank  |FunctionName         |Result    |Description
+------+---------------------+----------+-----------------------------------------------
|     1|foo_nested           |0.010581  |Original OPs Code
+------+---------------------+----------+-----------------------------------------------
|     2|foo_ifilter_any      |0.013512  |any(ifilter(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     3|foo_imap_any         |0.015321  |any(imap(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     4|foo_ifilter_not_not  |0.017680  |not not next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     5|foo_ifilter_next     |0.019334  |bool(next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     6|foo_any              |0.026274  |any(x in some_dict for x in some_list)
+------+---------------------+----------+-----------------------------------------------
|     7|foo_set_ashwin       |0.030881  |not set(some_dct).isdisjoint(some_lst)
+------+---------------------+----------+-----------------------------------------------
|     8|foo_set              |0.053605  |some_dict.viewkeys() & set(some_list )
======================================
Test Run for Population Size of 100
======================================
|Rank  |FunctionName         |Result    |Description
+------+---------------------+----------+-----------------------------------------------
|     1|foo_nested           |0.070194  |Original OPs Code
+------+---------------------+----------+-----------------------------------------------
|     2|foo_ifilter_not_not  |0.078524  |not not next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     3|foo_ifilter_any      |0.079499  |any(ifilter(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     4|foo_imap_any         |0.087349  |any(imap(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     5|foo_ifilter_next     |0.093970  |bool(next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     6|foo_any              |0.097948  |any(x in some_dict for x in some_list)
+------+---------------------+----------+-----------------------------------------------
|     7|foo_set_ashwin       |0.130725  |not set(some_dct).isdisjoint(some_lst)
+------+---------------------+----------+-----------------------------------------------
|     8|foo_set              |0.480841  |some_dict.viewkeys() & set(some_list )
======================================
Test Run for Population Size of 1000
======================================
|Rank  |FunctionName         |Result    |Description
+------+---------------------+----------+-----------------------------------------------
|     1|foo_ifilter_any      |0.754491  |any(ifilter(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     2|foo_ifilter_not_not  |0.756253  |not not next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     3|foo_ifilter_next     |0.771382  |bool(next(ifilter(some_dict.__contains__...
+------+---------------------+----------+-----------------------------------------------
|     4|foo_nested           |0.787152  |Original OPs Code
+------+---------------------+----------+-----------------------------------------------
|     5|foo_set_ashwin       |0.818520  |not set(some_dct).isdisjoint(some_lst)
+------+---------------------+----------+-----------------------------------------------
|     6|foo_imap_any         |0.902947  |any(imap(some_dict.__contains__, some_list))
+------+---------------------+----------+-----------------------------------------------
|     7|foo_any              |1.001810  |any(x in some_dict for x in some_list)
+------+---------------------+----------+-----------------------------------------------
|     8|foo_set              |2.012781  |some_dict.viewkeys() & set(some_list )
=======================================
Test Run for Population Size of 10000
=======================================
|Rank  |FunctionName         |Result     |Description
+------+---------------------+-----------+-----------------------------------------------
|     1|foo_imap_any         |10.071469  |any(imap(some_dict.__contains__, some_list))
+------+---------------------+-----------+-----------------------------------------------
|     2|foo_any              |11.127034  |any(x in some_dict for x in some_list)
+------+---------------------+-----------+-----------------------------------------------
|     3|foo_set              |18.881414  |some_dict.viewkeys() & set(some_list )
+------+---------------------+-----------+-----------------------------------------------
|     4|foo_nested           |8.731133   |Original OPs Code
+------+---------------------+-----------+-----------------------------------------------
|     5|foo_ifilter_not_not  |9.019190   |not not next(ifilter(some_dict.__contains__...
+------+---------------------+-----------+-----------------------------------------------
|     6|foo_ifilter_next     |9.189966   |bool(next(ifilter(some_dict.__contains__...
+------+---------------------+-----------+-----------------------------------------------
|     7|foo_set_ashwin       |9.363886   |not set(some_dct).isdisjoint(some_lst)
+------+---------------------+-----------+-----------------------------------------------
|     8|foo_ifilter_any      |9.442759   |any(ifilter(some_dict.__contains__, some_list))

And a Graphical Comparison from the above referred module

enter image description hereenter image description hereenter image description hereenter image description here

Conclusion

Premature optimization is evil. It is evident that none of the solutions have optimal performance when the test domain varies. Depending on population size and frequency of iteration, performance of solutions varies considerably. The result again speaks out about the fact that in Python, one should ensure that the code should be readable rather than ensuring that the code is either nifty or optimized for performance for certain cases, but then it may not be scalable.

Note There were some doubts on why not using ifilter performs better than the rest

"In Abhit's answer, he timed the different approaches and found that ifilter/next was not the fastest; any idea why this would be the case? "

It is a known fact that in python, there is an overhead when calling C functions, and if the population size is low but the frequency of iteration is high, the accumulation of C function call overhead would slowly show up. As can be seen in the graphs, where population size is low but iteration is high, using ifilter, performance deviates considerably.

like image 115
Abhijit Avatar answered Sep 19 '22 21:09

Abhijit


The cleanest and fastest way is to use any() and itertools.ifilter():

any(ifilter(someDict.__contains__, someList))

This code uses:

  • a bound method, someDict.__contains__ as the predicate
  • the ifilter() itertool lazily scans for a true result at C speed
  • the any() built-in drives the filter to find a single matching occurrence or to return False if the iterator is exhausted.

You could also use itertools.imap() instead of ifilter().

like image 34
Raymond Hettinger Avatar answered Sep 23 '22 21:09

Raymond Hettinger