I have a (potentially quite big) dictionary and a list of 'possible' keys. I want to quickly find which of the keys have matching values in the dictionary. I've found lots of discussion of single dictionary values here and here, but no discussion of speed or multiple entries.
I've come up with four ways, and for the three that work best I compare their speed on different sample sizes below - are there better methods? If people can suggest sensible contenders I'll subject them to the analysis below as well.
Sample lists and dictionaries are created as follows:
import cProfile
from random import randint
length = 100000
listOfRandomInts = [randint(0,length*length/10-1) for x in range(length)]
dictionaryOfRandomInts = {randint(0,length*length/10-1): "It's here" for x in range(length)}
Method 1: the 'in'
keyword:
def way1(theList,theDict):
resultsList = []
for listItem in theList:
if listItem in theDict:
resultsList.append(theDict[listItem])
return resultsList
cProfile.run('way1(listOfRandomInts,dictionaryOfRandomInts)')
32 function calls in 0.018 seconds
Method 2: error handling:
def way2(theList,theDict):
resultsList = []
for listItem in theList:
try:
resultsList.append(theDict[listItem])
except:
;
return resultsList
cProfile.run('way2(listOfRandomInts,dictionaryOfRandomInts)')
32 function calls in 0.087 seconds
Method 3: set intersection:
def way3(theList,theDict):
return list(set(theList).intersection(set(theDict.keys())))
cProfile.run('way3(listOfRandomInts,dictionaryOfRandomInts)')
26 function calls in 0.046 seconds
Method 4: Naive use of dict.keys()
:
This is a cautionary tale - it was my first attempt and BY FAR the slowest!
def way4(theList,theDict):
resultsList = []
keys = theDict.keys()
for listItem in theList:
if listItem in keys:
resultsList.append(theDict[listItem])
return resultsList
cProfile.run('way4(listOfRandomInts,dictionaryOfRandomInts)')
12 function calls in 248.552 seconds
EDIT: Bringing the suggestions given in the answers into the same framework that I've used for consistency. Many have noted that more performance gains can be achieved in Python 3.x, particularly list comprehension-based methods. Many thanks for all of the help!
Method 5: Better way of performing intersection (thanks jonrsharpe):
def way5(theList, theDict):
return = list(set(theList).intersection(theDict))
25 function calls in 0.037 seconds
Method 6: List comprehension (thanks jonrsharpe):
def way6(theList, theDict):
return [item for item in theList if item in theDict]
24 function calls in 0.020 seconds
Method 7: Using the &
keyword (thanks jonrsharpe):
def way7(theList, theDict):
return list(theDict.viewkeys() & theList)
25 function calls in 0.026 seconds
For methods 1-3 and 5-7 I timed them as above with list/dictionaries of length 1000, 10000, 100000, 1000000, 10000000 and 100000000 and show a log-log plot of time taken. Across all lengths the intersection and in-statement method perform better. The gradients are all about 1 (maybe a bit higher), indicating O(n) or perhaps slightly super-linear scaling.
You can get all the keys in the dictionary as a Python List. dict. keys() returns an iterable of type dict_keys() . You can convert this into a list using list() .
keys() method in Python Dictionary, returns a view object that displays a list of all the keys in the dictionary in order of insertion.
Of a couple of additional methods I've tried, the fastest was a simple list comprehension:
def way6(theList, theDict):
return [item for item in theList if item in theDict]
This runs the same process as your fastest approach, way1
, but more quickly. For comparison, the quickest set
-based way was
def way5(theList, theDict):
return list(set(theList).intersection(theDict))
timeit
results:
>>> import timeit
>>> setup = """from __main__ import way1, way5, way6
from random import randint
length = 100000
listOfRandomInts = [randint(0,length*length/10-1) for x in range(length)]
dictionaryOfRandomInts = {randint(0,length*length/10-1): "It's here" for x in range(length)}
"""
>>> timeit.timeit('way1(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
14.550477756582723
>>> timeit.timeit('way5(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
19.597916393388232
>>> timeit.timeit('way6(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
13.652289059326904
Having added @abarnert's suggestion:
def way7(theList, theDict):
return list(theDict.viewkeys() & theList)
and re-run the timing I now get:
>>> timeit.timeit('way1(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
13.110055883138497
>>> timeit.timeit('way5(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
17.292466681101036
>>> timeit.timeit('way6(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
14.351759544463917
>>> timeit.timeit('way7(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
17.206370930653392
way1
and way6
have switched places, so I re-ran again:
>>> timeit.timeit('way1(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
13.648176054011941
>>> timeit.timeit('way6(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
13.847062579316628
So it looks like the set approach is slower than the list, but the difference between the list and list comprehension is (surprisingly, to me at least) is a bit variable. I'd say just pick one, and not worry about it unless it becomes a real bottleneck later.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With