I have a (potentially quite big) dictionary and a list of 'possible' keys. I want to quickly find which of the keys have matching values in the dictionary. I've found lots of discussion of single dictionary values here and here, but no discussion of speed or multiple entries. I've come up with four ways, and for the three that work best I compare their speed on different sample sizes below - are there better methods? If people can suggest sensible contenders I'll subject them to the analysis below as well. Sample lists and dictionaries are created as follows: <pre class="prettyprint"><code>import cProfile from random import randint length = 100000 listOfRandomInts = [randint(0,length*length/10-1) for x in range(length)] dictionaryOfRandomInts = {randint(0,length*length/10-1): "It's here" for x in range(length)} </code></pre> Method 1: the <code>'in'</code> keyword: <pre class="prettyprint"><code>def way1(theList,theDict): resultsList = [] for listItem in theList: if listItem in theDict: resultsList.append(theDict[listItem]) return resultsList cProfile.run('way1(listOfRandomInts,dictionaryOfRandomInts)') </code></pre> 32 function calls in 0.018 seconds Method 2: error handling: <pre class="prettyprint"><code>def way2(theList,theDict): resultsList = [] for listItem in theList: try: resultsList.append(theDict[listItem]) except: ; return resultsList cProfile.run('way2(listOfRandomInts,dictionaryOfRandomInts)') </code></pre> 32 function calls in 0.087 seconds Method 3: set intersection: <pre class="prettyprint"><code>def way3(theList,theDict): return list(set(theList).intersection(set(theDict.keys()))) cProfile.run('way3(listOfRandomInts,dictionaryOfRandomInts)') </code></pre> 26 function calls in 0.046 seconds Method 4: Naive use of <code>dict.keys()</code>: This is a cautionary tale - it was my first attempt and BY FAR the slowest! <pre class="prettyprint"><code>def way4(theList,theDict): resultsList = [] keys = theDict.keys() for listItem in theList: if listItem in keys: resultsList.append(theDict[listItem]) return resultsList cProfile.run('way4(listOfRandomInts,dictionaryOfRandomInts)') </code></pre> 12 function calls in 248.552 seconds EDIT: Bringing the suggestions given in the answers into the same framework that I've used for consistency. Many have noted that more performance gains can be achieved in Python 3.x, particularly list comprehension-based methods. Many thanks for all of the help! Method 5: Better way of performing intersection (thanks jonrsharpe): <pre class="prettyprint"><code>def way5(theList, theDict): return = list(set(theList).intersection(theDict)) </code></pre> 25 function calls in 0.037 seconds Method 6: List comprehension (thanks jonrsharpe): <pre class="prettyprint"><code>def way6(theList, theDict): return [item for item in theList if item in theDict] </code></pre> 24 function calls in 0.020 seconds Method 7: Using the <code>&</code> keyword (thanks jonrsharpe): <pre class="prettyprint"><code>def way7(theList, theDict): return list(theDict.viewkeys() & theList) </code></pre> 25 function calls in 0.026 seconds For methods 1-3 and 5-7 I timed them as above with list/dictionaries of length 1000, 10000, 100000, 1000000, 10000000 and 100000000 and show a log-log plot of time taken. Across all lengths the intersection and in-statement method perform better. The gradients are all about 1 (maybe a bit higher), indicating O(n) or perhaps slightly super-linear scaling. <img src="https://i.stack.imgur.com/sG5iU.png" alt="Log-Log plot comparing time-scaling of the 6 sensible methods with list/dict length">

Of a couple of additional methods I've tried, the fastest was a simple list comprehension: <pre class="prettyprint"><code>def way6(theList, theDict): return [item for item in theList if item in theDict] </code></pre> This runs the same process as your fastest approach, <code>way1</code>, but more quickly. For comparison, the quickest <code>set</code>-based way was <pre class="prettyprint"><code>def way5(theList, theDict): return list(set(theList).intersection(theDict)) </code></pre> <code>timeit</code> results: <pre class="prettyprint"><code>>>> import timeit >>> setup = """from __main__ import way1, way5, way6 from random import randint length = 100000 listOfRandomInts = [randint(0,length*length/10-1) for x in range(length)] dictionaryOfRandomInts = {randint(0,length*length/10-1): "It's here" for x in range(length)} """ >>> timeit.timeit('way1(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000) 14.550477756582723 >>> timeit.timeit('way5(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000) 19.597916393388232 >>> timeit.timeit('way6(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000) 13.652289059326904 </code></pre> <hr> Having added @abarnert's suggestion: <pre class="prettyprint"><code>def way7(theList, theDict): return list(theDict.viewkeys() & theList) </code></pre> and re-run the timing I now get: <pre class="prettyprint"><code>>>> timeit.timeit('way1(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000) 13.110055883138497 >>> timeit.timeit('way5(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000) 17.292466681101036 >>> timeit.timeit('way6(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000) 14.351759544463917 >>> timeit.timeit('way7(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000) 17.206370930653392 </code></pre> <code>way1</code> and <code>way6</code> have switched places, so I re-ran again: <pre class="prettyprint"><code>>>> timeit.timeit('way1(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000) 13.648176054011941 >>> timeit.timeit('way6(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000) 13.847062579316628 </code></pre> So it looks like the set approach is slower than the list, but the difference between the list and list comprehension is (surprisingly, to me at least) is a bit variable. I'd say just pick one, and not worry about it unless it becomes a real bottleneck later.

Finding all keys in a dictionary from a given list QUICKLY

Tags:

I have a (potentially quite big) dictionary and a list of 'possible' keys. I want to quickly find which of the keys have matching values in the dictionary. I've found lots of discussion of single dictionary values here and here, but no discussion of speed or multiple entries.

I've come up with four ways, and for the three that work best I compare their speed on different sample sizes below - are there better methods? If people can suggest sensible contenders I'll subject them to the analysis below as well.

Sample lists and dictionaries are created as follows:

import cProfile
from random import randint

length = 100000

listOfRandomInts = [randint(0,length*length/10-1) for x in range(length)]
dictionaryOfRandomInts = {randint(0,length*length/10-1): "It's here" for x in range(length)}

Method 1: the 'in' keyword:

def way1(theList,theDict):
    resultsList = []
    for listItem in theList:
        if listItem in theDict:
            resultsList.append(theDict[listItem])
    return resultsList

cProfile.run('way1(listOfRandomInts,dictionaryOfRandomInts)')

32 function calls in 0.018 seconds

Method 2: error handling:

def way2(theList,theDict):
    resultsList = []
    for listItem in theList:
        try:
            resultsList.append(theDict[listItem])
        except:
            ;
    return resultsList

cProfile.run('way2(listOfRandomInts,dictionaryOfRandomInts)')

32 function calls in 0.087 seconds

Method 3: set intersection:

def way3(theList,theDict):
    return list(set(theList).intersection(set(theDict.keys())))

cProfile.run('way3(listOfRandomInts,dictionaryOfRandomInts)')

26 function calls in 0.046 seconds

Method 4: Naive use of dict.keys():

This is a cautionary tale - it was my first attempt and BY FAR the slowest!

def way4(theList,theDict):
    resultsList = []
    keys = theDict.keys()
    for listItem in theList:
        if listItem in keys:
            resultsList.append(theDict[listItem])
    return resultsList

cProfile.run('way4(listOfRandomInts,dictionaryOfRandomInts)')

12 function calls in 248.552 seconds

EDIT: Bringing the suggestions given in the answers into the same framework that I've used for consistency. Many have noted that more performance gains can be achieved in Python 3.x, particularly list comprehension-based methods. Many thanks for all of the help!

Method 5: Better way of performing intersection (thanks jonrsharpe):

def way5(theList, theDict):
    return = list(set(theList).intersection(theDict))

25 function calls in 0.037 seconds

Method 6: List comprehension (thanks jonrsharpe):

def way6(theList, theDict):
    return [item for item in theList if item in theDict]

24 function calls in 0.020 seconds

Method 7: Using the & keyword (thanks jonrsharpe):

def way7(theList, theDict):
    return list(theDict.viewkeys() & theList)

25 function calls in 0.026 seconds

For methods 1-3 and 5-7 I timed them as above with list/dictionaries of length 1000, 10000, 100000, 1000000, 10000000 and 100000000 and show a log-log plot of time taken. Across all lengths the intersection and in-statement method perform better. The gradients are all about 1 (maybe a bit higher), indicating O(n) or perhaps slightly super-linear scaling.

Log-Log plot comparing time-scaling of the 6 sensible methods with list/dict length

871

asked Apr 30 '15 10:04

StackG

1 Answers

Of a couple of additional methods I've tried, the fastest was a simple list comprehension:

def way6(theList, theDict):
    return [item for item in theList if item in theDict]

This runs the same process as your fastest approach, way1, but more quickly. For comparison, the quickest set-based way was

def way5(theList, theDict):
    return list(set(theList).intersection(theDict))

timeit results:

>>> import timeit
>>> setup = """from __main__ import way1, way5, way6
from random import randint
length = 100000
listOfRandomInts = [randint(0,length*length/10-1) for x in range(length)]
dictionaryOfRandomInts = {randint(0,length*length/10-1): "It's here" for x in range(length)}
"""
>>> timeit.timeit('way1(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
14.550477756582723
>>> timeit.timeit('way5(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
19.597916393388232
>>> timeit.timeit('way6(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
13.652289059326904

Having added @abarnert's suggestion:

def way7(theList, theDict):
    return list(theDict.viewkeys() & theList)

and re-run the timing I now get:

>>> timeit.timeit('way1(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
13.110055883138497
>>> timeit.timeit('way5(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
17.292466681101036
>>> timeit.timeit('way6(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
14.351759544463917
>>> timeit.timeit('way7(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
17.206370930653392

way1 and way6 have switched places, so I re-ran again:

>>> timeit.timeit('way1(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
13.648176054011941
>>> timeit.timeit('way6(listOfRandomInts,dictionaryOfRandomInts)', setup=setup, number=1000)
13.847062579316628

So it looks like the set approach is slower than the list, but the difference between the list and list comprehension is (surprisingly, to me at least) is a bit variable. I'd say just pick one, and not worry about it unless it becomes a real bottleneck later.

192

answered Oct 05 '22 10:10

jonrsharpe

Related questions
                            
                                System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown in angularjs
                            
                                BASH: print output on one line
                            
                                What could be the reason for such kind of Azure Web Site hangs?
                            
                                How app can access files on USB OTG storages in Android 6.0 (API level 23) without root?
                            
                                Is it possible to make Spring @Import or @Configuration parametrized?
                            
                                What to do if a posix close call fails?
                            
                                Nodejs throw exception
                            
                                Flexbox sizing doesn't work correctly when using FontAwesome
                            
                                What is the best way to call into Swift from C?
                            
                                How do you activate an Anaconda environment within a Python Script?
                            
                                RxSwift - Misundestanding some concept
                            
                                r function unzip error 1 in extracting from zip file

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With