Which one of these is faster? Is one "better"? Basically I'll have two sets and I want to eventually get one match from between the two lists. So really I suppose the for loop is more like: <pre class="prettyprint"><code>for object in set: if object in other_set: return object </code></pre> Like I said - I only need one match, but I'm not sure how <code>intersection()</code> is handled, so I don't know if its any better. Also, if it helps, the <code>other_set</code> is a list near 100,000 components and the <code>set</code> is maybe a few hundred, max few thousand.

<pre class="prettyprint"><code>from timeit import timeit setup = """ from random import sample, shuffle a = range(100000) b = sample(a, 1000) a.reverse() """ forin = setup + """ def forin(): # a = set(a) for obj in b: if obj in a: return obj """ setin = setup + """ def setin(): # original method: # return tuple(set(a) & set(b))[0] # suggested in comment, doesn't change conclusion: return next(iter(set(a) & set(b))) """ print timeit("forin()", forin, number = 100) print timeit("setin()", setin, number = 100) </code></pre> Times: <pre class="prettyprint"><code>>>> 0.0929054012768 0.637904308732 >>> 0.160845057616 1.08630760484 >>> 0.322059185123 1.10931801261 >>> 0.0758695262169 1.08920981403 >>> 0.247866360526 1.07724461708 >>> 0.301856152688 1.07903130641 </code></pre> Making them into sets in the setup and running 10000 runs instead of 100 yields <pre class="prettyprint"><code>>>> 0.000413064976328 0.152831597075 >>> 0.00402408388788 1.49093627898 >>> 0.00394538156695 1.51841512101 >>> 0.00397715579584 1.52581949403 >>> 0.00421472926155 1.53156769646 </code></pre> So your version is much faster whether or not it makes sense to convert them to sets.

Speed differences between intersection() and 'object for object in set if object in other_set'

Tags:

performance

python

data-structures

intersection

set

Which one of these is faster? Is one "better"? Basically I'll have two sets and I want to eventually get one match from between the two lists. So really I suppose the for loop is more like:

for object in set:
    if object in other_set:
        return object

Like I said - I only need one match, but I'm not sure how intersection() is handled, so I don't know if its any better. Also, if it helps, the other_set is a list near 100,000 components and the set is maybe a few hundred, max few thousand.

681

asked Jul 25 '11 19:07

Jon Phenow

2 Answers

from timeit import timeit

setup = """
from random import sample, shuffle
a = range(100000)
b = sample(a, 1000)
a.reverse()
"""

forin = setup + """
def forin():
    # a = set(a)
    for obj in b:
        if obj in a:
            return obj
"""

setin = setup + """
def setin():
    # original method:
    # return tuple(set(a) & set(b))[0]
    # suggested in comment, doesn't change conclusion:
    return next(iter(set(a) & set(b)))
"""

print timeit("forin()", forin, number = 100)
print timeit("setin()", setin, number = 100)

Times:

>>>
0.0929054012768
0.637904308732
>>>
0.160845057616
1.08630760484
>>>
0.322059185123
1.10931801261
>>>
0.0758695262169
1.08920981403
>>>
0.247866360526
1.07724461708
>>>
0.301856152688
1.07903130641

Making them into sets in the setup and running 10000 runs instead of 100 yields

>>>
0.000413064976328
0.152831597075
>>>
0.00402408388788
1.49093627898
>>>
0.00394538156695
1.51841512101
>>>
0.00397715579584
1.52581949403
>>>
0.00421472926155
1.53156769646

So your version is much faster whether or not it makes sense to convert them to sets.

163

answered Nov 10 '22 04:11

agf

I realize this is a older post. But, I arrived here looking for performance speeds comparing using intersection vs in and thought it'd be worth adding more info. The answers above were great, but left me unclear as to the actual best path forward.

The "first result" solution doesn't solve for my use case specifically.

Instead, I wanted to know how the different implementations would perform, producing identical results sets, using discrete approaches. Not just the first single intersected value. As such, below I've included code to perform an evaluation of the options with a 1000 loop test. Contrary to what @agf posted, using sets is far faster when the desired output is a list of matches.

My results were:

runForin took 132851.600ms
runForinBlist took 37700.916ms
True
runForInListComp took 132783.147ms
True
runForinSet took 780.919ms
True
runSetIntersection took 760.980ms (WINNER)
True
runSetin took 771.921ms
True

Here's the code. Hope it helps someone. Note: I also evaluated the blist (http://stutzbachenterprises.com/blist/blist.html) library as it performs quite well in other use cases.

import time
from random import sample, shuffle
from blist import blist

a = range(100000)
aBlist = blist([i for i in a])

b = sample(a, 1000)
a.reverse()

def print_timing(func):
    def wrapper(*arg):
        t1 = time.time()
        res = func(*arg)
        t2 = time.time()
        print '%s took %0.3fms' % (func.func_name, (t2-t1)*1000.0)
        return res
    return wrapper


def forIn():
    ret = []
    for obj in b:
        if obj in a:
            ret.append(obj)
    return ret

def forInBlist():
    ret = []
    for obj in b:
        if obj in aBlist:
            ret.append(obj)
    return ret


def forInListComp():
    return [value for value in b if value in a] 


def forInSet():
    ret = []
    for obj in b:
        if obj in set(a):
            ret.append(obj)
    return ret


def setIntersection(): 
    return set(a).intersection(b) 


def setIn():
    return list(set(a) & set(b))


@print_timing
def runForIn(times):
    for i in range(times):
        ret = forIn()
    return ret
        
@print_timing
def runForInBlist(times):
    for i in range(times):
        ret = forInBlist()
    return ret

@print_timing
def runForInListComp(times):
    for i in range(times):
        ret = forInListComp()
    return ret

@print_timing
def runForInSet(times):
    for i in range(times):
        ret = forInSet()
    return ret

@print_timing
def runSetIntersection(times):
    for i in range(times):
        ret = setIntersection()
    return ret

@print_timing
def runSetIn(times):
    for i in range(times):
        ret = setIn()
    return ret

def checkResults(results):
    master = None
    for resultSet in results:
        if not master:
            master = sorted(list(resultSet))
            continue
        try:
            if master != sorted(list(resultSet)):
                return False, master, sorted(list(resultSet))
        except:
            print resultSet
            return False
    return True

iterations = 5
results = []
runForInResults = runForIn(iterations)
results.append(runForInResults)

runForInBlistResults = runForInBlist(iterations)
results.append(runForInBlistResults)
print checkResults(results)

runForInListCompResults = runForInListComp(iterations)
results.append(runForInListCompResults)
print checkResults(results)

runForInSetResults = runForInSet(iterations)
results.append(runForInSetResults)
print checkResults(results)

runSetIntersectionResults = runSetIntersection(iterations)
results.append(runSetIntersectionResults)
print checkResults(results)

runSetInResults = runSetIn(iterations)
results.append(runSetInResults)
print checkResults(results)

answered Nov 10 '22 06:11

Jason Wiener

Related questions
                            
                                Writing video with OpenCV + Python + Mac
                            
                                Using matplotlib slider widget to change clim in image
                            
                                Python: Problem with raw_input reading a number
                            
                                Completing object with its relations and avoiding unnecessary queries in sqlalchemy
                            
                                How to open excel file fast in Python?
                            
                                "as of" in numpy
                            
                                How do I make 'import x' return a subclass of types.ModuleType?
                            
                                pickle can't import a module that exists?
                            
                                How do I access the registers with python in gdb
                            
                                Writing RDF/XML file from rdf Triples in rdflib
                            
                                Python httplib and POST
                            
                                Django-Haystack with Solr contains search
                            
                                how to parse a list or string into chunks of fixed length
                            
                                python check if utf-8 string is uppercase
                            
                                ctypes pointer into the middle of a numpy array
                            
                                Multiple subprocesses with timeouts
                            
                                Python random seed not working with Genetic Programming example code
                            
                                Python web service with Twisted
                            
                                Import a library that imports a GPL library? [closed]
                            
                                How to test for python support in a web host

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With