How to check if a value is present in any of given sets

Tags:

python

set

Say I have different sets (they have to be different, I cannot join them as per the kind of data I am working with):

r = set([1,2,3])
s = set([4,5,6])
t = set([7,8,9])

What is the best way to check if a given variable is present in any of them?

I am using:

if myvar in r \
   or myvar in s \
   or myvar in t:

But I wonder if this can be reduced somehow by using set's properties such as union.

The following works, but I can't find a way to define multiple unions:

if myvar in r.union(s) \
   or myvar in t:

I am also wondering if this union will somehow affect performance, since I guess a temporary set will be created on the fly.

753

asked Jun 01 '15 10:06

fedorqui 'SO stop harming'

2 Answers

Just use any:

if any(myvar in x for x in (r, s, t)):

Set lookups are O(1) on average, so building a union just to check whether the variable is in any of the sets is unnecessary. Checking with in inside any short-circuits as soon as a match is found and does not create a new set.
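If the check is needed in several places, it can be wrapped in a small function. This is only a sketch; the name in_any is made up for illustration and is not part of the standard library:

```python
def in_any(value, *sets):
    """Return True if value is a member of at least one of the given sets."""
    # any() stops at the first set that contains value; no new set is built.
    return any(value in s for s in sets)


r = {1, 2, 3}
s = {4, 5, 6}
t = {7, 8, 9}

found = in_any(5, r, s, t)      # 5 is in s
missing = in_any(42, r, s, t)   # 42 is in none of the sets
```

Called with no sets at all, in_any returns False, because any of an empty iterable is False.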

And I am also wondering if this union will affect somehow performance

Yes, of course unioning the sets affects performance. It adds to the complexity: you create a new set every time, which is O(len(r)+len(s)+len(t)), so you can say goodbye to the real point of using sets, which is efficient lookups.

So the bottom line is that if you want to keep efficient lookups, you have to union the sets once and keep the result in memory in a new variable, then use that for the myvar lookups. The initial creation will be O(n) and lookups will be O(1) thereafter.

If you don't, every lookup first has to create the union, which is linear in the length of r+s+t -> set.union(*(r, s, t)), as opposed to at worst three constant (on average) lookups. Keeping the union around also means that any element added to or removed from r, s or t must be added to or removed from the unioned set as well.
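A rough sketch of the "union once and keep it in sync" idea could look like the following. SetGroup and its methods are hypothetical names chosen for illustration, and the sketch assumes every modification goes through the helper rather than touching the sets directly:

```python
class SetGroup:
    """Hypothetical helper: several separate sets plus a cached union."""

    def __init__(self, *sets):
        self.sets = list(sets)
        # Built once: O(total number of elements).
        self.combined = set().union(*self.sets)

    def add(self, index, value):
        self.sets[index].add(value)
        self.combined.add(value)

    def discard(self, index, value):
        self.sets[index].discard(value)
        # Drop from the cache only if no other set still holds the value.
        if not any(value in s for s in self.sets):
            self.combined.discard(value)

    def __contains__(self, value):
        # One average O(1) lookup, regardless of how many sets there are.
        return value in self.combined


group = SetGroup({1, 2, 3}, {4, 5, 6}, {7, 8, 9})
group.add(2, 10)      # add 10 to the third set
group.discard(0, 1)   # remove 1 from the first set
```

After this, 10 in group is True and 1 in group is False. The extra check in discard matters only when the sets can overlap; for strictly disjoint sets a plain discard on the cached union would do.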

Some realistic timings on moderately large sets show the difference:

In [1]: r = set(range(10000))

In [2]: s = set(range(10001,20000))

In [3]: t = set(range(20001,30000))

In [4]: timeit any(29000 in st for st in (r,s,t))
1000000 loops, best of 3: 869 ns per loop  

In [5]: timeit 29000 in r | s | t
1000 loops, best of 3: 956 µs per loop

In [6]: timeit 29000 in reduce(lambda x,y :x.union(y),[r,s,t])
1000 loops, best of 3: 961 µs per loop

In [7]: timeit 29000 in r.union(s).union(t)
1000 loops, best of 3: 953 µs per loop

Timing the union shows that pretty much all the time is spent in the union calls:

In [8]: timeit r.union(s).union(t)
1000 loops, best of 3: 952 µs per loop

Using larger sets and getting the element in the last set:

In [15]: r = set(range(1000000))

In [16]: s = set(range(1000001,2000000))

In [17]: t = set(range(2000001,3000000))


In [18]: timeit any(2999999 in st for st in (r,s,t))
1000000 loops, best of 3: 878 ns per loop

In [19]: timeit 2999999 in reduce(lambda x,y :x.union(y),[r,s,t])
1 loops, best of 3: 161 ms per loop

In [20]: timeit 2999999 in r | s | t
10 loops, best of 3: 157 ms per loop

With any, the running time stays essentially the same no matter how large the sets get, but with union the running time grows with the set sizes.

The only way to make it faster would be to stick to or, but we are talking about a difference of a few hundred nanoseconds, which is the cost of creating the generator expression and the function call:

In [22]: timeit 2999999 in r or 2999999 in s or 2999999 in t
10000000 loops, best of 3: 152 ns per loop

If you do need to union the sets, set.union(*(r, s, t)) is the fastest way, as it does not build intermediate sets:

In [47]: timeit 2999999 in set.union(*(r,s,t))
10 loops, best of 3: 108 ms per loop

In [49]: r | s | t  == set.union(*(r,s,t))
Out[49]: True
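The IPython sessions above rely on the timeit magic and on reduce being a builtin. Outside IPython, or on Python 3 where reduce lives in functools, a comparable measurement could be sketched with the standard timeit module. The set sizes mirror the first session; the dictionary names and the repetition count of 200 are arbitrary choices for this sketch, and absolute numbers will differ by machine:

```python
from functools import reduce
import timeit

r = set(range(10000))
s = set(range(10001, 20000))
t = set(range(20001, 30000))
target = 29000

# Each candidate is a zero-argument callable so timeit can call it directly.
candidates = {
    "any": lambda: any(target in st for st in (r, s, t)),
    "or": lambda: target in r or target in s or target in t,
    "| operator": lambda: target in r | s | t,
    "reduce": lambda: target in reduce(lambda x, y: x.union(y), [r, s, t]),
    "set.union": lambda: target in set.union(r, s, t),
}

# Sanity check: every variant must give the same answer.
results = {name: fn() for name, fn in candidates.items()}

repeats = 200
timings = {
    name: timeit.timeit(fn, number=repeats) / repeats
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name:12s} {seconds * 1e6:10.2f} us per call")
```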

147

answered Oct 12 '22 22:10

Padraic Cunningham

You can use builtin any:

r = set([1,2,3])
s = set([4,5,6])
t = set([7,8,9])
if any(myvar in x for x in [r, s, t]):
    print("I'm in one of them")

any short-circuits on the first condition that is True, so you avoid constructing a potentially huge union and avoid checking more sets than necessary.
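The short-circuit behaviour can be made visible by recording which sets are actually consulted. The contains function and the checked list are helpers invented for this illustration:

```python
checked = []


def contains(name, a_set, value):
    # Record that this set was actually consulted, then do the lookup.
    checked.append(name)
    return value in a_set


r = {1, 2, 3}
s = {4, 5, 6}
t = {7, 8, 9}

hit = any(contains(name, a_set, 5) for name, a_set in [("r", r), ("s", s), ("t", t)])
# checked is now ["r", "s"]: t was never consulted because s already matched.
```

This works because the argument is a generator expression, which any consumes lazily. A list comprehension in its place would evaluate all three lookups before any saw the first result.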

And I am also wondering if this union will affect somehow performance, since I guess a temporary set will be created on the fly.

According to wiki.python.org, s|t is O(len(s)+len(t)) while lookups are O(1) on average.

For n sets with l elements each, doing union iteratively to construct the set will result in:

a.union(b).union(c).union(d) .... .union(n)

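The first call, a.union(b), costs O(l+l). The second call unions a set of up to 2l elements with l more, so a.union(b).union(c) costs O(2l+(2l+l)) in total, and so on. Over all n sets this sums to O((n*(n+1)/2)*l).

That is O(n^2*l), which is quadratic in the number of sets and voids the performance advantage of using sets.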

A lookup across n sets with any is O(n) in the worst case, independent of l.
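To make the arithmetic concrete, the work done by a chained union can be counted directly. This is a simplified cost model, not a measurement: it assumes each a.union(b) call costs len(a) + len(b), and chained_union_cost is a name made up for this sketch:

```python
def chained_union_cost(sets):
    """Count elements touched when unioning left to right, one set at a time.

    Assumption: each a.union(b) call costs len(a) + len(b).
    """
    result = set(sets[0])
    cost = 0
    for other in sets[1:]:
        cost += len(result) + len(other)
        result = result.union(other)
    return cost


n, l = 10, 100
# n disjoint sets with l elements each (the worst case for the union size).
sets = [set(range(i * l, (i + 1) * l)) for i in range(n)]

union_cost = chained_union_cost(sets)      # 2l + 3l + ... + n*l
closed_form = (n * (n + 1) // 2 - 1) * l   # same sum in closed form
lookup_cost = n                            # at most one lookup per set with any()
```

With n = 10 and l = 100 the chained union touches 5400 elements under this model, while any performs at most 10 lookups.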

answered Oct 12 '22 23:10

Sebastian Wozny
