Efficiently check if an element occurs at least n times in a list

Tags: performance, python, list, optimization, python-3.x

What is the best way to write a Python function (check_list) that efficiently tests whether an element (x) occurs at least n times in a list (l)?

My first thought was:

def check_list(l, x, n):
    return l.count(x) >= n

But this doesn't short-circuit once x has been found n times, so it always scans the entire list: O(len(l)).

A simple approach that does short-circuit would be:

def check_list(l, x, n):
    count = 0
    for item in l:
        if item == x:
            count += 1
            if count == n:
                return True
    return False
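
To make the difference visible, here is a minimal sketch that counts how many equality comparisons each version performs. CountingValue, check_list_count, check_list_loop and comparisons_used are hypothetical names introduced only for this illustration; they are not part of the question.

```python
class CountingValue:
    """Hypothetical wrapper that counts how often == is evaluated."""
    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __eq__(self, other):
        CountingValue.comparisons += 1
        return self.value == other


def check_list_count(l, x, n):
    # Always walks the whole list.
    return l.count(x) >= n


def check_list_loop(l, x, n):
    # Stops as soon as the n-th match is seen.
    count = 0
    for item in l:
        if item == x:
            count += 1
            if count == n:
                return True
    return False


def comparisons_used(func, l, x, n):
    """Return (result, number of == calls) for one invocation of func."""
    CountingValue.comparisons = 0
    result = func(l, x, n)
    return result, CountingValue.comparisons


# Two matches at the front, followed by 998 non-matching items.
data = [CountingValue(v) for v in [3, 3] + [0] * 998]
count_result = comparisons_used(check_list_count, data, 3, 2)
loop_result = comparisons_used(check_list_loop, data, 3, 2)
print("count:", count_result)
print("loop: ", loop_result)
```

With both matches at the very start of the list, the loop version should stop after the second comparison, while the count version compares every item.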

I also have a more compact short-circuiting solution with a generator:

def check_list(l, x, n):
    gen = (1 for item in l if item == x)
    return all(next(gen,0) for i in range(n))

Are there other good solutions? What is the most efficient approach?

Thank you

217

asked Oct 31 '16 21:10

Chris_Rands

3 Answers

Instead of incurring the extra overhead of setting up a range object and using all, which has to test the truthiness of each item, you could use itertools.islice to skip past the first n-1 matches from the generator, then return the next item (the nth match) if it exists, or a default False if not:

from itertools import islice

def check_list(lst, x, n):
    gen = (True for i in lst if i==x)
    return next(islice(gen, n-1, None), False)

Note that, like list.count, itertools.islice runs at C speed. This approach also has the extra advantage of handling iterables that are not lists.
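
As a rough sketch of that last point, the same idea can be applied to strings and one-shot iterators. The name at_least and the n <= 0 guard are assumptions added for this illustration (islice raises ValueError for a negative start, which is what n - 1 becomes when n is 0).

```python
from itertools import islice


def at_least(iterable, x, n):
    """Return True if x occurs at least n times in iterable (hypothetical helper)."""
    if n <= 0:
        # "At least 0 times" is trivially true, and islice rejects a negative start.
        return True
    matches = (True for item in iterable if item == x)
    # Skip n-1 matches; the next one, if any, is the n-th match.
    return next(islice(matches, n - 1, None), False)


print(at_least("mississippi", "s", 4))   # True: there are four "s" characters
print(at_least("mississippi", "s", 5))   # False

# With a one-shot iterator, consumption stops right after the n-th match.
it = iter([1, 1, 1, 2, 2])
print(at_least(it, 1, 2))                # True
print(list(it))                          # [1, 2, 2] is left unconsumed
```

The last two lines show the short-circuit directly: everything after the second match is still available in the iterator.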

Some timing:

In [1]: from itertools import islice

In [2]: from random import randrange

In [3]: lst = [randrange(1,10) for i in range(100000)]

In [5]: %%timeit # using list.index
   ....: check_list(lst, 5, 1000)
   ....:
1000 loops, best of 3: 736 µs per loop

In [7]: %%timeit # islice
   ....: check_list(lst, 5, 1000)
   ....:
1000 loops, best of 3: 662 µs per loop

In [9]: %%timeit # using list.index
   ....: check_list(lst, 5, 10000)
   ....:
100 loops, best of 3: 7.6 ms per loop

In [11]: %%timeit # islice
   ....: check_list(lst, 5, 10000)
   ....:
100 loops, best of 3: 6.7 ms per loop
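
For readers without IPython, a comparable measurement can be sketched with the standard timeit module. The function names check_list_index and check_list_islice are assumptions chosen here to keep the two variants apart, and the numbers will differ by machine and Python version, so none are quoted.

```python
from itertools import islice
from random import randrange
from timeit import timeit


def check_list_index(lst, x, n):
    # Variant based on repeated list.index calls.
    i = 0
    try:
        for _ in range(n):
            i = lst.index(x, i) + 1
        return True
    except ValueError:
        return False


def check_list_islice(lst, x, n):
    # Variant based on a generator plus itertools.islice.
    gen = (True for i in lst if i == x)
    return next(islice(gen, n - 1, None), False)


lst = [randrange(1, 10) for _ in range(100000)]
runs = 20  # arbitrary repeat count for this sketch

for n in (1000, 10000):
    # Both variants must agree before their timings are worth comparing.
    assert check_list_index(lst, 5, n) == check_list_islice(lst, 5, n)
    for func in (check_list_index, check_list_islice):
        seconds = timeit(lambda: func(lst, 5, n), number=runs)
        print(f"{func.__name__}, n={n}: {seconds / runs * 1e6:.0f} µs per call")
```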

122

answered Oct 19 '22 19:10

Moses Koledoye

You could use the second argument of index to find the subsequent indices of occurrences:

def check_list(l, x, n):
    i = 0
    try:
        for _ in range(n):
            i = l.index(x, i)+1
        return True
    except ValueError:
        return False

print( check_list([1,3,2,3,4,0,8,3,7,3,1,1,0], 3, 4) )

About `index` arguments

The official Python Tutorial (section 5) does not mention the method's second or third argument, but you can find them in the more comprehensive Python Standard Library reference, section 4.6:

s.index(x[, i[, j]]) index of the first occurrence of x in s (at or after index i and before index j) ⁽⁸⁾

⁽⁸⁾ index raises ValueError when x is not found in s. When supported, the additional arguments to the index method allow efficient searching of subsections of the sequence. Passing the extra arguments is roughly equivalent to using s[i:j].index(x), only without copying any data and with the returned index being relative to the start of the sequence rather than the start of the slice.
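
A small sketch of what the quoted passage describes, using the list from the example above; the variable names are chosen only for illustration.

```python
s = [1, 3, 2, 3, 4, 0, 8, 3, 7, 3, 1, 1, 0]

first = s.index(3)               # search the whole list -> 1
second = s.index(3, first + 1)   # search from index 2 onwards -> 3
bounded = s.index(3, 4, 9)       # search only indices 4..8 -> 7

# Roughly equivalent slice-based form: it copies the data, and the
# result is relative to the slice, so the offset has to be added back.
via_slice = s[4:9].index(3) + 4  # -> 7

# When x is not in the requested range, ValueError is raised.
try:
    s.index(3, 4, 7)
    found_in_range = True
except ValueError:
    found_in_range = False

print(first, second, bounded, via_slice, found_in_range)
```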

Performance Comparison

When comparing this list.index method with the islice(gen) method, the most important factor is the distance between the occurrences to be found. Once that distance averages 13 or more, list.index performs better. For smaller distances, the fastest method also depends on the number of occurrences to find. The more occurrences there are to find, the sooner the islice(gen) method outperforms list.index in terms of average distance; this gain fades out when the number of occurrences becomes really large.

The following graph shows the (approximate) border line at which both methods perform equally well (the X-axis is logarithmic):

[Graph: approximate border line between list.index and islice(gen); image: https://i.stack.imgur.com/36PZO.png]
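
One way to probe this on your own machine is to build lists with a fixed distance between occurrences and time both methods. This is only a sketch: make_list, via_index and via_islice are hypothetical names, the chosen gaps and repeat count are arbitrary, and the crossover you observe may not match the figure of 13 reported above.

```python
from itertools import islice
from timeit import timeit


def via_index(lst, x, n):
    i = 0
    try:
        for _ in range(n):
            i = lst.index(x, i) + 1
        return True
    except ValueError:
        return False


def via_islice(lst, x, n):
    gen = (True for item in lst if item == x)
    return next(islice(gen, n - 1, None), False)


def make_list(gap, occurrences):
    """Build a list in which 1 appears exactly once in every run of `gap` items."""
    return ([0] * (gap - 1) + [1]) * occurrences


occurrences = 1000
for gap in (2, 5, 13, 50):
    lst = make_list(gap, occurrences)
    t_index = timeit(lambda: via_index(lst, 1, occurrences), number=50)
    t_islice = timeit(lambda: via_islice(lst, 1, occurrences), number=50)
    faster = "list.index" if t_index < t_islice else "islice(gen)"
    print(f"gap={gap:>2}: index={t_index:.4f}s islice={t_islice:.4f}s -> {faster}")
```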

answered Oct 19 '22 20:10

trincot

Ultimately, short-circuiting is the way to go if you expect that a significant number of cases will lead to early termination. Let's explore the possibilities:

Take the case of the list.index method versus the list.count method (these were the two fastest according to my testing, although ymmv).

With list.index, if the list contains n or more occurrences of x, the method is called n times. While inside the list.index method, execution is very fast, allowing for much faster iteration than the custom generator. If the occurrences of x are far enough apart, a large speedup will be seen from the lower-level execution of index. If instances of x are close together (shorter list / more common x's), much more of the time will be spent executing the slower Python code that mediates the rest of the function (looping over n and incrementing i).

The benefit of list.count is that it does all of the heavy lifting outside of slow Python execution. It is a much easier function to analyse, since its running time is simply linear in the length of the list. However, because it spends almost none of its time in the Python interpreter, it is almost guaranteed to be faster for short lists.

Summary of selection criteria:

shorter lists favor list.count
lists of any length that don't have a high probability to short circuit favor list.count
lists that are long and likely to short circuit favor list.index
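
These criteria could be folded into a single function, roughly as sketched below. The short_len threshold of 1000 and the expect_early_exit flag are placeholders invented for this illustration, not measured values; you would need to tune them with your own timings.

```python
def check_list(l, x, n, short_len=1000, expect_early_exit=True):
    """Choose between list.count and list.index using rough heuristics.

    short_len and expect_early_exit are hypothetical tuning knobs.
    """
    if n <= 0:
        return True
    if len(l) < short_len or not expect_early_exit:
        # Short lists, or lists unlikely to short-circuit: one C-level pass.
        return l.count(x) >= n
    # Long lists that are likely to short-circuit: hop from match to match.
    i = 0
    try:
        for _ in range(n):
            i = l.index(x, i) + 1
        return True
    except ValueError:
        return False


print(check_list([1, 3, 2, 3, 4, 0, 8, 3, 7, 3, 1, 1, 0], 3, 4))
```

Both branches return the same answer for any input; only the running time is expected to differ.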

answered Oct 19 '22 19:10

Aaron
