Fastest possible way to iterate through a specific list?

Let's say I have a list:

list=['plu;ean;price;quantity','plu1;ean1;price1;quantity1']

I want to iterate over the list, split each item on ";", and apply an if clause, like this:

for item in list:
    split_item=item.split(";")
    if split_item[0] == "string_value" or split_item[1] == "string_value":
        do something.....

I was wondering, if this is the fastest way possible? Let's say my initial list is a lot bigger (has a lot more list items). I tried with list comprehensions:

item=[item.split(";") for item in list if item.split(";")[0] == "string_value" or item.split(";")[1] == "string_value"]

But this is actually giving me slower results. The first case is giving me an average of 90ms, while the second one is giving me an average of 130ms. Am I doing the list comprehension wrong? Is there a faster solution?

301

asked Sep 25 '13 18:09

sale1902

3 Answers

I was wondering, if this is the fastest way possible?

No, of course not. You can implement it a lot faster in hand-coded assembly than in Python. So what?

If the "do something..." is not trivial, and there are many matches, the cost to do something 100000 times is going to be a lot more expensive than the cost of looping 500000 times, so finding the fastest way to loop doesn't matter at all.

In fact, just calling split two or three times per loop instead of remembering and reusing the result is going to swamp the cost of iteration, and not passing a maxsplit argument when you only care about the first two fields may well do the same.
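As a rough illustration of those two points, here is a minimal sketch that splits each item once and caps the split with maxsplit. The sample data and the names `items`, `target`, and `matches` are assumptions made up for the example, not part of the question.

```python
# Illustrative data; `items`, `target`, and `matches` are assumed names.
items = ["plu;ean;price;quantity", "plu1;ean1;price1;quantity1"]
target = "ean1"

matches = []
for item in items:
    # Split once and reuse the result. maxsplit=2 stops after the first two
    # separators, so everything after the second field stays one string.
    fields = item.split(";", 2)
    if fields[0] == target or (len(fields) > 1 and fields[1] == target):
        matches.append(item)

print(matches)
```

The `len(fields) > 1` guard is only there so that an item without any ";" does not raise an IndexError.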

So, you're trying to optimize the wrong thing. But what if, after you fix everything else, it turns out that the cost of iteration really does matter here?

Well, you can't use a comprehension directly to speed things up, because comprehensions are for expressions that return values, not statements to do things.

But, if you look at your code, you'll realize you're actually doing three things: splitting each string, then filtering out the ones that don't match, then doing the "do something". So, you can use a comprehension for the first two parts, and then you're only using a slow for loop for the much smaller list of values that passed the filter.

It looks like you tried this, but you made two mistakes.

First, you're better off with a generator expression than a list comprehension: you don't need a list here, just something to iterate over, so don't pay to build one.

Second, you don't want to split the string three times. You can probably find some convoluted way to get the split done once in a single comprehension, but why bother? Just write each step as its own step.

So:

split_items = (item.split(';') for item in items)
filtered_items = (item for item in split_items 
                  if item[0] == "string_value" or item[1] == "string_value")
for item in filtered_items:
    do something...

Will this actually be faster? If you can get some real test data, and "do something..." code, that shows that the iteration is a bottleneck, you can test on that real data and code. Until then, there's nothing to test.
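If you do get to the point of measuring, a sketch like the following is one way to compare the plain loop against the generator pipeline with the standard timeit module. The data is a synthetic stand-in and the counter is only a placeholder for "do something", so the numbers it prints say nothing about your real workload.

```python
import timeit

# Synthetic stand-in data; real measurements need the real list and real work.
items = ["plu;ean;price;quantity", "plu1;ean1;price1;quantity1"] * 1000
target = "ean1"

def loop_version():
    count = 0
    for item in items:
        fields = item.split(";")
        if fields[0] == target or fields[1] == target:
            count += 1  # placeholder for "do something"
    return count

def generator_version():
    split_items = (item.split(";") for item in items)
    filtered = (f for f in split_items if f[0] == target or f[1] == target)
    count = 0
    for _ in filtered:
        count += 1  # placeholder for "do something"
    return count

# Both versions must agree before their timings are worth comparing.
assert loop_version() == generator_version() == 1000

for fn in (loop_version, generator_version):
    best = min(timeit.repeat(fn, number=20, repeat=3))
    print(f"{fn.__name__}: {best:.4f} s for 20 runs")
```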

142

answered Oct 12 '22 22:10

abarnert

Split the whole string only when the first two items retrieved from str.split(';', 2) satisfy the conditions:

>>> strs = 'plu;ean;price;quantity'
>>> strs.split(';', 2)
['plu', 'ean', 'price;quantity']

Here, the third item ('price;quantity') is split further only if the first two items satisfy the condition:

>>> lis = ['plu;ean;price;quantity'*1000, 'plu1;ean1;price1;quantity1'*1000]*1000

Normal for-loop, single split of whole string for each item of the list.

>>> %%timeit
for item in lis:
    split_item=item.split(";")
    if split_item[0] == "plu" or split_item[1] == "ean":pass
... 
1 loops, best of 3: 952 ms per loop

List comprehension equivalent to the for-loop above:

>>> %timeit [x for x in (item.split(';') for item in lis) if x[0]== "plu" or x[1]=="ean"]
1 loops, best of 3: 961 ms per loop

Split on-demand:

>>> %timeit [[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== "plu" or y=="ean"]
1 loops, best of 3: 508 ms per loop

Of course, if the list and strings are small then such optimisation doesn't matter.
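For readability, the split-on-demand idea can also be written as a small generator function. This is only a sketch: `matching_rows` and its parameters are hypothetical names, and the length checks are an added assumption so that rows with fewer than three fields do not break the unpacking.

```python
def matching_rows(rows, first, second, sep=";"):
    """Yield fully split rows whose first field equals `first` or whose
    second field equals `second`. (Hypothetical helper, not from the answer.)"""
    for row in rows:
        head = row.split(sep, 2)  # at most three pieces: x, y, rest
        if head[0] == first or (len(head) > 1 and head[1] == second):
            # Only matching rows pay for splitting the remainder.
            yield head[:2] + (head[2].split(sep) if len(head) > 2 else [])

rows = ["plu;ean;price;quantity", "plu1;ean1;price1;quantity1", "plu"]
result = list(matching_rows(rows, "plu", "ean"))
print(result)
```

Because it is a generator, nothing is split until the caller iterates, so it can feed a for loop directly without building an intermediate list.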

answered Oct 12 '22 23:10

Ashwini Chaudhary

EDIT: It turns out that the Regex cache was being a bit unfair to the competition. My bad. Regex is only a small percentage faster.

If you're looking for speed, hcwhsa's answer should be good enough. If you need slightly more, look to re.

import re
from itertools import chain

lis = ['plu;ean;price;quantity'*1000, 'plu1;ean1;price1;quantity1'*100]*1000

matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match
[l.split(';') for l in lis if matcher(l)]
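A quick sanity check, as a sketch, that the pattern above selects the same lines as the split-based test from the other answers. The `split_predicate` helper and the sample strings are made up for illustration; the samples contain no newlines, since `$` would also match just before a trailing newline.

```python
import re

matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match

def split_predicate(line):
    # Reference behaviour: first field is 'plu' or second field is 'ean'.
    fields = line.split(';', 2)
    return fields[0] == 'plu' or (len(fields) > 1 and fields[1] == 'ean')

samples = [
    'plu;ean;price;quantity',   # both fields match
    'plu;x;y',                  # first field matches
    'a;ean;b',                  # second field matches
    'a;ean',                    # second field matches at end of string
    'plu',                      # single field
    'plu1;ean1;price1',         # prefixes only, no match
    'a;b;ean',                  # 'ean' in the third field, no match
    '',                         # empty string
]

for s in samples:
    assert bool(matcher(s)) == split_predicate(s), s
print("regex and split-based filter agree on all samples")
```

The `(?:;|$)` groups are what stop 'plu1' or 'ean1' from matching as prefixes.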

Timings, for mostly positive results (i.e. split is the major cause of slowness):

SETUP="
import re
from itertools import chain
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match

lis = ['plu1;ean1;price1;quantity1'+chr(i) for i in range(10000)] + ['plu;ean;price;quantity' for i in range(10000)]
"

python -m timeit -s "$SETUP" "[[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean']"
python -m timeit -s "$SETUP" "[l.split(';') for l in lis if matcher(l)]"

We see mine's a little faster.

10 loops, best of 3: 55 msec per loop
10 loops, best of 3: 49.5 msec per loop

For mostly negative results (most things are filtered):

SETUP="
import re
from itertools import chain
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match

lis = ['plu1;ean1;price1;quantity1'+chr(i) for i in range(1000)] + ['plu;ean;price;quantity' for i in range(10000)]
"

python -m timeit -s "$SETUP" "[[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean']"
python -m timeit -s "$SETUP" "[l.split(';') for l in lis if matcher(l)]"

The lead's a touch higher.

10 loops, best of 3: 40.9 msec per loop
10 loops, best of 3: 35.7 msec per loop

If there will only ever be one matching item, you can stop at the first match. Use

next([x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean')

or the faster Regex version

next(filter(matcher, lis)).split(';')

(use itertools.ifilter on Python 2).
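One caveat: next raises StopIteration when nothing matches. A minimal sketch of a safer variant passes a default to next; `first_match` is a hypothetical helper name, and this assumes Python 3, where filter is lazy.

```python
import re

matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match

def first_match(lines):
    """Return the split fields of the first matching line, or None.
    (Hypothetical helper; assumes Python 3, where filter is lazy.)"""
    line = next(filter(matcher, lines), None)  # default avoids StopIteration
    return None if line is None else line.split(';')

print(first_match(['plu1;ean1;price1', 'plu;ean;price;quantity']))
print(first_match(['plu1;ean1;price1']))
```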

Timings:

SETUP="
import re
from itertools import chain
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match

lis = ['plu1;ean1;price1;quantity1'+chr(i) for i in range(10000)] + ['plu;ean;price;quantity'] + ['plu1;ean1;price1;quantity1'+chr(i) for i in range(10000)]
"

python -m timeit -s "$SETUP" "[[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean']"
python -m timeit -s "$SETUP" "next([x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean')"

python -m timeit -s "$SETUP" "[l.split(';') for l in lis if matcher(l)]"
python -m timeit -s "$SETUP" "next(filter(matcher, lis)).split(';')"

Results:

10 loops, best of 3: 31.3 msec per loop
100 loops, best of 3: 15.2 msec per loop
10 loops, best of 3: 28.8 msec per loop
100 loops, best of 3: 14.1 msec per loop

So this gives a substantial boost to both methods.

answered Oct 12 '22 22:10

Veedrac

Tags: performance, python, list, list-comprehension