Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest possible way to iterate through a specific list?

Let's say I have a list:

list=['plu;ean;price;quantity','plu1;ean1;price1;quantity1']

I want to iterate over the list + split the list by ";" and put an if clause, like this:

for item in list:
    split_item=item.split(";")
    if split_item[0] == "string_value" or split_item[1] == "string_value":
        do something.....

I was wondering, if this is the fastest way possible? Let's say my initial list is a lot bigger (has a lot more list items). I tried with list comprehensions:

item=[item.split(";") for item in list if item.split(";")[0] == "string_value" or item.split(";")[1] == "string_value"]

But this is actually giving me slower results. The first case is giving me an average of 90ms, while the second one is giving me an average of 130ms. Am I doing the list comprehension wrong? Is there a faster solution?

like image 301
sale1902 Avatar asked Sep 25 '13 18:09

sale1902


People also ask

What is the best way to iterate a list?

The best way to iterate the list in terms of performance would be to use iterators ( your second approach using foreach ).

How do I iterate faster in Python?

A faster way to loop using built-in functions A faster way to loop in Python is using built-in functions. In our example, we could replace the for loop with the sum function. This function will sum the values inside the range of numbers. The code above takes 0.84 seconds.

Is it faster to iterate through list or set Python?

Iterating over a List is much much faster than iterating over a set. The currently accepted answer is using a very small set and list and hence, the difference is negligible there.


3 Answers

I was wondering, if this is the fastest way possible?

No, of course not. You can implement it a lot faster in hand-coded assembly than in Python. So what?

If the "do something..." is not trivial, and there are many matches, the cost to do something 100000 times is going to be a lot more expensive than the cost of looping 500000 times, so finding the fastest way to loop doesn't matter at all.

In fact, just calling split two to three each loop instead of remembering and reusing the result is going to swamp the cost of iteration, and not passing a maxsplit argument when you only care about two results may as well.


So, you're trying to optimize the wrong thing. But what if, after you fix everything else, it turns out that the cost of iteration really does matter here?

Well, you can't use a comprehension directly to speed things up, because comprehensions are for expressions that return values, not statements to do things.

But, if you look at your code, you'll realize you're actually doing three things: splitting each string, then filtering out the ones that don't match, then doing the "do something". So, you can use a comprehension for the first two parts, and then you're only using a slow for loop for the much smaller list of values that passed the filter.

It looks like you tried this, but you made two mistakes.

First, you're better off with a generator expression than a list comprehension—you don't need a list here, just something to iterator over, so don't pay to build one.

Second, you don't want to split the string three times. You can probably find some convoluted way to get the split done once in a single comprehension, but why bother? Just write each step as its own step.

So:

split_items = (item.split(';') for item in items)
filtered_items = (item for item in split_items 
                  if item[0] == "string_value" or item[1] == "string_value")
for item in filtered_items:
    do something...

Will this actually be faster? If you can get some real test data, and "do something..." code, that shows that the iteration is a bottleneck, you can test on that real data and code. Until then, there's nothing to test.

like image 142
abarnert Avatar answered Oct 12 '22 22:10

abarnert


Split the whole string only when the first two items retrieved from str.split(';', 2) satisfy the conditions:

>>> strs = 'plu;ean;price;quantity'
>>> strs.split(';', 2)
['plu', 'ean', 'price;quantity']

Here split the third item('price;quantity') only if the first two items have satisfied the condition:

>>> lis = ['plu;ean;price;quantity'*1000, 'plu1;ean1;price1;quantity1'*1000]*1000

Normal for-loop, single split of whole string for each item of the list.

>>> %%timeit
for item in lis:
    split_item=item.split(";")
    if split_item[0] == "plu" or split_item[1] == "ean":pass
... 
1 loops, best of 3: 952 ms per loop

List comprehension equivalent to the for-loop above:

>>> %timeit [x for x in (item.split(';') for item in lis) if x[0]== "plu" or x[1]=="ean"]
1 loops, best of 3: 961 ms per loop

Split on-demand:

>>> %timeit [[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== "plu" or y=="ean"]
1 loops, best of 3: 508 ms per loop

Of course, if the list and strings are small then such optimisation doesn't matter.

like image 31
Ashwini Chaudhary Avatar answered Oct 12 '22 23:10

Ashwini Chaudhary


EDIT: It turns out that the Regex cache was being a bit unfair to the competition. My bad. Regex is only a small percentage faster.

If you're looking for speed, hcwhsa's answer should be good enough. If you need slightly more, look to re.

import re
from itertools import chain

lis = ['plu;ean;price;quantity'*1000, 'plu1;ean1;price1;quantity1'*100]*1000

matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match
[l.split(';') for l in lis if matcher(l)]

Timings, for mostly positive results (aka. split is the major cause of slowness):

SETUP="
import re
from itertools import chain
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match

lis = ['plu1;ean1;price1;quantity1'+chr(i) for i in range(10000)] + ['plu;ean;price;quantity' for i in range(10000)]
"

python -m timeit -s "$SETUP" "[[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean']"
python -m timeit -s "$SETUP" "[l.split(';') for l in lis if matcher(l)]"

We see mine's a little faster.

10 loops, best of 3: 55 msec per loop
10 loops, best of 3: 49.5 msec per loop

For mostly negative results (most things are filtered):

SETUP="
import re
from itertools import chain
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match

lis = ['plu1;ean1;price1;quantity1'+chr(i) for i in range(1000)] + ['plu;ean;price;quantity' for i in range(10000)]
"

python -m timeit -s "$SETUP" "[[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean']"
python -m timeit -s "$SETUP" "[l.split(';') for l in lis if matcher(l)]"

The lead's a touch higher.

10 loops, best of 3: 40.9 msec per loop
10 loops, best of 3: 35.7 msec per loop

If the result will always be unique, use

next([x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean')

or the faster Regex version

next(filter(matcher, lis)).split(';')

(use itertools.ifilter on Python 2).

Timings:

SETUP="
import re
from itertools import chain
matcher = re.compile('^(?:plu(?:;|$)|[^;]*;ean(?:;|$))').match

lis = ['plu1;ean1;price1;quantity1'+chr(i) for i in range(10000)] + ['plu;ean;price;quantity'] + ['plu1;ean1;price1;quantity1'+chr(i) for i in range(10000)]
"

python -m timeit -s "$SETUP" "[[x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean']"
python -m timeit -s "$SETUP" "next([x] + [y] + z.split(';') for x, y, z in (item.split(';', 2) for item in lis) if x== 'plu' or y=='ean')"

python -m timeit -s "$SETUP" "[l.split(';') for l in lis if matcher(l)]"
python -m timeit -s "$SETUP" "next(filter(matcher, lis)).split(';')"

Results:

10 loops, best of 3: 31.3 msec per loop
100 loops, best of 3: 15.2 msec per loop
10 loops, best of 3: 28.8 msec per loop
100 loops, best of 3: 14.1 msec per loop

So this gives a substantial boost to both methods.

like image 38
Veedrac Avatar answered Oct 12 '22 22:10

Veedrac