Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python Performance: remove item from list [duplicate]

I have a list with lenght of: 370000. In this list i have items like: "a", "y", "Y", "q", "Q", "p", "P",, meaning this is a list of words but from time to time i get those single characters.

I want to remove those characters from the list, i am pretty new in python but the first thing that came to my mind was to do something like:

for word in words:
    if word== 'm' or  word== 'y' or word== 'Y' or word== 'p' or word== 'Q' or word== 'q' or word== 'a' or word== 'uh':
        words.remove(word)

In a list with 370.000 items this method is taking a loooot of time. Seriously, a lot.

Does anyone have another awesome idea on how to get a better performance?

Thanks in advance.

like image 600
NachoMiguel Avatar asked Oct 02 '15 23:10

NachoMiguel


3 Answers

Tried some bogo-benchmarks in IPython.

import random
# Don't know how to generate your words, use integers as substitute.
words = [random.randint(0, 25) for i in xrange(370000)]
badlist = range(7)
badtuple = tuple(badlist)
badset = set(badlist)
# List comprehension
%timeit [w for w in words if w not in badlist]
10 loops, best of 3: 59.2 ms per loop
%timeit [w for w in words if w not in badtuple]
10 loops, best of 3: 64.7 ms per loop
%timeit [w for w in words if w not in badset]
10 loops, best of 3: 30.3 ms per loop
# Filter
%timeit filter(lambda w: w not in badlist, words)
10 loops, best of 3: 85.6 ms per loop
%timeit filter(lambda w: w not in badtuple, words)
10 loops, best of 3: 92.1 ms per loop
%timeit filter(lambda w: w not in badset, words)
10 loops, best of 3: 50.8 ms per loop

Conclusion: Probably, using list comprehension with not in <set> as filter condition would be best.

But as I've said, this benchmark is bogus, and you need to repeat some benchmarks on the actual kind of data you'll encounter to see which is better.


Some idea on why list comprehension + "not in set" is probably optimal.

  1. filter vs list comprehension: filter actually calls the input callable, and callable-calling in Python has its own overhead (creating stack frame, etc.) Also filter tries to be smart and return the correct type, which adds to overhead. (This is actually infinitesimally small) Instead, list comprehension's condition check (the if ... clause) has less overhead than a call. It's just expression evaluation without the full bells and whistles of Python call stack.
  2. Test for set membership is O(1) in average case and O(n) in worst case, but list/tuple membership is always O(n).
like image 184
Cong Ma Avatar answered Nov 18 '22 09:11

Cong Ma


You can use list comprehension, something like:

words = [word for word in words if word not in ["a", "y", "Y", "q", "Q", "p", "P", "uh"]]

list comprehension tends to give much better performance.

Edit (thanks to results from Cong Ma):

Seems the best performance is from using a set as filter sequence, so you want something more like:

words = [word for word in words if word not in set(("a", "y", "Y", "q", "Q", "P", "uh"))]
like image 3
Paul Evans Avatar answered Nov 18 '22 10:11

Paul Evans


"but from time to time i get those single characters."

I think the logic is poor in here. You should eliminate it when inserted word into the list. Removing it after a lengthy List is a bad choice after all.

I've face into the same problem, at first my solution is using pypy

I think there's a problem in pypy at that time (my code suddenly exited), so i change the code with a better logic, and use ordinary python.

like image 2
nafsaka Avatar answered Nov 18 '22 08:11

nafsaka