I have a list with lenght of: 370000
. In this list i have items like: "a", "y", "Y", "q", "Q", "p", "P",
, meaning this is a list of words but from time to time i get those single characters.
I want to remove those characters from the list, i am pretty new in python but the first thing that came to my mind was to do something like:
for word in words:
if word== 'm' or word== 'y' or word== 'Y' or word== 'p' or word== 'Q' or word== 'q' or word== 'a' or word== 'uh':
words.remove(word)
In a list with 370.000 items this method is taking a loooot of time. Seriously, a lot.
Does anyone have another awesome idea on how to get a better performance?
Thanks in advance.
Tried some bogo-benchmarks in IPython.
import random
# Don't know how to generate your words, use integers as substitute.
words = [random.randint(0, 25) for i in xrange(370000)]
badlist = range(7)
badtuple = tuple(badlist)
badset = set(badlist)
# List comprehension
%timeit [w for w in words if w not in badlist]
10 loops, best of 3: 59.2 ms per loop
%timeit [w for w in words if w not in badtuple]
10 loops, best of 3: 64.7 ms per loop
%timeit [w for w in words if w not in badset]
10 loops, best of 3: 30.3 ms per loop
# Filter
%timeit filter(lambda w: w not in badlist, words)
10 loops, best of 3: 85.6 ms per loop
%timeit filter(lambda w: w not in badtuple, words)
10 loops, best of 3: 92.1 ms per loop
%timeit filter(lambda w: w not in badset, words)
10 loops, best of 3: 50.8 ms per loop
Conclusion: Probably, using list comprehension with not in <set>
as filter condition would be best.
But as I've said, this benchmark is bogus, and you need to repeat some benchmarks on the actual kind of data you'll encounter to see which is better.
Some idea on why list comprehension + "not in set" is probably optimal.
filter
vs list comprehension: filter
actually calls the input callable, and callable-calling in Python has its own overhead (creating stack frame, etc.) filter
tries to be smart and return the correct type, which adds to overhead.if ...
clause) has less overhead than a call. It's just expression evaluation without the full bells and whistles of Python call stack.You can use list comprehension, something like:
words = [word for word in words if word not in ["a", "y", "Y", "q", "Q", "p", "P", "uh"]]
list comprehension tends to give much better performance.
Edit (thanks to results from Cong Ma):
Seems the best performance is from using a set
as filter sequence, so you want something more like:
words = [word for word in words if word not in set(("a", "y", "Y", "q", "Q", "P", "uh"))]
"but from time to time i get those single characters."
I think the logic is poor in here. You should eliminate it when inserted word into the list. Removing it after a lengthy List is a bad choice after all.
I've face into the same problem, at first my solution is using pypy
I think there's a problem in pypy at that time (my code suddenly exited), so i change the code with a better logic, and use ordinary python.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With