I have a list of strings (words like), and, while I am parsing a text, I need to check if a word belongs to the group of words of my current list.
However, my input is pretty big (about 600 millions lines), and checking if an element belongs to a list is a O(n) operation according to the Python documentation.
My code is something like:
words_in_line = []
for word in line:
if word in my_list:
words_in_line.append(word)
As it takes too much time (days actually), I wanted to improve that part which is taking most of the time. I have a look at Python collections, and, more precisely, at deque. However, the only give a O(1) operation time access to the head and the tail of a list, not in the middle.
Do someone has an idea about how to do that in a better way?
You might consider a trie or a DAWG or a database. There are several Python implementations of the same.
Here is some relative timings for you to consider of a set vs a list:
import timeit
import random
with open('/usr/share/dict/words','r') as di: # UNIX 250k unique word list
all_words_set={line.strip() for line in di}
all_words_list=list(all_words_set) # slightly faster if this list is sorted...
test_list=[random.choice(all_words_list) for i in range(10000)]
test_set=set(test_list)
def set_f():
count = 0
for word in test_set:
if word in all_words_set:
count+=1
return count
def list_f():
count = 0
for word in test_list:
if word in all_words_list:
count+=1
return count
def mix_f():
# use list for source, set for membership testing
count = 0
for word in test_list:
if word in all_words_set:
count+=1
return count
print "list:", timeit.Timer(list_f).timeit(1),"secs"
print "set:", timeit.Timer(set_f).timeit(1),"secs"
print "mixed:", timeit.Timer(mix_f).timeit(1),"secs"
Prints:
list: 47.4126560688 secs
set: 0.00277495384216 secs
mixed: 0.00166988372803 secs
ie, matching a set of 10000 words against a set of 250,000 words is 17,085 X faster than matching a list of same 10000 words in a list of the same 250,000 words. Using a list for the source and a set for membership testing is 28,392 X faster than an unsorted list alone.
For membership testing, a list is O(n) and sets and dicts are O(1) for lookups.
Conclusion: Use better data structures for 600 million lines of text!
I'm not clear on why you chose a list in the first place, but here are some alternatives:
Using a set() is likely a good idea. This is very fast, though unordered, but sometimes that's exactly what's needed.
If you need things ordered and to have arbitrary lookups as well, you could use a tree of some sort: http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/
If set membership testing with a small number of false positives here or there is acceptable, you might check into a bloom filter: http://stromberg.dnsalias.org/~strombrg/drs-bloom-filter/
Depending on what you're doing, a trie might also be very good.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With