Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Comparing two large lists in python

I have one list which contains about 400 words. And another list of lists, in which each list contains about 150,000 words. This list has 20 such lists.

Now I want to see how many of these 400 words appear in all of these 150,000 words list. I also want to know a word from this 400 words, appear how many times in 150k words list, which of these words occur most, how many times etc.

Only solution I can think of is polynomial time solution. It is a very bad solution and will be hell lot slow:

for one_list in list_of_150kwords:
    for key in 400_words:
        for word in one_list:
            if key == word:
                # count this word
                # do other stuff

This is a very ugly and bad solution, but I can't think of any better. I tried the same with NumPy by converting these lists to NumPy arrays:

list_of_150kwords = numpy.array(list_of_150kwords)
...

But I still find it very slow. Any other solution? Or any library?

like image 835
avi Avatar asked Feb 15 '14 18:02

avi


2 Answers

This sounds like a good opportunity for using a set:

set_of_150kwords = set(list_of_150kwords)
one_set = set(one_list)

len(one_set & set_of_150kwords) # set intersection is &
=> number of elements common to both sets

As per set theory, the intersection of two sets gives the elements that appear in both sets, then it's a simple matter of taking its length. For the second part (which of these words occur most, how many times etc.) Create a Counter with list_of_150kwords, That will tell you how many times each word appears in the list. And the intersection set will tell you which are the common words, solving both of your requirements.

like image 64
Óscar López Avatar answered Nov 15 '22 01:11

Óscar López


from collections import Counter

search_data = [
    ["list", "of", "150k", "words"],
    ["another", "list", "of", "150k", "words"],
    ["yet", "another", "list", "of", "150k", "words"]
    # ... 17 more of these
]

search_words = ["four", "hundred", "words", "to", "search", "for"]

def word_finder(words_to_find):
    lookfor = set(word.lower() for word in words_to_find)
    def get_word_count(text):
        return Counter(word for word in (wd.lower() for wd in text) if word in lookfor)
    return get_word_count

def get_words_in_common(counters):
    # Maybe use c.viewkeys() instead of set(c)? Which is faster?
    return reduce(operator.and_, (set(c) for c in counters))

def main():
    wordcount = word_finder(search_words)
    counters = [wordcount(wordlst) for wordlst in search_data]
    common_to_all = get_words_in_common(counters)
    print(common_to_all)

if __name__=="__main__":
    main()
like image 34
Hugh Bothwell Avatar answered Nov 15 '22 01:11

Hugh Bothwell