I have one list that contains about 400 words, and another list of lists in which each inner list contains about 150,000 words. There are 20 such inner lists.
Now I want to see how many of these 400 words appear in each of these 150,000-word lists. I also want to know, for a given word from the 400, how many times it appears in a 150k-word list, which of these words occur most often, how many times, and so on.
The only solution I can think of is a brute-force nested loop. It is a very bad solution and will be terribly slow:
for one_list in list_of_150kwords:
    for key in words_400:        # the 400 search words
        for word in one_list:
            if key == word:
                # count this word
                # do other stuff
This is a very ugly and slow solution, but I can't think of anything better. I tried the same thing with NumPy by converting these lists to NumPy arrays:
list_of_150kwords = numpy.array(list_of_150kwords)
...
But I still find it very slow. Any other solution? Or any library?
This sounds like a good opportunity for using a set:
words_400_set = set(words_400)          # the 400 search words
set_of_150kwords = set(one_list)        # one of the 150,000-word lists
len(words_400_set & set_of_150kwords)   # set intersection is &
# => number of the 400 words that appear in this list
By set theory, the intersection of two sets gives the elements that appear in both, so it's then a simple matter of taking its length. For the second part (which of these words occur most often, how many times, etc.), create a Counter from the 150k-word list; that will tell you how many times each word appears, and the intersection set will tell you which words are common, covering both of your requirements.
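For example, here is a minimal sketch for a single 150k-word list, reusing the question's names one_list (one 150k-word list) and words_400 (the 400 search words); "example" is just a placeholder word:
from collections import Counter

counts = Counter(one_list)               # word -> number of occurrences in this list
common = set(words_400) & set(counts)    # which of the 400 words appear in this list
print(len(common))                       # how many of the 400 appear here
print(counts["example"])                 # how often one particular word appears (0 if absent)
print(max(common, key=counts.get))       # the common word occurring most often in this list
Putting it all together for all 20 lists: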
import operator
from collections import Counter
from functools import reduce   # needed on Python 3; reduce is a builtin on Python 2

search_data = [
    ["list", "of", "150k", "words"],
    ["another", "list", "of", "150k", "words"],
    ["yet", "another", "list", "of", "150k", "words"]
    # ... 17 more of these
]

search_words = ["four", "hundred", "words", "to", "search", "for"]

def word_finder(words_to_find):
    lookfor = set(word.lower() for word in words_to_find)
    def get_word_count(text):
        # Count only the words we are looking for, case-insensitively
        return Counter(word for word in (wd.lower() for wd in text) if word in lookfor)
    return get_word_count

def get_words_in_common(counters):
    # Intersect the key sets of all the counters
    # (on Python 2 you might use c.viewkeys() instead of set(c))
    return reduce(operator.and_, (set(c) for c in counters))

def main():
    wordcount = word_finder(search_words)
    counters = [wordcount(wordlst) for wordlst in search_data]
    common_to_all = get_words_in_common(counters)
    print(common_to_all)

if __name__ == "__main__":
    main()
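If you also want the total number of occurrences of each search word across all 20 lists, and which one occurs most often overall, you can sum the per-list counters, since Counter supports addition and most_common(). A small sketch building on the counters list from main() above; the name totals is just illustrative:
totals = sum(counters, Counter())   # total count of each search word over all lists
print(totals["words"])              # how many times "words" appears in total
print(totals.most_common(5))        # the five most frequent of the search words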