Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Count strings in nested list

Tags:

python

list

I have a list of lists as follows.

sentences = [
    ["my", "first", "question", "in", "stackoverflow", "is", "my", "favorite"], 
    ["my", "favorite", "language", "is", "python"]
]

I want to get the count of each word in the sentences list. So, my output should look as follows.

{
    'stackoverflow': 1,
     'question': 1,
     'is': 2,
     'language': 1,
     'first': 1,
     'in': 1,
     'favorite': 2,
     'python': 1,
     'my': 3
}

I am currently doing it as follows.

frequency_input = [item for sublist in sentences for item in sublist]
frequency_output = dict(
    (x,frequency_input.count(x)) 
    for x in set(frequency_input)
)

However, it is not efficient at all for long lists. I have a really long list with about 1 million sentences in the list. It took me two days to run it, and it is still running.

In that case I would like to make my programme more efficient. My current first line of the code is O(n^2) and my second line is O(n). Please let me know if there is a more efficient way of doing this in python. It would be reaaly ideal if I could run it with lesser time than now. I am not worried about space complexity.

I am happy to provide more details if needed.

like image 520
EmJ Avatar asked Sep 06 '19 07:09

EmJ


2 Answers

A simpler and more performant approach would be to flatten the lists using itertools.chain, and to count the strings with collections.Counter:

from collections import Counter
from itertools import chain

Counter(chain.from_iterable(sentences))

Counter({'my': 3,
         'first': 1,
         'question': 1,
         'in': 1,
         'stackoverflow': 1,
         'is': 2,
         'favorite': 2,
         'language': 1,
         'python': 1})
like image 173
yatu Avatar answered Nov 05 '22 07:11

yatu


You can use Counter class from collections module.

if you want to learn the number of words in each sentence separately you can do as follows

from collections import Counter

sentences = [["my", "first", "question", "in", "stackoverflow", "is", "my", "favorite"], ["my", "favorite", "language", "is", "python"]]

counter_list = [dict(Counter(sentence)) for sentence in sentences]
print(counter_list)

Output:

[{'my': 2, 'first': 1, 'question': 1, 'in': 1, 'stackoverflow': 1, 'is': 1, 'favorite': 1}, {'my': 1, 'favorite': 1, 'language': 1, 'is': 1, 'python': 1}]

Or if you want total word counts you can use chain method from itertools module.

import itertools
from collections import Counter

sentences = [["my", "first", "question", "in", "stackoverflow", "is", "my", "favorite"], ["my", "favorite", "language", "is", "python"]]

sentences = list(itertools.chain.from_iterable(sentences))
word_counts = Counter(sentences)
print(word_counts)

Output:

Counter({'my': 3, 'is': 2, 'favorite': 2, 'first': 1, 'question': 1, 'in': 1, 'stackoverflow': 1, 'language': 1, 'python': 1})

The complexity of Counter object as documentation show, Counter is a dict subclass for counting hashable objects. So constructing counter object from an iterable has the time complexity of O(n)

like image 35
EmreAydin Avatar answered Nov 05 '22 06:11

EmreAydin