
Most Frequent Words from Sentences grouped by category




I'm trying to get the 10 most frequent words in each category. I've seen this answer already, but I can't quite adapt it to get the output I want.

category | sentence
  A           cat runs over big dog
  A           dog runs over big cat
  B           random sentences include words
  C           including this one

Output desired:

category | word/frequency
   A           runs: 2
               cat: 2
               dog: 2
               over: 2
               big: 2
   B           random: 1
   C           including: 1

Since my dataframe is quite large, I'd like to get only the 10 most frequent words per category. I've also seen this answer:

df.groupby('subreddit').agg(lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist]))

but this method returns counts of letters as well.
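The letter counts most likely come from iterating over a string, which in Python yields its characters: if the column holds plain strings rather than lists of tokens, the inner loop walks each sentence character by character. A minimal sketch of the difference, assuming plain-string sentences and using collections.Counter in place of nltk.FreqDist (a swap made only to keep the example dependency-free):

```python
from collections import Counter

# Two sample sentences taken from category A in the question.
sentences = ["cat runs over big dog", "dog runs over big cat"]

# Iterating over a string yields single characters, so this counts
# letters (and spaces) rather than words.
char_counts = Counter(w for s in sentences for w in s)

# Splitting each sentence on whitespace first yields whole words.
word_counts = Counter(w for s in sentences for w in s.split())

print(char_counts.most_common(3))
print(word_counts.most_common(3))
```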

Curious Student asked Dec 13 '22 14:12

1 Answer

If you want only the most frequent words per category, the following line will do it (here, the 2 most frequent words for each category; pass 10 to most_common for the top 10):

from collections import Counter

df.groupby("category")["sentence"].apply(lambda x: Counter(" ".join(x).split()).most_common(2))

A            [(cat, 2), (runs, 2)]
B    [(random, 1), (sentences, 1)]
C      [(including, 1), (this, 1)]
Name: sentence, dtype: object
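To get closer to the layout in the question (one row per category and word), the list of tuples can be flattened into a long-format frame. This is only a sketch: the sample frame is rebuilt from the question, and the names TOP_N, top_words and long_df are hypothetical, introduced here for illustration.

```python
from collections import Counter

import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame({
    "category": ["A", "A", "B", "C"],
    "sentence": [
        "cat runs over big dog",
        "dog runs over big cat",
        "random sentences include words",
        "including this one",
    ],
})

TOP_N = 10  # hypothetical cutoff; the question asks for the top 10


def top_words(sentences, n=TOP_N):
    """Return the n most common whitespace-separated words as (word, count) pairs."""
    return Counter(" ".join(sentences).split()).most_common(n)


# One list of (word, count) tuples per category, as in the one-liner above.
top = df.groupby("category")["sentence"].apply(top_words)

# Flatten into one row per (category, word) pair.
long_df = pd.DataFrame(
    [(cat, word, count) for cat, pairs in top.items() for word, count in pairs],
    columns=["category", "word", "count"],
)
print(long_df)
```

Splitting on whitespace does not strip punctuation or normalize case, so real text may need extra cleaning before counting.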

Performance-wise:

%timeit df.groupby("category")["sentence"].apply(lambda x: Counter(" ".join(x).split()).most_common(2))
2.07 ms ± 87.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.groupby('category')['sentence'].apply(lambda x: nltk.FreqDist(nltk.tokenize.word_tokenize(' '.join(x))))
4.96 ms ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Sergey Bushmanov answered Jan 19 '23 02:01