
Most Frequent Words from Sentences grouped by category

Tags: python, pandas

I'm trying to find the 10 most frequent words in each category. I've already seen this answer, but I can't quite adapt it to get the output I want.

category | sentence
  A           cat runs over big dog
  A           dog runs over big cat
  B           random sentences include words
  C           including this one

Output desired:

category | word/frequency
   A           runs: 2
               cat: 2
               dog: 2
               over: 2
               big: 2
   B           random: 1
   C           including: 1

Since my dataframe is quite large, I'd like to get only the 10 most frequent words per category. I've also seen this answer:

df.groupby('subreddit').agg(lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist]))

but this method returns counts of letters as well.
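The letter counts are what one would expect if each cell holds a plain string rather than a list of words: iterating over a string yields characters. Below is a minimal sketch of that effect. It uses `collections.Counter` in place of `nltk.FreqDist` (an assumption made to keep the example dependency-free; `FreqDist` is a `Counter` subclass), and the two sample sentences are taken from category A above.

```python
from collections import Counter

# Two sentences from category A in the question.
sentences = ["cat runs over big dog", "dog runs over big cat"]

# Iterating over a string yields single characters, so this counts
# letters (and spaces), not words.
by_char = Counter(w for sentence in sentences for w in sentence)

# Splitting each sentence on whitespace first yields words instead.
by_word = Counter(w for sentence in sentences for w in sentence.split())

print(by_char.most_common(3))
print(by_word.most_common(3))
```

Splitting (or otherwise tokenizing) each sentence before counting is what the approach in the answer below relies on.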

Curious Student asked Dec 13 '22 14:12



1 Answer

If you want to keep only the most frequent words in each category, the following line will do (here, the 2 most frequent words per category):

from collections import Counter

df.groupby("category")["sentence"].apply(lambda x: Counter(" ".join(x).split()).most_common(2))

category
A            [(cat, 2), (runs, 2)]
B    [(random, 1), (sentences, 1)]
C      [(including, 1), (this, 1)]
Name: sentence, dtype: object
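To get closer to the layout asked for in the question (one row per category and word), the per-category lists can be flattened into a long-format frame. The following is a minimal sketch, not a definitive implementation: the sample data is copied from the question, while `TOP_N`, `top_words`, and the output column names `word` and `frequency` are hypothetical names chosen for illustration.

```python
from collections import Counter

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "category": ["A", "A", "B", "C"],
        "sentence": [
            "cat runs over big dog",
            "dog runs over big cat",
            "random sentences include words",
            "including this one",
        ],
    }
)

TOP_N = 10  # hypothetical constant; the question asks for the top 10


def top_words(sentences, n=TOP_N):
    """Return up to n (word, count) pairs for whitespace-separated words."""
    return Counter(" ".join(sentences).split()).most_common(n)


# Series indexed by category, each value a list of (word, count) tuples.
top = df.groupby("category")["sentence"].apply(top_words)

# Flatten to one row per (category, word) pair.
rows = [
    (category, word, count)
    for category, pairs in top.items()
    for word, count in pairs
]
result = pd.DataFrame(rows, columns=["category", "word", "frequency"])
print(result)
```

Splitting on whitespace keeps punctuation attached to words and is case-sensitive; if that matters for the real data, the text would need to be normalized before counting.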

Performance-wise, compared with an NLTK-based version:

%timeit df.groupby("category")["sentence"].apply(lambda x: Counter(" ".join(x).split()).most_common(2))
2.07 ms ± 87.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.groupby('category')['sentence'].apply(lambda x: nltk.FreqDist(nltk.tokenize.word_tokenize(' '.join(x))))
4.96 ms ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
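`%timeit` is an IPython magic, so the lines above only run in IPython or a notebook. In a plain script, a similar measurement can be sketched with the standard `timeit` module. This is an illustrative sketch under stated assumptions: the sample data is copied from the question, `counter_top2` and `runs` are hypothetical names, and the numbers will differ by machine and data size. Note also that the NLTK line as quoted builds the full frequency distribution and uses `word_tokenize`, so the two timings above do not measure exactly the same work.

```python
import timeit
from collections import Counter

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame(
    {
        "category": ["A", "A", "B", "C"],
        "sentence": [
            "cat runs over big dog",
            "dog runs over big cat",
            "random sentences include words",
            "including this one",
        ],
    }
)


def counter_top2():
    """The Counter-based one-liner from the answer, wrapped for timing."""
    return df.groupby("category")["sentence"].apply(
        lambda x: Counter(" ".join(x).split()).most_common(2)
    )


runs = 100  # hypothetical repeat count
seconds = timeit.timeit(counter_top2, number=runs)
print(f"{seconds / runs * 1e3:.3f} ms per loop (mean of {runs} runs)")
```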
Sergey Bushmanov answered Jan 19 '23 02:01
