I'm trying to get the 10 most frequent words per category. I've seen this answer already, but I can't quite modify it to get the output I want.
category | sentence
A        | cat runs over big dog
A        | dog runs over big cat
B        | random sentences include words
C        | including this one
Output desired:
category | word: frequency
A        | runs: 2
         | cat: 2
         | dog: 2
         | over: 2
         | big: 2
B        | random: 1
C        | including: 1
Since my dataframe is quite large, I'd like to get only the top 10 most frequent words. I've also seen this answer
df.groupby('subreddit').agg(lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist]))
but this method returns counts of individual letters as well.
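(For what it's worth, the stray letter counts happen because a frequency counter tallies whatever elements it is handed, and iterating over a raw string yields characters. A minimal sketch of the pitfall, shown with the stdlib `collections.Counter` as a stand-in for `nltk.FreqDist`:)

```python
from collections import Counter

sentence = "cat runs over big dog"

# Counting the string directly iterates character by character,
# so the result is letter (and space) frequencies:
by_char = Counter(sentence)

# Splitting into tokens first gives word frequencies:
by_word = Counter(sentence.split())

print(by_char.most_common(3))
print(by_word.most_common(3))
```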
Should you want to filter by the frequency of the most common words, the following line will do (here, the 2 most frequent words for each category):
from collections import Counter
df.groupby("category")["sentence"].apply(lambda x: Counter(" ".join(x).split()).most_common(2))
category
A    [('cat', 2), ('runs', 2)]
B    [('random', 1), ('sentences', 1)]
C    [('including', 1), ('this', 1)]
Name: sentence, dtype: object
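If you then need the flat category / word / count layout from the question, the resulting Series of `(word, count)` tuples can be expanded row by row. A sketch, assuming pandas >= 0.25 for `Series.explode`:

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    "category": ["A", "A", "B", "C"],
    "sentence": [
        "cat runs over big dog",
        "dog runs over big cat",
        "random sentences include words",
        "including this one",
    ],
})

top = (
    df.groupby("category")["sentence"]
      .apply(lambda x: Counter(" ".join(x).split()).most_common(2))
      .explode()                        # one (word, count) tuple per row
      .apply(pd.Series)                 # split each tuple into two columns
      .rename(columns={0: "word", 1: "count"})
      .reset_index()
)
print(top)
```

Swap `most_common(2)` for `most_common(10)` on the real data to get the top 10 per category.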
Performance-wise:
%timeit df.groupby("category")["sentence"].apply(lambda x: Counter(" ".join(x).split()).most_common(2))
2.07 ms ± 87.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.groupby('category')['sentence'].apply(lambda x: nltk.FreqDist(nltk.tokenize.word_tokenize(' '.join(x))))
4.96 ms ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)