I have a large (about 12M rows) DataFrame df with the columns:

df.columns = ['word','documents','frequency']
The following ran in a timely fashion:
word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']
However, this is taking an unexpectedly long time to run:
Occurrences_of_Words = word_grouping[['word']].count().reset_index()
What am I doing wrong here? Is there a better way to count occurrences in a large DataFrame?
df.word.describe()

ran pretty well, so I really did not expect this Occurrences_of_Words DataFrame to take very long to build.
Using the size() or count() method with pandas.DataFrame.groupby() will give the number of occurrences of each value in a particular column of the DataFrame.
Vectorization is always the first and best choice. If you do need to iterate, you can convert the DataFrame to a NumPy array or to a dictionary to speed up the workflow; iterating over the key-value pairs of a dictionary turns out to be the fastest option, with around a 280x speedup for 20 million records.
To get the most frequent value of a column, we can use the mode() method. It returns the value that appears most often, and it can return multiple values in case of a tie.
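For illustration, here is a minimal sketch of those approaches on a toy frame shaped like the one in the question (the column names come from the question; the sample data is made up):

import pandas as pd

# Toy stand-in for the 12M-row frame (data made up for illustration).
df = pd.DataFrame({
    'word':      ['cat', 'dog', 'cat', 'bird', 'dog', 'cat'],
    'documents': [1, 1, 2, 2, 3, 3],
    'frequency': [4, 2, 7, 1, 5, 3],
})

# Occurrences of each word via groupby: size() counts all rows per group,
# count() counts non-null values per group (same numbers here, no NaNs).
occurrences_size = df.groupby('word').size()
occurrences_count = df.groupby('word')['word'].count()

# Most frequent word(s); mode() may return more than one value on ties.
most_frequent = df['word'].mode()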
I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be much slower than max; both take some time to avoid missing values. (Compare with size.)
In any case, value_counts has been specifically optimized to handle object type, like your words, so I doubt you'll do much better than that.
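A minimal, self-contained sketch of the value_counts approach (the column names are taken from the question; the sample data is made up, and the output column name 'Occurrences' is just an illustrative choice):

import pandas as pd

# Toy stand-in for the 12M-row frame in the question.
df = pd.DataFrame({'word': ['cat', 'dog', 'cat', 'bird', 'dog', 'cat'],
                   'documents': [1, 1, 2, 2, 3, 3],
                   'frequency': [4, 2, 7, 1, 5, 3]})

# value_counts() counts how many rows each word appears in,
# skipping the groupby machinery entirely.
Occurrences_of_Words = df['word'].value_counts().reset_index()
Occurrences_of_Words.columns = ['word', 'Occurrences']
print(Occurrences_of_Words)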