I have a large (about 12M rows) DataFrame df with the columns:

df.columns = ['word','documents','frequency']
The following ran in a timely fashion:
word_grouping = df[['word','frequency']].groupby('word')
MaxFrequency_perWord = word_grouping[['frequency']].max().reset_index()
MaxFrequency_perWord.columns = ['word','MaxFrequency']
However, this is taking an unexpectedly long time to run:
Occurrences_of_Words = word_grouping[['word']].count().reset_index()
What am I doing wrong here? Is there a better way to count occurrences in a large DataFrame?
df.word.describe()

ran pretty well, so I really did not expect this Occurrences_of_Words DataFrame to take very long to build.
Using the size() or count() method with pandas.DataFrame.groupby() will give the number of occurrences of each value in a particular column of the DataFrame.
Vectorization is always the first and best choice. If you do need to iterate, you can convert the DataFrame to a NumPy array or to a dictionary to speed up the workflow; iterating over the key-value pairs of a dictionary turns out to be the fastest option, with around a 280x speedup for 20 million records.
To get the most frequent value of a column, we can use the mode() method. It returns the value that appears most often, and it can return multiple values in case of a tie.
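For illustration, here is a minimal sketch of those approaches on a toy frame shaped like the one in the question (the column names come from the question; the sample data is made up):

import pandas as pd

# Toy stand-in for the 12M-row frame (data made up for illustration).
df = pd.DataFrame({
    'word':      ['cat', 'dog', 'cat', 'bird', 'dog', 'cat'],
    'documents': [1, 1, 2, 2, 3, 3],
    'frequency': [4, 2, 7, 1, 5, 3],
})

# Occurrences of each word via groupby: size() counts all rows per group,
# count() counts non-null values per group (same numbers here, no NaNs).
occurrences_size = df.groupby('word').size()
occurrences_count = df.groupby('word')['word'].count()

# Most frequent word(s); mode() may return more than one value on ties.
most_frequent = df['word'].mode()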
I think df['word'].value_counts() should serve. By skipping the groupby machinery, you'll save some time. I'm not sure why count should be much slower than max; both take some time to avoid missing values. (Compare with size.)
In any case, value_counts has been specifically optimized to handle object type, like your words, so I doubt you'll do much better than that.
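A minimal, self-contained sketch of the value_counts approach (the column names are taken from the question; the sample data is made up, and the output column name 'Occurrences' is just an illustrative choice):

import pandas as pd

# Toy stand-in for the 12M-row frame in the question.
df = pd.DataFrame({'word': ['cat', 'dog', 'cat', 'bird', 'dog', 'cat'],
                   'documents': [1, 1, 2, 2, 3, 3],
                   'frequency': [4, 2, 7, 1, 5, 3]})

# value_counts() counts how many rows each word appears in,
# skipping the groupby machinery entirely.
Occurrences_of_Words = df['word'].value_counts().reset_index()
Occurrences_of_Words.columns = ['word', 'Occurrences']
print(Occurrences_of_Words)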