
GroupBy pandas DataFrame and select most common value

I have a data frame with three string columns. I know that only one value in the third column is valid for every combination of the first two. To clean the data, I have to group the data frame by the first two columns and select the most common value of the third column for each combination.

My code:

    import pandas as pd
    from scipy import stats

    source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
                      'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                      'Short name' : ['NY','New','Spb','NY']})

    print source.groupby(['Country','City']).agg(lambda x: stats.mode(x['Short name'])[0])

The last line of code doesn't work: it raises KeyError 'Short name', and if I try to group only by City, I get an AssertionError. What can I do to fix it?

Viacheslav Nefedov asked Mar 05 '13 11:03




1 Answer

Pandas >= 0.16

pd.Series.mode is available!

Use groupby, GroupBy.agg, and apply the pd.Series.mode function to each group:

    source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

    Country  City
    Russia   Sankt-Petersburg    Spb
    USA      New-York             NY
    Name: Short name, dtype: object

If this is needed as a DataFrame, use

    source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()

                             Short name
    Country City
    Russia  Sankt-Petersburg        Spb
    USA     New-York                 NY
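If a flat table with ordinary columns is more convenient than an indexed one, `reset_index` is one option. The following is a minimal, self-contained sketch that rebuilds the question's `source` frame; the variable name `modes` is only illustrative and not part of the original answer.

```python
import pandas as pd

# Rebuild the example frame from the question.
source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY'],
})

# Most common 'Short name' per (Country, City), with the group keys
# moved from the index back into regular columns.
modes = (
    source.groupby(['Country', 'City'])['Short name']
    .agg(pd.Series.mode)
    .reset_index()
)
print(modes)
```

This relies on every group having a single mode, as it does in the example data.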

The useful thing about Series.mode is that it always returns a Series, making it very compatible with agg and apply, especially when reconstructing the groupby output. It is also faster.

    # Accepted answer.
    %timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
    # Proposed in this post.
    %timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

    5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Dealing with Multiple Modes

Series.mode also does a good job when there are multiple modes:

    # pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
    source2 = pd.concat(
        [source,
         pd.DataFrame([{'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}])],
        ignore_index=True)

    # Now `source2` has two modes for the
    # ("USA", "New-York") group, they are "NY" and "New".
    source2

      Country              City Short name
    0     USA          New-York         NY
    1     USA          New-York        New
    2  Russia  Sankt-Petersburg        Spb
    3     USA          New-York         NY
    4     USA          New-York        New

    source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

    Country  City
    Russia   Sankt-Petersburg          Spb
    USA      New-York            [NY, New]
    Name: Short name, dtype: object
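The result above mixes a plain string with an array, which can be awkward to process downstream. One possible workaround, sketched here as an assumption rather than as part of the original answer, is to return a list of modes for every group so that each cell has the same type. The name `modes_as_lists` is illustrative.

```python
import pandas as pd

# The five-row frame with a tie in the ("USA", "New-York") group.
source2 = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY', 'New'],
})

# Always return a list, even when a group has exactly one mode.
modes_as_lists = (
    source2.groupby(['Country', 'City'])['Short name']
    .agg(lambda s: s.mode().tolist())
)
print(modes_as_lists)
```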

Or, if you want a separate row for each mode, you can use GroupBy.apply:

    source2.groupby(['Country','City'])['Short name'].apply(pd.Series.mode)

    Country  City
    Russia   Sankt-Petersburg  0    Spb
    USA      New-York          0     NY
                               1    New
    Name: Short name, dtype: object

If any one of the modes will do, use a lambda that calls mode and extracts the first result:

    source2.groupby(['Country','City'])['Short name'].agg(
        lambda x: pd.Series.mode(x)[0])

    Country  City
    Russia   Sankt-Petersburg    Spb
    USA      New-York             NY
    Name: Short name, dtype: object
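Since the question's goal is to clean the column rather than just summarize it, the same "first mode" idea can be combined with `GroupBy.transform`, which broadcasts each group's value back onto the original rows. This is a minimal sketch, not taken from the original answer; `cleaned` is an illustrative name, and the example uses the question's four-row frame.

```python
import pandas as pd

source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY'],
})

# Overwrite every 'Short name' with the most common value of its
# (Country, City) group; the row count and order stay unchanged.
cleaned = source.copy()
cleaned['Short name'] = (
    source.groupby(['Country', 'City'])['Short name']
    .transform(lambda s: s.mode().iloc[0])
)
print(cleaned)
```

When a group has tied modes, `.iloc[0]` simply takes whichever mode comes first in the Series returned by `mode`, so this is only suitable if any of the tied values is acceptable.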

Alternatives to (not) consider

You can also use statistics.mode from Python's standard library, but...

    import statistics

    source.groupby(['Country','City'])['Short name'].apply(statistics.mode)

    Country  City
    Russia   Sankt-Petersburg    Spb
    USA      New-York             NY
    Name: Short name, dtype: object

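...it does not work well with multiple modes on Python versions before 3.8, where a StatisticsError is raised (from 3.8 onward, statistics.mode returns the first mode encountered instead). The docs for those earlier versions state: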

If data is empty, or if there is not exactly one most common value, StatisticsError is raised.

But you can see for yourself (on Python < 3.8)...

    statistics.mode([1, 2])
    # ---------------------------------------------------------------------------
    # StatisticsError                           Traceback (most recent call last)
    # ...
    # StatisticsError: no unique mode; found 2 equally common values
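On Python 3.8 or later, the standard library also provides `statistics.multimode`, which returns every mode as a list instead of raising. A minimal sketch follows (not part of the original answer; the sample lists are made up to mirror the "NY"/"New" example):

```python
import statistics

# Two equally common values: both are returned, in order of first appearance.
tied = statistics.multimode(['NY', 'New', 'NY', 'New'])

# A single most common value still comes back as a one-element list.
single = statistics.multimode(['NY', 'New', 'NY'])

print(tied, single)
```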
cs95 answered Oct 22 '22 03:10
