GroupBy pandas DataFrame and select most common value

I have a data frame with three string columns. I know that only one value in the third column is valid for every combination of the first two. To clean the data, I have to group the data frame by the first two columns and select the most common value of the third column for each combination.

My code:

    import pandas as pd
    from scipy import stats

    source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
                           'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                           'Short name' : ['NY','New','Spb','NY']})

    print source.groupby(['Country','City']).agg(lambda x: stats.mode(x['Short name'])[0])

The last line of code doesn't work; it says "Key error 'Short name'", and if I try to group only by City, I get an AssertionError. What can I do to fix it?

755

asked Mar 05 '13 11:03

Viacheslav Nefedov

1 Answer

Pandas >= 0.16

`pd.Series.mode` is available!

Use groupby, GroupBy.agg, and apply the pd.Series.mode function to each group:

    source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

    Country  City
    Russia   Sankt-Petersburg    Spb
    USA      New-York             NY
    Name: Short name, dtype: object

If this is needed as a DataFrame, use

    source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()

                             Short name
    Country City
    Russia  Sankt-Petersburg        Spb
    USA     New-York                 NY
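As background on why the code in the question fails while selecting ['Short name'] before agg works, here is a minimal sketch. It assumes a reasonably recent pandas and two grouping keys, as in the question; in that setup, agg on the whole grouped frame hands the function one column at a time as a Series, so x['Short name'] becomes a label lookup in that Series' index rather than a column selection. The name `received` is only illustrative.

```python
import pandas as pd

source = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY'],
})

grouped = source.groupby(['Country', 'City'])

# Record the type of object the aggregation function receives for each group.
received = grouped.agg(lambda x: type(x).__name__)

# Selecting the column first means each group arrives as a Series of
# 'Short name' values, so no further column lookup is needed.
modes = grouped['Short name'].agg(pd.Series.mode)
```

Inspecting `received` should show the type name of a Series in every cell, which is why indexing the argument with 'Short name' inside the lambda raises a KeyError.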

The useful thing about Series.mode is that it always returns a Series, making it very compatible with agg and apply, especially when reconstructing the groupby output. It is also faster.

    # Accepted answer.
    %timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
    # Proposed in this post.
    %timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

    5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
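The claim that Series.mode always returns a Series can be checked directly on small inputs. The sketch below is only an illustration; the variable names are made up for this example.

```python
import pandas as pd

# One clear winner: the result is still a Series, of length 1.
single = pd.Series(['NY', 'New', 'NY']).mode()

# A tie: both values come back in the same Series.
tied = pd.Series(['NY', 'New', 'NY', 'New']).mode()

# No data at all: the result is an empty Series rather than an error.
empty = pd.Series([], dtype=object).mode()

print(type(single).__name__, len(single), len(tied), len(empty))
```

Because the empty case is possible, any code that takes the first element of the result should expect that it may be missing for a group with no usable values.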

Dealing with Multiple Modes

Series.mode also does a good job when there are multiple modes:

    # Note: DataFrame.append was removed in pandas 2.0; pd.concat is its
    # replacement there.
    source2 = source.append(
        pd.Series({'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}),
        ignore_index=True)

    # Now `source2` has two modes for the
    # ("USA", "New-York") group, they are "NY" and "New".
    source2

      Country              City Short name
    0     USA          New-York         NY
    1     USA          New-York        New
    2  Russia  Sankt-Petersburg        Spb
    3     USA          New-York         NY
    4     USA          New-York        New

    source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

    Country  City
    Russia   Sankt-Petersburg          Spb
    USA      New-York            [NY, New]
    Name: Short name, dtype: object

Or, if you want a separate row for each mode, you can use GroupBy.apply:

    source2.groupby(['Country','City'])['Short name'].apply(pd.Series.mode)

    Country  City
    Russia   Sankt-Petersburg  0    Spb
    USA      New-York          0     NY
                               1    New
    Name: Short name, dtype: object

If you don't care which mode is returned as long as it's one of them, then you will need a lambda that calls mode and extracts the first result.

    source2.groupby(['Country','City'])['Short name'].agg(
        lambda x: pd.Series.mode(x)[0])

    Country  City
    Russia   Sankt-Petersburg    Spb
    USA      New-York             NY
    Name: Short name, dtype: object
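Since the goal in the question is to clean the data rather than only summarize it, one possible follow-up is to write the chosen mode back onto every row with transform. This is a sketch under assumptions: `first_mode` is a hypothetical helper (not a pandas function), the frame is rebuilt here so the example runs on its own, and ties are resolved by taking whichever mode comes first in the output of mode().

```python
import pandas as pd

# Same rows as `source2` above, built directly so the sketch is self-contained.
source2 = pd.DataFrame({
    'Country': ['USA', 'USA', 'Russia', 'USA', 'USA'],
    'City': ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York', 'New-York'],
    'Short name': ['NY', 'New', 'Spb', 'NY', 'New'],
})


def first_mode(values):
    """Return one mode of `values` (hypothetical helper for this example)."""
    return values.mode().iloc[0]


# transform broadcasts the single value chosen per group back to every row
# of that group, so the result lines up with the original index.
cleaned = source2.copy()
cleaned['Short name'] = (
    source2.groupby(['Country', 'City'])['Short name'].transform(first_mode)
)

print(cleaned)
```

After this, every row of a given (Country, City) pair carries the same short name, while the original frame is left untouched.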

Alternatives to (not) consider

You can also use statistics.mode from the Python standard library, but...

    import statistics

    source.groupby(['Country','City'])['Short name'].apply(statistics.mode)

    Country  City
    Russia   Sankt-Petersburg    Spb
    USA      New-York             NY
    Name: Short name, dtype: object

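...it does not work well when having to deal with multiple modes; a StatisticsError is raised. (This is the behavior on Python versions before 3.8; from 3.8 onward, statistics.mode returns the first mode encountered instead.) This is mentioned in the docs for those older versions: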

If data is empty, or if there is not exactly one most common value, StatisticsError is raised.

But you can see for yourself...

    statistics.mode([1, 2])
    # ---------------------------------------------------------------------------
    # StatisticsError                           Traceback (most recent call last)
    # ...
    # StatisticsError: no unique mode; found 2 equally common values
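The traceback above depends on the Python version: before 3.8 statistics.mode raises on ties, while 3.8 and later return the first mode encountered. A small sketch that tolerates both behaviors, and uses statistics.multimode (available from Python 3.8) to get every tied value, could look like this. The variable names are illustrative only.

```python
import statistics

values = [1, 2]

# Works on either side of the 3.8 behavior change: older versions raise
# StatisticsError on a tie, newer ones return the first mode encountered.
try:
    picked = statistics.mode(values)
except statistics.StatisticsError:
    picked = None

# multimode (Python 3.8+) returns all tied values as a list.
all_modes = statistics.multimode(values)

print(picked, all_modes)
```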

104

answered Oct 22 '22 03:10

cs95

Tags: python, pandas, group-by, mode, pandas-groupby
