I have a data frame with three string columns. I know that only one value in the 3rd column is valid for every combination of the first two. To clean the data, I have to group the data frame by the first two columns and select the most common value of the third column for each combination.
My code:
import pandas as pd
from scipy import stats

source = pd.DataFrame({'Country' : ['USA', 'USA', 'Russia','USA'],
                       'City' : ['New-York', 'New-York', 'Sankt-Petersburg', 'New-York'],
                       'Short name' : ['NY','New','Spb','NY']})

print source.groupby(['Country','City']).agg(lambda x: stats.mode(x['Short name'])[0])
The last line of code doesn't work: it raises a KeyError for 'Short name', and if I try to group only by City, I get an AssertionError. What can I do to fix it?
To count how many times each value appears, use the pandas value_counts() method. The mode() method returns the most frequently occurring value(s) directly.
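As a minimal sketch of the two methods on a single column (the sample values here are made up for illustration and are not the asker's full data):

```python
import pandas as pd

# Toy data, made up for illustration.
names = pd.Series(["NY", "New", "NY"])

# value_counts() tallies each distinct value, most frequent first.
counts = names.value_counts()
print(counts)

# mode() returns a Series holding the most frequent value(s).
most_common = names.mode()
print(most_common.tolist())
```

Because mode() returns a Series rather than a scalar, it can hold more than one value when there is a tie.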
By using apply and specifying axis=1, we can run a function on every row of a DataFrame. This solution still loops over the rows, but apply is generally faster than iterrows.
Pandas groupby mean: to get the average (or mean) value in each group, you can apply the pandas mean() function directly to the selected columns from the result of groupby.
pd.Series.mode is available! Use groupby, GroupBy.agg, and apply the pd.Series.mode function to each group:
source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object
If this is needed as a DataFrame, use
source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode).to_frame()

                         Short name
Country City
Russia  Sankt-Petersburg        Spb
USA     New-York                 NY
The useful thing about Series.mode is that it always returns a Series, making it very compatible with agg and apply, especially when reconstructing the groupby output. It is also faster.
# Accepted answer.
%timeit source.groupby(['Country','City']).agg(lambda x:x.value_counts().index[0])
# Proposed in this post.
%timeit source.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

5.56 ms ± 343 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.76 ms ± 387 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Series.mode also does a good job when there are multiple modes:
# DataFrame.append was removed in pandas 2.0; pd.concat adds the row instead.
source2 = pd.concat(
    [source, pd.DataFrame([{'Country': 'USA', 'City': 'New-York', 'Short name': 'New'}])],
    ignore_index=True)

# Now `source2` has two modes for the
# ("USA", "New-York") group, they are "NY" and "New".
source2

  Country              City Short name
0     USA          New-York         NY
1     USA          New-York        New
2  Russia  Sankt-Petersburg        Spb
3     USA          New-York         NY
4     USA          New-York        New
source2.groupby(['Country','City'])['Short name'].agg(pd.Series.mode)

Country  City
Russia   Sankt-Petersburg          Spb
USA      New-York            [NY, New]
Name: Short name, dtype: object
Or, if you want a separate row for each mode, you can use GroupBy.apply:

source2.groupby(['Country','City'])['Short name'].apply(pd.Series.mode)

Country  City
Russia   Sankt-Petersburg  0    Spb
USA      New-York          0     NY
                           1    New
Name: Short name, dtype: object
If you don't care which mode is returned as long as it's either one of them, then you will need a lambda that calls mode and extracts the first result.

source2.groupby(['Country','City'])['Short name'].agg(
    lambda x: pd.Series.mode(x)[0])

Country  City
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object
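Since the original goal was to clean the column rather than just summarize it, here is one possible sketch that writes the first mode back onto every row with GroupBy.transform. The helper name first_mode and the None fallback for groups with no mode are assumptions made for illustration; they are not part of the answer above.

```python
import pandas as pd

# Same data as `source2` above, rebuilt so the snippet is self-contained.
source2 = pd.DataFrame({
    "Country": ["USA", "USA", "Russia", "USA", "USA"],
    "City": ["New-York", "New-York", "Sankt-Petersburg", "New-York", "New-York"],
    "Short name": ["NY", "New", "Spb", "NY", "New"],
})


def first_mode(values):
    """Hypothetical helper: first mode of a Series, or None if there is none."""
    modes = values.mode()  # sorted; empty when every value is NaN
    return modes.iloc[0] if not modes.empty else None


# transform broadcasts the per-group result back to the original row positions.
cleaned = source2.copy()
cleaned["Short name"] = (
    cleaned.groupby(["Country", "City"])["Short name"].transform(first_mode)
)
print(cleaned)
```

With a tie, Series.mode returns the modes in sorted order, so this picks "NY" over "New" for the ("USA", "New-York") group. Whether that tie-break is acceptable depends on the data.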
You can also use statistics.mode from the Python standard library, but...

import statistics

source.groupby(['Country','City'])['Short name'].apply(statistics.mode)

Country  City
Russia   Sankt-Petersburg    Spb
USA      New-York             NY
Name: Short name, dtype: object
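...it does not work well when having to deal with multiple modes on Python versions before 3.8; a StatisticsError is raised. (From Python 3.8 onward, statistics.mode returns the first mode encountered instead.) This is mentioned in the docs for those older versions: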
If data is empty, or if there is not exactly one most common value, StatisticsError is raised.
But you can see for yourself...
statistics.mode([1, 2])
# ---------------------------------------------------------------------------
# StatisticsError                           Traceback (most recent call last)
# ...
# StatisticsError: no unique mode; found 2 equally common values
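If you are on Python 3.8 or later, the standard library also offers statistics.multimode, which returns every mode in a list instead of raising. A small sketch, with made-up sample values:

```python
import statistics

# multimode() was added in Python 3.8; it returns all modes in a list,
# in the order each value was first encountered.
tied_numbers = statistics.multimode([1, 2])
print(tied_numbers)

# Made-up sample mirroring the tied ("USA", "New-York") group.
tied_names = statistics.multimode(["NY", "New", "NY", "New"])
print(tied_names)
```

Unlike Series.mode, the result is ordered by first appearance rather than sorted.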