Given a data frame that looks like this:
GROUP VALUE
1 5
2 2
1 10
2 20
1 7
I would like to compute the difference between the largest and smallest value within each group. That is, the result should be
GROUP DIFF
1 5
2 18
What is an easy way to do this in Pandas?
What is a fast way to do this in Pandas for a data frame with about 2 million rows and 1 million groups?
Using @unutbu's df. Per the timings below, unutbu's solution is best over large data sets.
import pandas as pd
import numpy as np
df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})
df.groupby('GROUP')['VALUE'].agg(np.ptp)
GROUP
1 5
2 18
Name: VALUE, dtype: int64
np.ptp (see the docs) returns the range of an array, i.e. its maximum minus its minimum.
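As a quick sanity check, np.ptp ("peak to peak") applied to group 1's values from the question gives exactly the expected difference:

```python
import numpy as np

# Group 1's VALUEs from the question are 5, 10 and 7.
# np.ptp returns max - min of the array: 10 - 5 = 5.
print(np.ptp([5, 10, 7]))  # -> 5
```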
timing

small df
df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})

large df
df = pd.DataFrame(dict(GROUP=np.arange(1000000) % 100, VALUE=np.random.rand(1000000)))

large df, many groups
df = pd.DataFrame(dict(GROUP=np.arange(1000000) % 10000, VALUE=np.random.rand(1000000)))
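The timing comparisons above can be reproduced with the stdlib timeit module; here is a minimal sketch (sizes scaled down from the 1M-row setups above so it runs quickly, and the function names are my own, not from the original answers):

```python
import timeit

import numpy as np
import pandas as pd

# Scaled-down version of the "large df, many groups" setup above.
df = pd.DataFrame(dict(GROUP=np.arange(100_000) % 1_000,
                       VALUE=np.random.rand(100_000)))

def ptp_agg():
    # Apply np.ptp to each group's values.
    return df.groupby('GROUP')['VALUE'].agg(np.ptp)

def max_min():
    # Built-in aggregators, then subtract.
    g = df.groupby('GROUP')['VALUE'].agg(['max', 'min'])
    return g['max'] - g['min']

for f in (ptp_agg, max_min):
    print(f.__name__, timeit.timeit(f, number=3))
```

Both approaches return the same per-group ranges; the timing loop just shows which is faster on your pandas version.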
groupby/agg generally performs best when you take advantage of the built-in aggregators such as 'max' and 'min'. So to obtain the difference, first compute the max and the min, and then subtract:
import pandas as pd
df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})
result = df.groupby('GROUP')['VALUE'].agg(['max','min'])
result['diff'] = result['max'] - result['min']
print(result[['diff']])
yields
diff
GROUP
1 5
2 18
Note: this will get the job done, but @piRSquared's answer has faster methods.
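As a side note, on pandas >= 0.25 the same max/min-and-subtract idea can be written with named aggregation, which avoids the generic 'max'/'min' column names; a small sketch (the vmax/vmin labels are my own choice):

```python
import pandas as pd

df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})

# Named aggregation: keyword names become the output column names.
result = df.groupby('GROUP')['VALUE'].agg(vmax='max', vmin='min')
result['DIFF'] = result['vmax'] - result['vmin']
print(result['DIFF'])  # group 1 -> 5, group 2 -> 18
```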
You can use groupby(), min(), and max():
df.groupby('GROUP')['VALUE'].apply(lambda g: g.max() - g.min())
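Applied to the example frame from the question, this one-liner produces the same result as the agg-based approaches, though (as noted above) the Python-level lambda tends to be slower when there are many groups:

```python
import pandas as pd

df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})

# For each group, compute max - min inside the lambda.
diff = df.groupby('GROUP')['VALUE'].apply(lambda g: g.max() - g.min())
print(diff)  # group 1 -> 5, group 2 -> 18
```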