Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas: Difference between largest and smallest value within group

Given a data frame that looks like this

GROUP VALUE
  1     5
  2     2
  1     10
  2     20
  1     7

I would like to compute the difference between the largest and smallest value within each group. That is, the result should be

GROUP   DIFF
  1      5
  2      18

What is an easy way to do this in Pandas?

What is a fast way to do this in Pandas for a data frame with about 2 million rows and 1 million groups?

like image 489
David Avatar asked Oct 21 '16 19:10

David


People also ask

How do you get the maximum values of each group in a pandas?

To get the maximum value of each group, you can directly apply the pandas max() function to the selected column(s) from the result of pandas groupby.

Can pandas series hold different data types?

@KorayTugay, Yes.

How do you use greater than or equal to in pandas?

Pandas DataFrame: ge() functionThe ge() function returns greater than or equal to of dataframe and other, element-wise. Equivalent to ==, =!, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison. Any single or multiple element data structure, or list-like object.

What is difference between pop and drop in pandas?

drop() will return the DataFrame with the column removed. Whereas . pop() will return the column.

How to take the N largest values by group in pandas Dataframe?

#display two largest values by group df.groupby('group_var') ['values_var'].nlargest(2) And you can use the following syntax to perform some operation (like taking the sum) on the n largest values by group in a pandas DataFrame: #find sum of two largest values by group df.groupby('group_var') ['values_var'].apply(lambda grp: grp.nlargest(2).sum())

How to find the max value by group in a Dataframe?

Often you may be interested in finding the max value by group in a pandas DataFrame. Fortunately this is easy to do using the groupby () and max () functions with the following syntax: df.groupby('column_name').max()

How to find sum of two largest values by group in Python?

#find sum of two largest values by group df.groupby('group_var') ['values_var'].apply(lambda grp: grp.nlargest(2).sum()) The following examples shows how to use each method in practice with the following pandas DataFrame:

How does groupby work in pandas?

Similar to the SQL GROUP BY statement, the Pandas method works by splitting our data, aggregating it in a given way (or ways), and re-combining the data in a meaningful way. Because the .groupby () method works by first splitting the data, we can actually work with the groups directly.


3 Answers

Using @unutbu 's df

per timing
unutbu's solution is best over large data sets

import pandas as pd
import numpy as np

df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})

df.groupby('GROUP')['VALUE'].agg(np.ptp)

GROUP
1     5
2    18
Name: VALUE, dtype: int64

np.ptp docs returns the range of an array


timing
small df

enter image description here

large df
df = pd.DataFrame(dict(GROUP=np.arange(1000000) % 100, VALUE=np.random.rand(1000000)))

enter image description here

large df
many groups
df = pd.DataFrame(dict(GROUP=np.arange(1000000) % 10000, VALUE=np.random.rand(1000000)))

enter image description here

like image 152
piRSquared Avatar answered Oct 21 '22 12:10

piRSquared


groupby/agg generally performs best when you take advantage of the built-in aggregators such as 'max' and 'min'. So to obtain the difference, first compute the max and min and then subtract:

import pandas as pd
df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})
result = df.groupby('GROUP')['VALUE'].agg(['max','min'])
result['diff'] = result['max']-result['min']
print(result[['diff']])

yields

       diff
GROUP      
1         5
2        18
like image 26
unutbu Avatar answered Oct 21 '22 14:10

unutbu


Note: this will get the job done, but @piRSquared's answer has faster methods.

You can use groupby(), min(), and max():

df.groupby('GROUP')['VALUE'].apply(lambda g: g.max() - g.min())
like image 15
ASGM Avatar answered Oct 21 '22 12:10

ASGM