Given a data frame that looks like this:
GROUP VALUE
1 5
2 2
1 10
2 20
1 7
I would like to compute the difference between the largest and smallest value within each group. That is, the result should be
GROUP DIFF
1 5
2 18
What is an easy way to do this in Pandas?
What is a fast way to do this in Pandas for a data frame with about 2 million rows and 1 million groups?
Using @unutbu's df. Per the timings below, unutbu's solution is best over large data sets.
import pandas as pd
import numpy as np
df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})
df.groupby('GROUP')['VALUE'].agg(np.ptp)
GROUP
1 5
2 18
Name: VALUE, dtype: int64
np.ptp (see the docs) returns the range of an array, i.e. its maximum minus its minimum.
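As a quick sanity check, np.ptp ("peak to peak") applied to group 1's values from the question gives exactly the expected difference:

```python
import numpy as np

# Group 1's VALUEs from the question are 5, 10 and 7.
# np.ptp returns max - min of the array: 10 - 5 = 5.
print(np.ptp([5, 10, 7]))  # -> 5
```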
timing

small df
df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})

large df
df = pd.DataFrame(dict(GROUP=np.arange(1000000) % 100, VALUE=np.random.rand(1000000)))

large df, many groups
df = pd.DataFrame(dict(GROUP=np.arange(1000000) % 10000, VALUE=np.random.rand(1000000)))
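The timing comparisons above can be reproduced with the stdlib timeit module; here is a minimal sketch (sizes scaled down from the 1M-row setups above so it runs quickly, and the function names are my own, not from the original answers):

```python
import timeit

import numpy as np
import pandas as pd

# Scaled-down version of the "large df, many groups" setup above.
df = pd.DataFrame(dict(GROUP=np.arange(100_000) % 1_000,
                       VALUE=np.random.rand(100_000)))

def ptp_agg():
    # Apply np.ptp to each group's values.
    return df.groupby('GROUP')['VALUE'].agg(np.ptp)

def max_min():
    # Built-in aggregators, then subtract.
    g = df.groupby('GROUP')['VALUE'].agg(['max', 'min'])
    return g['max'] - g['min']

for f in (ptp_agg, max_min):
    print(f.__name__, timeit.timeit(f, number=3))
```

Both approaches return the same per-group ranges; the timing loop just shows which is faster on your pandas version.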
groupby/agg generally performs best when you take advantage of the built-in aggregators such as 'max' and 'min'. So to obtain the difference, first compute the max and the min, and then subtract:
import pandas as pd
df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})
result = df.groupby('GROUP')['VALUE'].agg(['max','min'])
result['diff'] = result['max'] - result['min']
print(result[['diff']])
yields
diff
GROUP
1 5
2 18
Note: this will get the job done, but @piRSquared's answer has faster methods.
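As a side note, on pandas >= 0.25 the same max/min-and-subtract idea can be written with named aggregation, which avoids the generic 'max'/'min' column names; a small sketch (the vmax/vmin labels are my own choice):

```python
import pandas as pd

df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})

# Named aggregation: keyword names become the output column names.
result = df.groupby('GROUP')['VALUE'].agg(vmax='max', vmin='min')
result['DIFF'] = result['vmax'] - result['vmin']
print(result['DIFF'])  # group 1 -> 5, group 2 -> 18
```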
You can use groupby(), min(), and max():
df.groupby('GROUP')['VALUE'].apply(lambda g: g.max() - g.min())
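Applied to the example frame from the question, this one-liner produces the same result as the agg-based approaches, though (as noted above) the Python-level lambda tends to be slower when there are many groups:

```python
import pandas as pd

df = pd.DataFrame({'GROUP': [1, 2, 1, 2, 1], 'VALUE': [5, 2, 10, 20, 7]})

# For each group, compute max - min inside the lambda.
diff = df.groupby('GROUP')['VALUE'].apply(lambda g: g.max() - g.min())
print(diff)  # group 1 -> 5, group 2 -> 18
```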