Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Pandas Dataframe groupby aggregate functions and difference between max and min of a column on the fly

import pandas as pd

df = {'a': ['xxx', 'xxx','xxx','yyy','yyy','yyy'], 'start': [10000, 10500, 11000, 12000, 13000, 14000] }
df = pd.DataFrame(data=df)


df_new = df.groupby("a",as_index=True).agg(
            ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
            StartMin=pd.NamedAgg(column='start', aggfunc="min"),
            StartMax=pd.NamedAgg(column='start', aggfunc="max"),
            )

gives

>>>df_new
     ProcessiveGroupLength  StartMin  StartMax
a
xxx                      3     10000     11000
yyy                      3     12000     14000

How to get below on the fly, since I think on the fly it will be faster.

>>>df_new
     ProcessiveGroupLength    Diff
a
xxx                      3      1000
yyy                      3      2000

Below code gives the following error message:

Traceback (most recent call last): File "", line 5, in TypeError: unsupported operand type(s) for -: 'str' and 'str'

df_new = df.groupby("a").agg(
            ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),                
            Diff=pd.NamedAgg(column='start', aggfunc="max"-"min"),)
like image 503
burcak Avatar asked Sep 17 '20 04:09

burcak


People also ask

How do you use Groupby and Max in pandas?

Pandas Groupby Maximum The following is a step-by-step guide of what you need to do. Group the dataframe on the column(s) you want. Select the field(s) for which you want to estimate the maximum. Apply the pandas max() function directly or pass 'max' to the agg() function.

What does Groupby Max do?

max. Compute max of group values. Include only float, int, boolean columns.

Can you use Groupby with multiple columns in pandas?

How to groupby multiple columns in pandas DataFrame and compute multiple aggregations? groupby() can take the list of columns to group by multiple columns and use the aggregate functions to apply single or multiple aggregations at the same time.


1 Answers

Your solution should be changed by lambda function, but I think if many groups or/and large DataFrame this should be slowier like first solution.

Reason is optimalized functions max and min and also vectorized subtraction of Series. In another words if not used lambda functions aggregations is faster.

df_new = df.groupby("a").agg(
            ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
            Diff=pd.NamedAgg(column='start', aggfunc=lambda x: x.max() - x.min()),)

Or yu can use numpy.ptp:

df_new = df.groupby("a").agg(
            ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
            Diff=pd.NamedAgg(column='start', aggfunc=np.ptp),)

print (df_new)
     ProcessiveGroupLength  Diff
a                               
xxx                      3  1000
yyy                      3  2000

Performance: Depends of data, here is used 1k groups in 1M rows:

np.random.seed(20)

N = 1000000
df = pd.DataFrame({'a': np.random.randint(1000, size=N),
                   'start':np.random.randint(10000, size=N)})
print (df)

In [229]: %%timeit
     ...: df_new = df.groupby("a",as_index=True).agg(
     ...:             ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
     ...:             StartMin=pd.NamedAgg(column='start', aggfunc="min"),
     ...:             StartMax=pd.NamedAgg(column='start', aggfunc="max"),
     ...:             ).assign(Diff = lambda x: x.pop('StartMax') - x.pop('StartMin'))
     ...:             
69 ms ± 728 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [230]: %%timeit
     ...: df_new = df.groupby("a").agg(
     ...:             ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
     ...:             Diff=pd.NamedAgg(column='start', aggfunc=lambda x: x.max() - x.min()),)
     ...:             
172 ms ± 1.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [231]: %%timeit
     ...: df_new = df.groupby("a").agg(
     ...:             ProcessiveGroupLength=pd.NamedAgg(column='start', aggfunc="count"),
     ...:             Diff=pd.NamedAgg(column='start', aggfunc=np.ptp),)
     ...:             
171 ms ± 3.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
like image 162
jezrael Avatar answered Nov 24 '22 18:11

jezrael