This is a follow-up question to this one, where jezrael used pandas.DataFrame.groupby to speed up a list creation by a factor of several hundred. Specifically, let df
be a large dataframe; then
index = list(set(df.index))
list_df = [df.loc[x] for x in index]
and
list_df = [x for i,x in df.groupby(level=0, sort=False)]
produce the same result, with the latter being more than 200 times faster than the former, even ignoring the list creation step. Why?
I would be very glad if someone could help me understand why there is such a massive performance difference. Thanks in advance!
Edit: as suggested by Alex Riley in his comment, I confirm that the tests have been run on a dataframe with a non-unique and non-monotonic index.
Because your data frame is not sorted on the index, every subset has to be found with a slow vector scan, and fast algorithms like binary search cannot be applied. groupby, on the other hand, always sorts the data frame by the grouping variable first. You can mimic this behavior by writing a simple algorithm that sorts the index and then subsets, to validate this:
import numpy as np

def sort_subset(df):
    # sort the index, then find the positions where the index value changes:
    # these positions separate the groups
    df = df.sort_index()
    split_indices = np.flatnonzero(np.ediff1d(df.index, to_begin=1, to_end=1))
    list_df = []
    for i in range(len(split_indices) - 1):
        start_index = split_indices[i]
        end_index = split_indices[i + 1]
        list_df.append(df.iloc[start_index:end_index])
    return list_df
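The same boundary-finding step can also be done with binary search directly, which makes the link to the sorted-index fast path explicit. Here is a minimal sketch of that idea using Index.searchsorted (the function name sort_subset_searchsorted is mine, not from the original answer):

def sort_subset_searchsorted(df):
    # sort the frame so each group occupies one contiguous block of rows
    df = df.sort_index()
    keys = df.index.unique()  # unique keys, already in sorted order
    # binary-search the first and one-past-last row of each key's block
    starts = df.index.searchsorted(keys, side='left')
    ends = df.index.searchsorted(keys, side='right')
    return [df.iloc[s:e] for s, e in zip(starts, ends)]

Each searchsorted lookup costs O(log n) per key, which is exactly the kind of access an unsorted index rules out.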
Some timing:
import pandas as pd
import numpy as np
nrow = 1000000
df = pd.DataFrame(np.random.randn(nrow), columns=['x'], index=np.random.randint(100, size=nrow))
index = list(set(df.index))
print('no of groups: ', len(index))
# no of groups: 100

%timeit list_df_1 = [df.loc[x] for x in index]
# 13.6 s ± 228 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit list_df_2 = [x for i, x in df.groupby(level=0, sort=False)]
# 54.8 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Not as fast because my algorithm is not optimized at all, but the same order of magnitude
%timeit list_df_3 = sort_subset(df)
# 102 ms ± 3.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
list_df_1 = [df.loc[x] for x in index]
list_df_2 = [x for i, x in df.groupby(level=0, sort=False)]
list_df_3 = sort_subset(df)
Compare the result:
all(list_df_3[i].eq(list_df_2[i]).all().iat[0] for i in range(len(list_df_2)))
# True
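If you want a check that does not depend on the order in which the groups (or the rows inside each group) come back, a small order-independent comparison along these lines also works (this sketch is mine and assumes the single column 'x' from the setup above):

import numpy as np

# map each group's key (all rows in a group share one index value)
# to its sorted column values, then compare key by key
d2 = {f.index[0]: np.sort(f['x'].to_numpy()) for f in list_df_2}
d3 = {f.index[0]: np.sort(f['x'].to_numpy()) for f in list_df_3}
print(d2.keys() == d3.keys() and all(np.array_equal(d2[k], d3[k]) for k in d2))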
You see a significant speed-up if you sort the index before subsetting with .loc as well:
def sort_subset_with_loc(df):
    # note: reuses the `index` list of distinct keys defined above
    df = df.sort_index()
    list_df_1 = [df.loc[x] for x in index]
    return list_df_1
%timeit sort_subset_with_loc(df)
# 25.4 ms ± 897 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
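The fast path that sorting unlocks is tied to the index being monotonic, which you can verify directly. A quick illustration, assuming the df from the timing section above:

# a non-monotonic, non-unique index forces .loc into a slow boolean-mask scan;
# once the index is monotonic, .loc can locate each group by binary search
print(df.index.is_monotonic_increasing)               # False: random integer index
print(df.sort_index().index.is_monotonic_increasing)  # True after sorting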