Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is groupby so fast?

This is a follow up question to this one, where jezrael used pandas.DataFrame.groupby to increment by a factor of some hundreds the speed of a list creation. Specifically, let df be a large dataframe, then

index = list(set(df.index))
list_df = [df.loc(x) for x in index]

and

list_df = [x for i,x in df.groupby(level=0, sort=False)]

produce the same result, with the latter being more than 200 times faster than the former, even ignoring the list creation step. Why?

I would be very glad if someone could let me understand why there is such a massive performance difference. Thanks in advance!

Edit: as suggested by Alex Riley in his comment, I confirm that the tests have been run on a dataframe with non-unique and non-monotonic index.

like image 669
Giovanni De Gaetano Avatar asked Oct 10 '17 15:10

Giovanni De Gaetano


People also ask

Is GROUP BY faster?

GROUP BY is slightly faster than SELECT DISTINCT The slower the drive, the bigger the difference. This may not be a universal rule (we tested simple cases only) The difference is hardly noticeable to users.

Why is Pandas GROUP BY fast?

transform with user-defined functions, Pandas is much faster with common functions like mean and sum because they are implemented in Cython. The speed differences are not small. The current version of Groupby can handle multi-dimensional outputs. It also checks if data is already sorted and runs faster if it is.

Is GROUP BY a panda?

Pandas groupby is used for grouping the data according to the categories and apply a function to the categories. It also helps to aggregate data efficiently. Pandas dataframe. groupby() function is used to split the data into groups based on some criteria.

Why is GROUP BY used in data analysis?

The groupby is one of the most frequently used Pandas functions in data analysis. It is used for grouping the data points (i.e. rows) based on the distinct values in the given column or columns. We can then calculate aggregated values for the generated groups.


1 Answers

Because your data frame is not sorted on the index, which means all the subsetting has to be done with slow vector scan and fast algorithm like binary search can not be applied; While groupby always sort the data frame by the group by variable first, you can mimic this behavior by writing a simple algorithm that sort the index and then subset to validate this:

def sort_subset(df):
    # sort index and find out the positions that separate groups
    df = df.sort_index()
    split_indices = np.flatnonzero(np.ediff1d(df.index, to_begin=1, to_end=1))
    list_df = []
    for i in range(len(split_indices)-1):
        start_index = split_indices[i]
        end_index = split_indices[i+1]
        list_df.append(df.iloc[start_index:end_index])
    return list_df

Some timing:

import pandas as pd
import numpy as np
​
nrow = 1000000
df = pd.DataFrame(np.random.randn(nrow), columns=['x'], index=np.random.randint(100, size=nrow))

index = list(set(df.index))
print('no of groups: ', len(index))
​
%timeit list_df_1 = [df.loc[x] for x in index]
#no of groups:  100
#13.6 s ± 228 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit list_df_2 = [x for i, x in df.groupby(level=0, sort=False)]
#54.8 ms ± 1.36 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# Not as fast because my algorithm is not optimized at all but the same order of magnitude
%timeit list_df_3 = sort_subset(df)
#102 ms ± 3.53 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

list_df_1 = [df.loc[x] for x in index]
list_df_2 = [x for i, x in df.groupby(level=0, sort=False)]
list_df_3 = sort_subset(df)

Compare the result:

all(list_df_3[i].eq(list_df_2[i]).all().iat[0] for i in range(len(list_df_2)))
# True

You see a significant speed up if you sort the index before subsetting as well:

def sort_subset_with_loc(df):
    df = df.sort_index()
    list_df_1 = [df.loc[x] for x in index]
    return list_df_1

%timeit sort_subset_with_loc(df)
# 25.4 ms ± 897 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
like image 134
Psidom Avatar answered Nov 09 '22 23:11

Psidom