What does the group_keys argument to pandas.groupby actually do?

Tags: python, pandas

In pandas.DataFrame.groupby, there is an argument group_keys, which I gather is supposed to do something relating to how group keys are included in the dataframe subsets. According to the documentation:

group_keys : boolean, default True

When calling apply, add group keys to index to identify pieces

However, I can't really find any examples where group_keys makes an actual difference:

    import pandas as pd

    df = pd.DataFrame([[0, 1, 3],
                       [3, 1, 1],
                       [3, 0, 0],
                       [2, 3, 3],
                       [2, 1, 0]], columns=list('xyz'))

    gby = df.groupby('x')
    gby_k = df.groupby('x', group_keys=False)

It doesn't make a difference in the output of apply:

    ap = gby.apply(pd.DataFrame.sum)
    #    x  y  z
    # x
    # 0  0  1  3
    # 2  4  4  3
    # 3  6  1  1

    ap_k = gby_k.apply(pd.DataFrame.sum)
    #    x  y  z
    # x
    # 0  0  1  3
    # 2  4  4  3
    # 3  6  1  1

And even if you print out the grouped subsets as you go, the results are still identical:

    def printer_func(x):
        print(x)
        return x

    print('gby')
    print('--------------')
    gby.apply(printer_func)
    print('--------------')

    print('gby_k')
    print('--------------')
    gby_k.apply(printer_func)
    print('--------------')

    # gby
    # --------------
    #    x  y  z
    # 0  0  1  3
    #    x  y  z
    # 0  0  1  3
    #    x  y  z
    # 3  2  3  3
    # 4  2  1  0
    #    x  y  z
    # 1  3  1  1
    # 2  3  0  0
    # --------------
    # gby_k
    # --------------
    #    x  y  z
    # 0  0  1  3
    #    x  y  z
    # 0  0  1  3
    #    x  y  z
    # 3  2  3  3
    # 4  2  1  0
    #    x  y  z
    # 1  3  1  1
    # 2  3  0  0
    # --------------

I considered the possibility that the default argument is actually True, but switching group_keys to explicitly False doesn't make a difference either. What exactly is this argument for?

(Run on pandas version 0.18.1)

Edit: I did find a way where group_keys changes behavior, based on this answer:

    import pandas as pd
    import numpy as np

    row_idx = pd.MultiIndex.from_product(((0, 1), (2, 3, 4)))
    d = pd.DataFrame([[4, 3], [1, 3], [1, 1], [2, 4], [0, 1], [4, 2]], index=row_idx)

    df_n = d.groupby(level=0).apply(lambda x: x.nlargest(2, [0]))
    #        0  1
    # 0 0 2  4  3
    #     3  1  3
    # 1 1 4  4  2
    #     2  2  4

    df_k = d.groupby(level=0, group_keys=False).apply(lambda x: x.nlargest(2, [0]))
    #      0  1
    # 0 2  4  3
    #   3  1  3
    # 1 4  4  2
    #   2  2  4

However, I'm still not clear on the intelligible principle behind what group_keys is supposed to do. This behavior does not seem intuitive based on @piRSquared's answer.
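One possible way to read the Edit example above is that group_keys=True prepends the group key as an extra, outermost index level, and that dropping that level gives back the group_keys=False index. The sketch below checks this on the same data; the helper name top2 is introduced here purely for illustration and is not part of the original question.

```python
import pandas as pd

row_idx = pd.MultiIndex.from_product(((0, 1), (2, 3, 4)))
d = pd.DataFrame([[4, 3], [1, 3], [1, 1], [2, 4], [0, 1], [4, 2]], index=row_idx)

def top2(group):
    # Hypothetical helper: keep the two rows with the largest values in column 0.
    return group.nlargest(2, [0])

df_n = d.groupby(level=0).apply(top2)                    # group_keys defaults to True
df_k = d.groupby(level=0, group_keys=False).apply(top2)

# Compare the depth of the two resulting indexes.
print(df_n.index.nlevels, df_k.index.nlevels)

# Dropping the prepended key level from df_n should leave the same labels as df_k.
print(df_n.index.droplevel(0).tolist() == df_k.index.tolist())
```

Comparing index.nlevels is often easier than comparing the printed frames by eye, because the extra level can be hard to spot in the repr.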

Paul asked Aug 09 '16 17:08



2 Answers

The group_keys parameter of groupby comes in handy during apply operations. With group_keys=True, the result gets an additional index level holding the group keys; with group_keys=False, that level is left out. The difference shows up in particular when the applied function operates on individual columns.

One such instance:

    In [21]: gby = df.groupby('x', group_keys=True).apply(lambda row: row['x'])

    In [22]: gby
    Out[22]:
    x
    0  0    0
    2  3    2
       4    2
    3  1    3
       2    3
    Name: x, dtype: int64

    In [23]: gby_k = df.groupby('x', group_keys=False).apply(lambda row: row['x'])

    In [24]: gby_k
    Out[24]:
    0    0
    3    2
    4    2
    1    3
    2    3
    Name: x, dtype: int64

One of its intended applications could be grouping by one of the levels of the hierarchy afterwards, since the result of apply is turned into an object with a MultiIndex.

    In [27]: gby.groupby(level='x').sum()
    Out[27]:
    x
    0    0
    2    4
    3    6
    Name: x, dtype: int64

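As a rough mental model (an assumption for illustration, not a description of how pandas is implemented internally), the two settings behave much like concatenating the per-group results with and without keys. The sketch below rebuilds both results by hand with pd.concat; the names keys, pieces, with_keys, without_keys and totals are made up for this example.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 3],
                   [3, 1, 1],
                   [3, 0, 0],
                   [2, 3, 3],
                   [2, 1, 0]], columns=list('xyz'))

# Split step: collect one piece per group; here each piece is the group's 'x' column.
keys, pieces = [], []
for key, group in df.groupby('x'):
    keys.append(key)
    pieces.append(group['x'])

# Roughly what group_keys=True does: label every piece with its group key.
with_keys = pd.concat(pieces, keys=keys, names=['x'])

# Roughly what group_keys=False does: glue the pieces together unchanged.
without_keys = pd.concat(pieces)

# The extra level makes it possible to regroup the combined result by key.
totals = with_keys.groupby(level='x').sum()

print(with_keys.index.tolist())
print(without_keys.index.tolist())
print(totals.to_dict())
```

Seen this way, group_keys only affects the combine step: the function passed to apply receives the same pieces either way.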

Nickil Maveli answered Sep 26 '22 12:09


If you are passing a function that preserves an index, pandas tries to keep that information. But if you pass a function that removes all semblance of index information, group_keys=True allows you to keep that information.

Use this function instead:

    f = lambda df: df.reset_index(drop=True)

Then compare the two groupby objects:

    gby.apply(lambda df: df.reset_index(drop=True))

    gby_k.apply(lambda df: df.reset_index(drop=True))
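Since the output images are not available here, the following is a minimal sketch of the same comparison that prints only the resulting indexes. It selects the value columns explicitly so that the grouping column is not passed to the function, which is a deviation from the calls above made for robustness across pandas versions. The names drop_index, cols, kept, lost and same are illustrative.

```python
import pandas as pd

df = pd.DataFrame([[0, 1, 3],
                   [3, 1, 1],
                   [3, 0, 0],
                   [2, 3, 3],
                   [2, 1, 0]], columns=list('xyz'))

def drop_index(group):
    # Discards the original row labels, so the result no longer says where rows came from.
    return group.reset_index(drop=True)

cols = ['y', 'z']  # value columns only; 'x' is used just for grouping

# group_keys=True: the key is prepended, so every row can still be traced to its group.
kept = df.groupby('x', group_keys=True)[cols].apply(drop_index)

# group_keys=False: only the reset positions remain, and they repeat across groups.
lost = df.groupby('x', group_keys=False)[cols].apply(drop_index)

# An index-preserving function with group_keys=False keeps the original row labels.
same = df.groupby('x', group_keys=False)[cols].apply(lambda g: g)

print(kept.index.tolist())
print(lost.index.tolist())
print(same.index.tolist())
```

With an index-preserving function and group_keys=True, the outcome appears to depend on the pandas version: older releases such as the 0.18.1 used in the question leave the index alone, while, as far as I know, pandas 2.0 and later prepend the keys in that case too.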

piRSquared answered Sep 25 '22 12:09