In pandas.DataFrame.groupby, there is an argument group_keys, which I gather is supposed to do something relating to how group keys are included in the dataframe subsets. According to the documentation:
group_keys : boolean, default True
When calling apply, add group keys to index to identify pieces
However, I can't really find any examples where group_keys makes an actual difference:
    import pandas as pd

    df = pd.DataFrame([[0, 1, 3],
                       [3, 1, 1],
                       [3, 0, 0],
                       [2, 3, 3],
                       [2, 1, 0]], columns=list('xyz'))

    gby = df.groupby('x')
    gby_k = df.groupby('x', group_keys=False)
It doesn't make a difference in the output of apply:
    ap = gby.apply(pd.DataFrame.sum)
    #    x  y  z
    # x
    # 0  0  1  3
    # 2  4  4  3
    # 3  6  1  1

    ap_k = gby_k.apply(pd.DataFrame.sum)
    #    x  y  z
    # x
    # 0  0  1  3
    # 2  4  4  3
    # 3  6  1  1
And even if you print out the grouped subsets as you go, the results are still identical:
    def printer_func(x):
        print(x)
        return x

    print('gby')
    print('--------------')
    gby.apply(printer_func)
    print('--------------')
    print('gby_k')
    print('--------------')
    gby_k.apply(printer_func)
    print('--------------')

    # gby
    # --------------
    #    x  y  z
    # 0  0  1  3
    #    x  y  z
    # 0  0  1  3
    #    x  y  z
    # 3  2  3  3
    # 4  2  1  0
    #    x  y  z
    # 1  3  1  1
    # 2  3  0  0
    # --------------
    # gby_k
    # --------------
    #    x  y  z
    # 0  0  1  3
    #    x  y  z
    # 0  0  1  3
    #    x  y  z
    # 3  2  3  3
    # 4  2  1  0
    #    x  y  z
    # 1  3  1  1
    # 2  3  0  0
    # --------------
I considered the possibility that the default argument is actually True, but switching group_keys to explicitly False doesn't make a difference either. What exactly is this argument for?
(Run on pandas version 0.18.1)
Edit: I did find a way where group_keys changes behavior, based on this answer:
    import pandas as pd
    import numpy as np

    row_idx = pd.MultiIndex.from_product(((0, 1), (2, 3, 4)))
    d = pd.DataFrame([[4, 3], [1, 3], [1, 1], [2, 4], [0, 1], [4, 2]], index=row_idx)

    df_n = d.groupby(level=0).apply(lambda x: x.nlargest(2, [0]))
    #        0  1
    # 0 0 2  4  3
    #     3  1  3
    # 1 1 4  4  2
    #     2  2  4

    df_k = d.groupby(level=0, group_keys=False).apply(lambda x: x.nlargest(2, [0]))
    #      0  1
    # 0 2  4  3
    #   3  1  3
    # 1 4  4  2
    #   2  2  4
However, I'm still not clear on the intelligible principle behind what group_keys is supposed to do. This behavior does not seem intuitive based on @piRSquared's answer.
The group_keys parameter in groupby comes in handy during apply operations. With group_keys=True, the result gets an additional index level corresponding to the grouped columns; with group_keys=False, that level is left out. The difference shows up especially when performing operations on individual columns.
One such instance:
    In [21]: gby = df.groupby('x',group_keys=True).apply(lambda row: row['x'])

    In [22]: gby
    Out[22]:
    x
    0  0    0
    2  3    2
       4    2
    3  1    3
       2    3
    Name: x, dtype: int64

    In [23]: gby_k = df.groupby('x', group_keys=False).apply(lambda row: row['x'])

    In [24]: gby_k
    Out[24]:
    0    0
    3    2
    4    2
    1    3
    2    3
    Name: x, dtype: int64
One of its intended applications could be to group by one of the levels of the hierarchy, since group_keys=True turns the result into a MultiIndex object.
    In [27]: gby.groupby(level='x').sum()
    Out[27]:
    x
    0    0
    2    4
    3    6
    Name: x, dtype: int64
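The regrouping idea above can be sketched in a version-tolerant way. This is only an illustration under stated assumptions: the per-group filter on y is hypothetical, and the column y is selected before apply so that the applied function never touches the grouping column (newer pandas releases warn about or change that case).

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 3], [3, 1, 1], [3, 0, 0], [2, 3, 3], [2, 1, 0]],
    columns=list('xyz'),
)

# Hypothetical per-group filter: keep only the positive y values.
# It drops a row in the x == 3 group, so the result is not indexed
# like the input.
def keep_positive(s):
    return s[s > 0]

# group_keys=True prepends the group key as an index level named 'x'.
kept = df.groupby('x', group_keys=True)['y'].apply(keep_positive)

# group_keys=False keeps only the original row labels.
kept_k = df.groupby('x', group_keys=False)['y'].apply(keep_positive)

print(kept.index.names)    # first level is 'x'
print(kept_k.index.names)  # no 'x' level to regroup on

# Because the key survives as a level, the result can be regrouped.
per_group = kept.groupby(level='x').sum()
print(per_group)
```

With group_keys=False the same groupby(level='x') call has no level named x to refer to, which is the practical difference here.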
If you are passing a function that preserves an index, pandas tries to keep that information. But if you pass a function that removes all semblance of index information, group_keys=True allows you to keep that information.
Use this instead:

    f = lambda df: df.reset_index(drop=True)

Then the two groupby objects give different results:

    gby.apply(lambda df: df.reset_index(drop=True))
    gby_k.apply(lambda df: df.reset_index(drop=True))
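As a rough sketch of what that comparison shows, the snippet below inspects the index of each result instead of printing the frames. Selecting the y and z columns before apply is an assumption added here so the example behaves the same on recent pandas versions; drop_index is just a named version of the lambda above.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 3], [3, 1, 1], [3, 0, 0], [2, 3, 3], [2, 1, 0]],
    columns=list('xyz'),
)

def drop_index(group):
    # Throw away the original row labels inside each group.
    return group.reset_index(drop=True)

with_keys = df.groupby('x', group_keys=True)[['y', 'z']].apply(drop_index)
without_keys = df.groupby('x', group_keys=False)[['y', 'z']].apply(drop_index)

# With group_keys=True the group key is kept as the outer index level,
# so each row can still be traced back to its group.
print(with_keys.index.tolist())

# With group_keys=False only the per-group positions remain, and they
# repeat across groups.
print(without_keys.index.tolist())
```

In the second result nothing in the index says which group a row came from, which is the information group_keys=True preserves.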