What does the group_keys argument to pandas.groupby actually do?

Tags: python, pandas

In pandas.DataFrame.groupby, there is an argument group_keys, which I gather is supposed to do something relating to how group keys are included in the dataframe subsets. According to the documentation:

group_keys : boolean, default True

When calling apply, add group keys to index to identify pieces

However, I can't really find any examples where group_keys makes an actual difference:

    import pandas as pd

    df = pd.DataFrame([[0, 1, 3],
                       [3, 1, 1],
                       [3, 0, 0],
                       [2, 3, 3],
                       [2, 1, 0]], columns=list('xyz'))

    gby = df.groupby('x')
    gby_k = df.groupby('x', group_keys=False)

It doesn't make a difference in the output of apply:

    ap = gby.apply(pd.DataFrame.sum)
    #    x  y  z
    # x
    # 0  0  1  3
    # 2  4  4  3
    # 3  6  1  1

    ap_k = gby_k.apply(pd.DataFrame.sum)
    #    x  y  z
    # x
    # 0  0  1  3
    # 2  4  4  3
    # 3  6  1  1

And even if you print out the grouped subsets as you go, the results are still identical:

    def printer_func(x):
        print(x)
        return x

    print('gby')
    print('--------------')
    gby.apply(printer_func)
    print('--------------')

    print('gby_k')
    print('--------------')
    gby_k.apply(printer_func)
    print('--------------')

    # gby
    # --------------
    #    x  y  z
    # 0  0  1  3
    #    x  y  z
    # 0  0  1  3
    #    x  y  z
    # 3  2  3  3
    # 4  2  1  0
    #    x  y  z
    # 1  3  1  1
    # 2  3  0  0
    # --------------
    # gby_k
    # --------------
    #    x  y  z
    # 0  0  1  3
    #    x  y  z
    # 0  0  1  3
    #    x  y  z
    # 3  2  3  3
    # 4  2  1  0
    #    x  y  z
    # 1  3  1  1
    # 2  3  0  0
    # --------------

I considered the possibility that the default argument is actually True, but switching group_keys to explicitly False doesn't make a difference either. What exactly is this argument for?

(Run on pandas version 0.18.1)

Edit: I did find a way where group_keys changes behavior, based on this answer:

    import pandas as pd
    import numpy as np

    row_idx = pd.MultiIndex.from_product(((0, 1), (2, 3, 4)))
    d = pd.DataFrame([[4, 3], [1, 3], [1, 1], [2, 4], [0, 1], [4, 2]], index=row_idx)

    df_n = d.groupby(level=0).apply(lambda x: x.nlargest(2, [0]))
    #        0  1
    # 0 0 2  4  3
    #     3  1  3
    # 1 1 4  4  2
    #     2  2  4

    df_k = d.groupby(level=0, group_keys=False).apply(lambda x: x.nlargest(2, [0]))
    #      0  1
    # 0 2  4  3
    #   3  1  3
    # 1 4  4  2
    #   2  2  4

However, I'm still not clear on the intelligible principle behind what group_keys is supposed to do. This behavior does not seem intuitive based on @piRSquared's answer.

633

asked Aug 09 '16 17:08

Paul

2 Answers

The group_keys parameter of groupby comes in handy during apply operations. With group_keys=True, apply adds an extra index level holding the group key for each piece; with group_keys=False, that level is left out. The difference shows up in particular when the applied function operates on individual columns.

One such instance:

    In [21]: gby = df.groupby('x', group_keys=True).apply(lambda row: row['x'])

    In [22]: gby
    Out[22]:
    x
    0  0    0
    2  3    2
       4    2
    3  1    3
       2    3
    Name: x, dtype: int64

    In [23]: gby_k = df.groupby('x', group_keys=False).apply(lambda row: row['x'])

    In [24]: gby_k
    Out[24]:
    0    0
    3    2
    4    2
    1    3
    2    3
    Name: x, dtype: int64

One of its intended applications could be grouping by one of the levels of the hierarchy, since group_keys=True turns the result into an object with a MultiIndex.

    In [27]: gby.groupby(level='x').sum()
    Out[27]:
    x
    0    0
    2    4
    3    6
    Name: x, dtype: int64
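The regrouping idea can also be sketched in a self-contained form. This is only an illustration, not the answerer's code: it rebuilds the question's df, uses a hypothetical helper named top_row, and selects the y and z columns explicitly so the result should not depend on how a particular pandas version treats the grouping column inside apply.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame([[0, 1, 3],
                   [3, 1, 1],
                   [3, 0, 0],
                   [2, 3, 3],
                   [2, 1, 0]], columns=list("xyz"))


def top_row(group):
    # Hypothetical helper: keep only the row with the largest y.
    # The returned row keeps its original row label.
    return group.nlargest(1, "y")


# group_keys=True: the group key becomes an extra outer index level named "x".
with_keys = df.groupby("x", group_keys=True)[["y", "z"]].apply(top_row)

# group_keys=False: only the original row labels survive.
without_keys = df.groupby("x", group_keys=False)[["y", "z"]].apply(top_row)

print(with_keys.index.nlevels)     # number of index levels with the key
print(without_keys.index.nlevels)  # number of index levels without it

# Because the key is in the index, the result can be regrouped by that level.
regrouped = with_keys.groupby(level="x")["y"].sum()
print(regrouped)
```

Without the key level, the regrouping step at the end would have nothing to group on, because the x values no longer appear in the index.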

172

answered Sep 26 '22 12:09

Nickil Maveli

If you are passing a function that preserves an index, pandas tries to keep that information. But if you pass a function that removes all semblance of index information, group_keys=True allows you to keep that information.

Use this instead

f = lambda df: df.reset_index(drop=True)

Then the two groupby objects give different results:

gby.apply(lambda df: df.reset_index(drop=True))

gby_k.apply(lambda df: df.reset_index(drop=True))
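Since the output images did not survive, here is a minimal sketch that reproduces the comparison as text. It assumes the df from the question, and it selects the y and z columns explicitly (an addition not in the original answer) so that the grouping column is not passed to the applied function; the names drop_index, kept and lost are illustrative.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame([[0, 1, 3],
                   [3, 1, 1],
                   [3, 0, 0],
                   [2, 3, 3],
                   [2, 1, 0]], columns=list("xyz"))


def drop_index(group):
    # Throw away the original row labels, so each piece is relabelled 0, 1, ...
    return group.reset_index(drop=True)


# group_keys=True: the x value of each piece is added as an outer index level,
# so rows can still be traced back to their group.
kept = df.groupby("x", group_keys=True)[["y", "z"]].apply(drop_index)

# group_keys=False: the pieces are simply concatenated, and the index only
# holds the per-piece positions, which repeat across groups.
lost = df.groupby("x", group_keys=False)[["y", "z"]].apply(drop_index)

print(kept)
print(lost)
```

In the second result nothing in the index says which group a row came from, which is the information group_keys=True preserves.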

answered Sep 25 '22 12:09

piRSquared
