I'm looking for a way to get a list of all the keys in a GroupBy object, but I can't seem to find one in the docs or through Google.
There is definitely a way to access the groups through their keys, like so:
    df_gb = df.groupby(['EmployeeNumber'])
    df_gb.get_group(key)
...so I figure there's a way to access a list (or the like) of the keys in a GroupBy object. I'm looking for something like this:
    df_gb.keys
    Out: [1234, 2356, 6894, 9492]
I figure I could just loop through the GroupBy object and get the keys that way, but I think there's got to be a better way.
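For reference, a minimal sketch of the loop I mean (iterating a GroupBy yields (key, group) pairs):

    keys = []
    for key, group in df_gb:  # each item is a (group key, sub-DataFrame) pair
        keys.append(key)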
You can access this via the .groups attribute on the groupby object; this returns a dict, and the keys of the dict give you the groups:
    In [40]: df = pd.DataFrame({'group':[0,1,1,1,2,2,3,3,3], 'val':np.arange(9)})
             gp = df.groupby('group')
             gp.groups.keys()
    Out[40]: dict_keys([0, 1, 2, 3])
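Note that keys() returns a dict_keys view; if you want a plain list like in the question, just wrap it:

    list(gp.groups.keys())  # -> [0, 1, 2, 3]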
Here is the output from groups:
    In [41]: gp.groups
    Out[41]:
    {0: Int64Index([0], dtype='int64'),
     1: Int64Index([1, 2, 3], dtype='int64'),
     2: Int64Index([4, 5], dtype='int64'),
     3: Int64Index([6, 7, 8], dtype='int64')}
Update
It looks like, because groups is a dict (and dicts didn't guarantee ordering before Python 3.7), the group order isn't maintained when you call keys():
    In [65]: df = pd.DataFrame({'group':list('bgaaabxeb'), 'val':np.arange(9)})
             gp = df.groupby('group')
             gp.groups.keys()
    Out[65]: dict_keys(['b', 'e', 'g', 'a', 'x'])
If you call groups you can see the order is maintained:
    In [79]: gp.groups
    Out[79]:
    {'a': Int64Index([2, 3, 4], dtype='int64'),
     'b': Int64Index([0, 5, 8], dtype='int64'),
     'e': Int64Index([7], dtype='int64'),
     'g': Int64Index([1], dtype='int64'),
     'x': Int64Index([6], dtype='int64')}
so the key order is maintained there. A hack around the keys() issue is to access the .name attribute of each group:
    In [78]: gp.apply(lambda x: x.name)
    Out[78]:
    group
    a    a
    b    b
    e    e
    g    g
    x    x
    dtype: object
This isn't great, as it isn't vectorised; however, if you already have an aggregated object, then you can just get the index values:
    In [81]: agg = gp.sum()
             agg
    Out[81]:
           val
    group
    a        9
    b       13
    e        7
    g        1
    x        6

    In [83]: agg.index.get_level_values(0)
    Out[83]: Index(['a', 'b', 'e', 'g', 'x'], dtype='object', name='group')
A problem with EdChum's answer is that getting the keys via gp.groups.keys() first constructs the full group dictionary. On large dataframes this is a very slow operation, and it effectively doubles the memory consumption. Iterating is way faster:
    df = pd.DataFrame({'group': list('bgaaabxeb'), 'val': np.arange(9)})
    gp = df.groupby('group')
    keys = [key for key, _ in gp]
Executing this list comprehension took me 16 s on my groupby object, while I had to interrupt gp.groups.keys() after 3 minutes.
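If you want to check this on your own data, here is a minimal timing sketch; the frame size and group count below are made-up numbers, just large enough to make the difference visible:

    import time
    import numpy as np
    import pandas as pd

    # made-up sizes, purely for illustration
    n = 1_000_000
    df = pd.DataFrame({'group': np.random.randint(0, 100_000, n),
                       'val': np.arange(n)})

    gp = df.groupby('group')
    t0 = time.perf_counter()
    keys = [key for key, _ in gp]      # iterate: yields (key, subframe) pairs
    print('iterating:      ', time.perf_counter() - t0)

    gp = df.groupby('group')           # fresh groupby so nothing is cached
    t0 = time.perf_counter()
    keys = list(gp.groups.keys())      # materialises the full group dict first
    print('.groups.keys(): ', time.perf_counter() - t0)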