I'm looking for a way to get a list of all the keys in a GroupBy object, but I can't seem to find one in the docs or through Google.
There is definitely a way to access the groups through their keys, like so:
    df_gb = df.groupby(['EmployeeNumber'])
    df_gb.get_group(key)
...so I figure there's a way to access a list (or the like) of the keys in a GroupBy object. I'm looking for something like this:
    df_gb.keys
    Out: [1234, 2356, 6894, 9492]
I figure I could just loop through the GroupBy object and get the keys that way, but I think there's got to be a better way.
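For reference, a minimal sketch of the loop I mean (iterating a GroupBy yields (key, group) pairs):

    keys = []
    for key, group in df_gb:  # each item is a (group key, sub-DataFrame) pair
        keys.append(key)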
You can access this via the .groups attribute on the groupby object; this returns a dict, and the keys of the dict give you the groups:
    In [40]: df = pd.DataFrame({'group':[0,1,1,1,2,2,3,3,3], 'val':np.arange(9)})
             gp = df.groupby('group')
             gp.groups.keys()
    Out[40]: dict_keys([0, 1, 2, 3])
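Note that keys() returns a dict_keys view; if you want a plain list like in the question, just wrap it:

    list(gp.groups.keys())  # -> [0, 1, 2, 3]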
Here is the output from groups:
    In [41]: gp.groups
    Out[41]:
    {0: Int64Index([0], dtype='int64'),
     1: Int64Index([1, 2, 3], dtype='int64'),
     2: Int64Index([4, 5], dtype='int64'),
     3: Int64Index([6, 7, 8], dtype='int64')}
Update
It looks like, because groups is a dict (and dicts didn't guarantee ordering before Python 3.7), the group order isn't maintained when you call keys():
    In [65]: df = pd.DataFrame({'group':list('bgaaabxeb'), 'val':np.arange(9)})
             gp = df.groupby('group')
             gp.groups.keys()
    Out[65]: dict_keys(['b', 'e', 'g', 'a', 'x'])
If you call groups you can see the order is maintained:
    In [79]: gp.groups
    Out[79]:
    {'a': Int64Index([2, 3, 4], dtype='int64'),
     'b': Int64Index([0, 5, 8], dtype='int64'),
     'e': Int64Index([7], dtype='int64'),
     'g': Int64Index([1], dtype='int64'),
     'x': Int64Index([6], dtype='int64')}
so the key order is maintained there. A hack around the keys() issue is to access the .name attribute of each group:
    In [78]: gp.apply(lambda x: x.name)
    Out[78]:
    group
    a    a
    b    b
    e    e
    g    g
    x    x
    dtype: object
This isn't great, as it isn't vectorised; however, if you already have an aggregated object, then you can just get the index values:
    In [81]: agg = gp.sum()
             agg
    Out[81]:
           val
    group
    a        9
    b       13
    e        7
    g        1
    x        6

    In [83]: agg.index.get_level_values(0)
    Out[83]: Index(['a', 'b', 'e', 'g', 'x'], dtype='object', name='group')
A problem with EdChum's answer is that getting the keys via gp.groups.keys() first constructs the full group dictionary. On large dataframes this is a very slow operation, and it effectively doubles the memory consumption. Iterating is way faster:
    df = pd.DataFrame({'group': list('bgaaabxeb'), 'val': np.arange(9)})
    gp = df.groupby('group')
    keys = [key for key, _ in gp]
Executing this list comprehension took me 16 s on my groupby object, while I had to interrupt gp.groups.keys() after 3 minutes.
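If you want to check this on your own data, here is a minimal timing sketch; the frame size and group count below are made-up numbers, just large enough to make the difference visible:

    import time
    import numpy as np
    import pandas as pd

    # made-up sizes, purely for illustration
    n = 1_000_000
    df = pd.DataFrame({'group': np.random.randint(0, 100_000, n),
                       'val': np.arange(n)})

    gp = df.groupby('group')
    t0 = time.perf_counter()
    keys = [key for key, _ in gp]      # iterate: yields (key, subframe) pairs
    print('iterating:      ', time.perf_counter() - t0)

    gp = df.groupby('group')           # fresh groupby so nothing is cached
    t0 = time.perf_counter()
    keys = list(gp.groups.keys())      # materialises the full group dict first
    print('.groups.keys(): ', time.perf_counter() - t0)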