For a DataFrame like
In [1]: import numpy as np; import pandas as pd
In [2]: df = pd.DataFrame({'Name': ['foo', 'bar'] * 3,
   ...:                    'Rank': np.random.randint(0,3,6),
   ...:                    'Val': np.random.rand(6)})
   ...: df
Out[2]:
  Name  Rank       Val
0  foo     0  0.299397
1  bar     0  0.909228
2  foo     0  0.517700
3  bar     0  0.929863
4  foo     1  0.209324
5  bar     2  0.381515
I'm interested in grouping by Name and Rank and possibly getting aggregate values
In [3]: group = df.groupby(['Name', 'Rank'])
In [4]: agg = group.agg(sum)
In [5]: agg
Out[5]:
                Val
Name Rank
bar  0     1.839091
     2     0.381515
foo  0     0.817097
     1     0.209324
But I would like to get a field in the original df
that contains the group number for that row, like
In [13]: df['Group_id'] = [2, 0, 2, 0, 3, 1]
In [14]: df
Out[14]:
  Name  Rank       Val  Group_id
0  foo     0  0.299397         2
1  bar     0  0.909228         0
2  foo     0  0.517700         2
3  bar     0  0.929863         0
4  foo     1  0.209324         3
5  bar     2  0.381515         1
Is there a good way to do this in pandas?
I can get it with Python,
In [16]: from itertools import count
In [17]: c = count()
In [22]: group.transform(lambda x: next(c))
Out[22]:
   Val
0    2
1    0
2    2
3    0
4    3
5    1
but it's pretty slow on a large dataframe, so I figured there might be a better built-in pandas way to do this.
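One workaround sketch (reusing the agg result computed above; key_to_id is just an illustrative name): the sorted MultiIndex of agg already numbers each (Name, Rank) key, so each row can be mapped onto those positions without a per-group Python counter.
# Sketch: agg.index is the sorted (Name, Rank) MultiIndex, so its positions
# are exactly the group numbers wanted above; map each row's key onto them.
key_to_id = {key: i for i, key in enumerate(agg.index)}
df['Group_id'] = [key_to_id[key] for key in zip(df['Name'], df['Rank'])]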
A lot of handy things are stored in the DataFrameGroupBy.grouper
object. For example:
>>> df = pd.DataFrame({'Name': ['foo', 'bar'] * 3,
...                    'Rank': np.random.randint(0,3,6),
...                    'Val': np.random.rand(6)})
>>> grouped = df.groupby(["Name", "Rank"])
>>> grouped.grouper.<TAB>
grouped.grouper.agg_series grouped.grouper.indices
grouped.grouper.aggregate grouped.grouper.labels
grouped.grouper.apply grouped.grouper.levels
grouped.grouper.axis grouped.grouper.names
grouped.grouper.compressed grouped.grouper.ngroups
grouped.grouper.get_group_levels grouped.grouper.nkeys
grouped.grouper.get_iterator grouped.grouper.result_index
grouped.grouper.group_info grouped.grouper.shape
grouped.grouper.group_keys grouped.grouper.size
grouped.grouper.groupings grouped.grouper.sort
grouped.grouper.groups
and so:
>>> df["GroupId"] = df.groupby(["Name", "Rank"]).grouper.group_info[0]
>>> df
  Name  Rank       Val  GroupId
0  foo     0  0.302482        2
1  bar     0  0.375193        0
2  foo     2  0.965763        4
3  bar     2  0.166417        1
4  foo     1  0.495124        3
5  bar     2  0.728776        1
There may be a nicer alias for grouper.group_info[0]
lurking around somewhere, but this should work, anyway.
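As a quick sanity check (a sketch using only the public groups dict): the GroupId numbering follows sorted key order, so enumerating the sorted group keys recovers which (Name, Rank) pair each id stands for.
>>> # Sketch: the keys shown here correspond to the table above
>>> dict(enumerate(sorted(grouped.groups)))
{0: ('bar', 0), 1: ('bar', 2), 2: ('foo', 0), 3: ('foo', 1), 4: ('foo', 2)}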
Use GroupBy.ngroup
from pandas 0.20.2+:
df["GroupId"] = df.groupby(["Name", "Rank"]).ngroup()
print (df)
  Name  Rank       Val  GroupId
0  foo     2  0.451724        4
1  bar     0  0.944676        0
2  foo     0  0.822390        2
3  bar     2  0.063603        1
4  foo     1  0.938892        3
5  bar     2  0.332454        1
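ngroup numbers groups in sorted key order by default. If numbering by order of first appearance is wanted instead, groupby's standard sort=False option changes that, and ngroup(ascending=False) reverses the numbering; a small sketch (the new column names are just illustrative):
# number groups by order of first appearance instead of sorted key order
df["GroupId_appearance"] = df.groupby(["Name", "Rank"], sort=False).ngroup()

# reverse the numbering: the first sorted key gets the highest id
df["GroupId_desc"] = df.groupby(["Name", "Rank"]).ngroup(ascending=False)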
The correct solution is to use grouper.label_info:
df["GroupId"] = df.groupby(["Name", "Rank"]).grouper.label_info
It automatically associates each row in the df dataframe with the corresponding group label.