For a DataFrame like
In [1]: import numpy as np; import pandas as pd
In [2]: df = pd.DataFrame({'Name': ['foo', 'bar'] * 3,
   ...:                    'Rank': np.random.randint(0,3,6),
   ...:                    'Val': np.random.rand(6)})
   ...: df
Out[2]:
  Name  Rank       Val
0  foo     0  0.299397
1  bar     0  0.909228
2  foo     0  0.517700
3  bar     0  0.929863
4  foo     1  0.209324
5  bar     2  0.381515
I'm interested in grouping by Name and Rank and possibly getting aggregate values
In [3]: group = df.groupby(['Name', 'Rank'])
In [4]: agg = group.agg(sum)
In [5]: agg
Out[5]:
                Val
Name Rank
bar  0     1.839091
     2     0.381515
foo  0     0.817097
     1     0.209324
But I would like to get a field in the original df
that contains the group number for that row, like
In [13]: df['Group_id'] = [2, 0, 2, 0, 3, 1]
In [14]: df
Out[14]:
  Name  Rank       Val  Group_id
0  foo     0  0.299397         2
1  bar     0  0.909228         0
2  foo     0  0.517700         2
3  bar     0  0.929863         0
4  foo     1  0.209324         3
5  bar     2  0.381515         1
Is there a good way to do this in pandas?
I can get it with Python,
In [16]: from itertools import count
In [17]: c = count()
In [22]: group.transform(lambda x: next(c))
Out[22]:
   Val
0    2
1    0
2    2
3    0
4    3
5    1
but it's pretty slow on a large dataframe, so I figured there might be a better built-in pandas way to do this.
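One workaround sketch (reusing the agg result computed above; key_to_id is just an illustrative name): the sorted MultiIndex of agg already numbers each (Name, Rank) key, so each row can be mapped onto those positions without a per-group Python counter.
# Sketch: agg.index is the sorted (Name, Rank) MultiIndex, so its positions
# are exactly the group numbers wanted above; map each row's key onto them.
key_to_id = {key: i for i, key in enumerate(agg.index)}
df['Group_id'] = [key_to_id[key] for key in zip(df['Name'], df['Rank'])]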
A lot of handy things are stored in the DataFrameGroupBy.grouper
object. For example:
>>> df = pd.DataFrame({'Name': ['foo', 'bar'] * 3,
...                    'Rank': np.random.randint(0,3,6),
...                    'Val': np.random.rand(6)})
>>> grouped = df.groupby(["Name", "Rank"])
>>> grouped.grouper.<TAB>
grouped.grouper.agg_series grouped.grouper.indices
grouped.grouper.aggregate grouped.grouper.labels
grouped.grouper.apply grouped.grouper.levels
grouped.grouper.axis grouped.grouper.names
grouped.grouper.compressed grouped.grouper.ngroups
grouped.grouper.get_group_levels grouped.grouper.nkeys
grouped.grouper.get_iterator grouped.grouper.result_index
grouped.grouper.group_info grouped.grouper.shape
grouped.grouper.group_keys grouped.grouper.size
grouped.grouper.groupings grouped.grouper.sort
grouped.grouper.groups
and so:
>>> df["GroupId"] = df.groupby(["Name", "Rank"]).grouper.group_info[0]
>>> df
  Name  Rank       Val  GroupId
0  foo     0  0.302482        2
1  bar     0  0.375193        0
2  foo     2  0.965763        4
3  bar     2  0.166417        1
4  foo     1  0.495124        3
5  bar     2  0.728776        1
There may be a nicer alias for grouper.group_info[0]
lurking around somewhere, but this should work, anyway.
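As a quick sanity check (a sketch using only the public groups dict): the GroupId numbering follows sorted key order, so enumerating the sorted group keys recovers which (Name, Rank) pair each id stands for.
>>> # Sketch: the keys shown here correspond to the table above
>>> dict(enumerate(sorted(grouped.groups)))
{0: ('bar', 0), 1: ('bar', 2), 2: ('foo', 0), 3: ('foo', 1), 4: ('foo', 2)}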
Use GroupBy.ngroup
from pandas 0.20.2+:
df["GroupId"] = df.groupby(["Name", "Rank"]).ngroup()
print (df)
  Name  Rank       Val  GroupId
0  foo     2  0.451724        4
1  bar     0  0.944676        0
2  foo     0  0.822390        2
3  bar     2  0.063603        1
4  foo     1  0.938892        3
5  bar     2  0.332454        1
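ngroup numbers groups in sorted key order by default. If numbering by order of first appearance is wanted instead, groupby's standard sort=False option changes that, and ngroup(ascending=False) reverses the numbering; a small sketch (the new column names are just illustrative):
# number groups by order of first appearance instead of sorted key order
df["GroupId_appearance"] = df.groupby(["Name", "Rank"], sort=False).ngroup()

# reverse the numbering: the first sorted key gets the highest id
df["GroupId_desc"] = df.groupby(["Name", "Rank"]).ngroup(ascending=False)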
The correct solution is to use grouper.label_info:
df["GroupId"] = df.groupby(["Name", "Rank"]).grouper.label_info
It automatically associates each row in the df dataframe with the corresponding group label.