Enumerate each row for each group in a DataFrame

Tags: python, pandas

In pandas, how can I add a new column which enumerates rows based on a given grouping?

For instance, assume the following DataFrame:

import pandas as pd
import numpy as np

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})
df
  col_a  col_b
0     A      0
1     B      1
2     C      2
3     A      3
4     A      4
5     C      5
6     B      6
7     B      7
8     A      8
9     C      9

I'd like to add a col_c that gives each row's position (1, 2, 3, ...) within its "group", where rows are grouped by col_a and ordered by col_b.

Desired output:

  col_a  col_b  col_c
0     A      0      1
3     A      3      2
4     A      4      3
8     A      8      4
1     B      1      1
6     B      6      2
7     B      7      3
2     C      2      1
5     C      5      2
9     C      9      3

I'm struggling to get to col_c. You can get the proper grouping and sorting with .sort_index(by=['col_a', 'col_b']) (in current pandas, .sort_values(['col_a', 'col_b'])); it's now a matter of creating that new column and labeling each row.

asked Jun 21 '13 05:06

Greg Reda

2 Answers

There's cumcount, for precisely this case:

df['col_c'] = df.groupby('col_a').cumcount()

As it says in the docs:

Number each item in each group from 0 to the length of that group - 1.
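
Applied to the question's data, this might look like the following minimal sketch. It assumes a pandas version that provides sort_values and GroupBy.cumcount, sorts first so that rows are ordered by col_b within each group, and adds 1 because the desired output starts counting at 1 rather than 0:

```python
import pandas as pd

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})

# Order rows by col_a, then by col_b within each col_a group.
df = df.sort_values(['col_a', 'col_b'])

# cumcount() numbers rows 0..n-1 within each group; add 1 to get 1..n.
df['col_c'] = df.groupby('col_a').cumcount() + 1

print(df)
```

The original index labels (0, 3, 4, 8, ...) are kept, as in the desired output.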

Original answer (before cumcount was defined).

You could create a helper function to do this:

def add_col_c(x):
    x['col_c'] = np.arange(len(x))
    return x

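First sort by column col_a (df.sort was the API at the time; it has since been removed from pandas in favor of df.sort_values):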

In [11]: df.sort('col_a', inplace=True)

then apply this function across each group:

In [12]: g = df.groupby('col_a', as_index=False)

In [13]: g.apply(add_col_c)
Out[13]:
  col_a  col_b  col_c
3     A      3      0
8     A      8      1
0     A      0      2
4     A      4      3
6     B      6      0
1     B      1      1
7     B      7      2
9     C      9      0
2     C      2      1
5     C      5      2

To number from 1, 2, ... instead, you could use np.arange(1, len(x) + 1).
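
As a rough sketch of the same per-group idea with 1-based numbering, the helper can be passed to transform instead of apply, which avoids mutating each group inside the function. This is a variation on the answer above, not the original code; number_rows is a hypothetical helper name, and sort_values is assumed to be available:

```python
import numpy as np
import pandas as pd

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})
df = df.sort_values(['col_a', 'col_b'])


def number_rows(group):
    # Hypothetical helper: returns 1, 2, ..., len(group) for one group.
    return np.arange(1, len(group) + 1)


# transform calls number_rows once per group and aligns the
# results back to the original rows of df.
df['col_c'] = df.groupby('col_a')['col_b'].transform(number_rows)

print(df)
```

Like the apply version, this still calls a Python function once per group.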

answered Oct 04 '22 02:10

Andy Hayden

The given answers both involve calling a Python function for each group. If you have many groups, a vectorized approach should be faster (I haven't checked).

Here is my pure NumPy suggestion:

In [5]: df.sort_values(['col_a', 'col_b'], inplace=True, ascending=[False, False])
In [6]: sizes = df.groupby('col_a', sort=False).size().values
In [7]: df['col_c'] = np.arange(sizes.sum()) - np.repeat(sizes.cumsum() - sizes, sizes)
In [8]: print(df)
  col_a  col_b  col_c
9     C      9      0
5     C      5      1
2     C      2      2
7     B      7      0
6     B      6      1
1     B      1      2
8     A      8      0
4     A      4      1
3     A      3      2
0     A      0      3
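
To make the arithmetic easier to follow, here is a small sketch that splits the expression into steps and compares it against groupby(...).cumcount(). It assumes the frame is already sorted so that each group's rows are contiguous, which the formula relies on; the names starts, numpy_count and pandas_count are introduced here for illustration:

```python
import numpy as np
import pandas as pd

a_list = ['A', 'B', 'C', 'A', 'A', 'C', 'B', 'B', 'A', 'C']
df = pd.DataFrame({'col_a': a_list, 'col_b': range(10)})
df = df.sort_values(['col_a', 'col_b'])

# Group sizes in order of appearance: A=4, B=3, C=3.
sizes = df.groupby('col_a', sort=False).size().values

# Row position at which each group starts (0, 4, 7),
# repeated once for every row of that group.
starts = np.repeat(sizes.cumsum() - sizes, sizes)

# Global row position minus the group's start gives the position in the group.
numpy_count = np.arange(sizes.sum()) - starts

# Reference result from pandas for comparison.
pandas_count = df.groupby('col_a').cumcount().values

print(numpy_count)
print((numpy_count == pandas_count).all())
```

If the rows of a group were not adjacent, the subtraction would no longer line up with the groups, so the sort step is not optional for this approach.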

answered Oct 04 '22 04:10

andrew
