Pandas: assign an index to each group identified by groupby

Tags:

python

pandas

When using groupby(), how can I create a DataFrame with a new column containing the index of each row's group, similar to dplyr::group_indices in R? For example, if I have

>>> df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
>>> df
   a  b
0  1  1
1  1  1
2  1  2
3  2  1
4  2  1
5  2  2

How can I get a DataFrame like

   a  b  idx
0  1  1    1
1  1  1    1
2  1  2    2
3  2  1    3
4  2  1    3
5  2  2    4

(the order of the idx indexes doesn't matter)

745

asked Jan 11 '17 15:01

user2667066

6 Answers

Here is the solution using ngroup (available as of pandas 0.20.2) from a comment above by Constantino.

import pandas as pd
df = pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
df['idx'] = df.groupby(['a', 'b']).ngroup()
df

   a  b  idx
0  1  1    0
1  1  1    0
2  1  2    1
3  2  1    2
4  2  1    2
5  2  2    3
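
If the 1-based numbering from the question is wanted, or the groups should be numbered in order of first appearance rather than in sorted key order, a sketch along these lines may help. The rows are made up for illustration and the column names idx_sorted and idx_seen are hypothetical; sort=False is an ordinary groupby option.

```python
import pandas as pd

# Illustrative data whose groups do not first appear in sorted order
df = pd.DataFrame({'a': [2, 2, 1, 1, 2], 'b': [1, 1, 2, 2, 2]})

# Default: groups are numbered in sorted key order; add 1 for 1-based ids
df['idx_sorted'] = df.groupby(['a', 'b']).ngroup() + 1

# sort=False: groups are numbered in the order they first appear
df['idx_seen'] = df.groupby(['a', 'b'], sort=False).ngroup() + 1

print(df)
```

Since the question says the order of the ids does not matter, either variant should satisfy it; the choice only affects which group gets which number.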

171

answered Oct 20 '22 16:10

Calum You

Here's a concise way using drop_duplicates and merge to get a unique identifier.

group_vars = ['a','b']
df.merge( df.drop_duplicates( group_vars ).reset_index(), on=group_vars )

   a  b  index
0  1  1      0
1  1  1      0
2  1  2      2
3  2  1      3
4  2  1      3
5  2  2      5

The identifier in this case goes 0,2,3,5 (just a residual of the original index), but this could easily be changed to 0,1,2,3 by calling reset_index(drop=True) on the de-duplicated frame before the final reset_index().
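
As a rough sketch of the consecutive-id variant: build a small lookup table of unique key combinations, renumber it, and merge it back. The names lookup, result and idx are hypothetical, and how='left' is used here on the assumption that the original row order should be kept.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 1, 2, 1, 1, 2]})
group_vars = ['a', 'b']

# One row per unique (a, b) combination, renumbered 0..n-1
lookup = (
    df.drop_duplicates(group_vars)[group_vars]
      .reset_index(drop=True)   # discard the leftover original index (0, 2, 3, 5)
      .reset_index()            # turn the fresh 0..n-1 index into a column
      .rename(columns={'index': 'idx'})
)

# Attach the id to every row of the original frame
result = df.merge(lookup, on=group_vars, how='left')
print(result)
```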

Update: Newer versions of pandas (0.20.2 and later) offer a simpler way to do this with the ngroup method, as noted in a comment to the question above by @Constantino and a subsequent answer by @CalumYou. I'll leave this here as an alternate approach, but ngroup seems like the better way to do this in most cases.

answered Oct 20 '22 16:10

JohnE

A simple way to do that would be to concatenate your grouping columns (so that each combination of their values becomes a distinct string), then convert the result to a pandas Categorical and keep only its codes:

df['idx'] = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes
df

    a   b   idx
0   1   1   0
1   1   1   0
2   1   2   1
3   2   1   2
4   2   1   2
5   2   2   3

Edit: changed the labels property to codes, as the former seems to be deprecated

Edit2: Added a separator as suggested by Authman Apatira
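
To illustrate why the separator matters, here is a small sketch with made-up values where two different (a, b) pairs concatenate to the same string when no separator is used. The variable names no_sep and with_sep are hypothetical.

```python
import pandas as pd

# Two distinct groups: (1, 11) and (11, 1)
df = pd.DataFrame({'a': [1, 11], 'b': [11, 1]})

# Without a separator both rows become the string '111' and collide
no_sep = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes

# With a separator the strings are '1_11' and '11_1', so the groups stay distinct
with_sep = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes

print(no_sep, with_sep)
```

A separator only helps as long as it cannot occur inside the values themselves, so for free-form string columns this approach probably still needs care.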

answered Oct 20 '22 16:10

foglerit

Definitely not the most straightforward solution, but here is what I would do (comments in the code):

import pandas as pd

df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})

#create a dummy grouper id by just joining desired rows
df["idx"] = df[["a","b"]].astype(str).apply(lambda x: "".join(x),axis=1)

print(df)

That would generate a unique idx for each combination of a and b.

   a  b idx
0  1  1  11
1  1  1  11
2  1  2  12
3  2  1  21
4  2  1  21
5  2  2  22

But this is still a rather silly index (think about some more complex values in columns a and b). So let's clean up the index:

# create a dictionary of dummy group_ids and their index-wise representation
dict_idx = dict(enumerate(set(df["idx"])))

# switch keys and values, so you can use the dict as a lookup table
dict_idx = {y:x for x,y in dict_idx.items()}

# map each dummy id to its integer using the generated dict
df["idx"] = df["idx"].map(dict_idx)

print(df)

That would produce the desired output (the exact numbering may differ between runs, since the iteration order of a set is arbitrary):

   a  b  idx
0  1  1    0
1  1  1    0
2  1  2    1
3  2  1    2
4  2  1    2
5  2  2    3
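
A possibly shorter variant of the same idea is to let pd.factorize assign the integers, which numbers the joined strings in order of first appearance instead of depending on set ordering. This is only a sketch: the name joined and the '_' separator are assumptions added here, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 1, 2, 1, 1, 2]})

# Join the grouping columns into one string key per row (separator assumed)
joined = df[['a', 'b']].astype(str).apply('_'.join, axis=1)

# factorize returns integer codes plus the unique keys, in order of first appearance
codes, uniques = pd.factorize(joined)
df['idx'] = codes

print(df)
```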

answered Oct 20 '22 15:10

Marjan Moderc

A way that I believe is faster than the current accepted answer by about an order of magnitude (timing results below):

def create_index_usingduplicated(df, grouping_cols=['a', 'b']):
    df.sort_values(grouping_cols, inplace=True)
    # You could do the following three lines in one, I just thought 
    # this would be clearer as an explanation of what's going on:
    duplicated = df.duplicated(subset=grouping_cols, keep='first')
    new_group = ~duplicated
    return new_group.cumsum()

Timing results:

import numpy as np
import pandas as pd

a = np.random.randint(0, 1000, size=int(1e5))
b = np.random.randint(0, 1000, size=int(1e5))
df = pd.DataFrame({'a': a, 'b': b})

In [6]: %timeit df['idx'] = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes
1 loop, best of 3: 375 ms per loop

In [7]: %timeit df['idx'] = create_index_usingduplicated(df, grouping_cols=['a', 'b'])
100 loops, best of 3: 17.7 ms per loop
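
As a small sanity check of the sort-then-duplicated idea, the sketch below runs the same steps on a few made-up, unsorted rows without sorting in place, and compares the result with ngroup. The names ordered and idx are hypothetical; the assignment back to df relies on pandas aligning by index label.

```python
import pandas as pd

# Unsorted illustrative data
df = pd.DataFrame({'a': [2, 1, 2, 1], 'b': [1, 1, 1, 2]})
grouping_cols = ['a', 'b']

# Sort a copy so that rows of the same group are adjacent
ordered = df.sort_values(grouping_cols)

# The first row of each group is "not duplicated"; cumsum turns those flags into ids
idx = (~ordered.duplicated(subset=grouping_cols, keep='first')).cumsum()

# Assignment aligns on the index, so each id lands on its original row
df['idx'] = idx

print(df)
```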

answered Oct 20 '22 16:10

maxliving

I'm not sure this is such a trivial problem. Here is a somewhat convoluted solution that first sorts by the grouping columns, then checks whether each row differs from the previous row and, if so, increments the group counter by 1. Check further below for an answer with string data.

df.sort_values(['a', 'b']).diff().fillna(0).ne(0).any(axis=1).cumsum().add(1)

Output

0    1
1    1
2    2
3    3
4    3
5    4
dtype: int64

Breaking this up into steps, let's look at the output of df.sort_values(['a', 'b']).diff().fillna(0), which shows how each row differs from the previous row. Any non-zero entry indicates a new group.

     a    b
0  0.0  0.0
1  0.0  0.0
2  0.0  1.0
3  1.0 -1.0
4  0.0  0.0
5  0.0  1.0

A new group only needs to differ in a single column, which is what the .ne(0) and row-wise any steps check: whether the entry is not equal to 0 in any of the columns. A cumulative sum then keeps track of the groups.
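
The chain can also be written out one step at a time, which may make the intermediate results easier to inspect. The names ordered, changed, new_group and idx are hypothetical labels for the steps described above.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 1, 2, 1, 1, 2]})

# Step 1: sort so that rows of the same group are adjacent
ordered = df.sort_values(['a', 'b'])

# Step 2: row-to-row difference; the first row has no predecessor, so fill with 0
changed = ordered.diff().fillna(0).ne(0)

# Step 3: a row starts a new group if any column changed
new_group = changed.any(axis=1)

# Step 4: running count of group starts, shifted to start at 1
idx = new_group.cumsum().add(1)

print(idx)
```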

Answer for columns as strings

#create fake data and sort it
df=pd.DataFrame({'a':list('aabbaccdc'),'b':list('aabaacddd')})
df1 = df.sort_values(['a', 'b'])

Output of df1:

   a  b
0  a  a
1  a  a
4  a  a
3  b  a
2  b  b
5  c  c
6  c  d
8  c  d
7  d  d

Take a similar approach by checking whether the group has changed

df1.ne(df1.shift().bfill()).any(axis=1).cumsum().add(1)

0    1
1    1
4    1
3    2
2    3
5    4
6    5
8    5
7    6
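
One possible simplification, offered as a sketch rather than as the answer's method: without bfill, the first row is compared against a missing value and therefore counts as a change, so the ids start at 1 without the final add(1). The names ids and with_bfill are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'a': list('aabbaccdc'), 'b': list('aabaacddd')})
df1 = df.sort_values(['a', 'b'])

# The first row is compared with a missing value, so it is flagged as a new group
ids = df1.ne(df1.shift()).any(axis=1).cumsum()

# Same ids as the bfill-and-add(1) version shown above
with_bfill = df1.ne(df1.shift().bfill()).any(axis=1).cumsum().add(1)

# Assigning back aligns on the index, restoring the original row order
df['idx'] = ids
print(df)
```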

answered Oct 20 '22 14:10

Ted Petrou
