Create an incremental suffix for values in a pandas column that have duplicate values in another column

Setup

I have a DataFrame, df:

import pandas as pd

df = pd.DataFrame(
    {
        'Name':['foo','foo','foo','bar','bar','bar','baz','baz','baz'],
        'Color':['red','blue','red','green','green','blue','yellow','orange','red']
    }
)
  Name   Color
0  foo     red
1  foo    blue
2  foo     red
3  bar   green
4  bar   green
5  bar    blue
6  baz  yellow
7  baz  orange
8  baz     red

Desired Output

I would like to add an enumerating suffix to each Name that has a duplicate Color:

pd.DataFrame(
    {
        'Name':['foo_1','foo','foo_2','bar_1','bar_2','bar','baz','baz','baz'],
        'Color':['red','blue','red','green','green','blue','yellow','orange','red']
    }
)
    Name   Color
0  foo_1     red
1    foo    blue
2  foo_2     red
3  bar_1   green
4  bar_2   green
5    bar    blue
6    baz  yellow
7    baz  orange
8    baz     red

As you can see, a suffix with an incremental count is added each time a Name repeats a Color. If a Name has a given Color only once, no suffix is added.

What I have tried

I was thinking of using a .groupby() with .value_counts() to get a total count for each (Name, Color) pair, and somehow using that to assign the suffixes where necessary. Here is an idea I had that seems very inefficient:

# group by Name and count how many times each Color appears within the group
gb = df.groupby('Name')['Color'].value_counts().rename('Color_count').reset_index()

# keep only counts that are > 1, i.e. (Name, Color) pairs that need a suffix
gb = gb.loc[gb.Color_count > 1].copy()

# merge back to original df
df.merge(gb, on=['Name', 'Color'], how='left').fillna(0)

# from here, somehow start an incremental suffix for nonzero values of `Color_count`...
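
For reference, here is roughly how I imagine finishing that idea (a rough sketch; out and suffix are just placeholder names), though it still feels clunky:

# merge the >1 counts back, then number each duplicated (Name, Color) pair with cumcount()
out = df.merge(gb, on=['Name', 'Color'], how='left')
suffix = out.groupby(['Name', 'Color']).cumcount().add(1).astype(str)
out['Name'] = out['Name'].where(out['Color_count'].isna(), out['Name'] + '_' + suffix)
out = out.drop(columns='Color_count')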
asked Oct 31 '25 by bismo

2 Answers

Check for duplicates based on your two columns, then group that boolean result by the same columns and use cumsum to number the duplicates. Combine the numbers with your Name column, and finally use the boolean mask to keep the original values for the non-duplicated rows.

# flag all rows whose (Name, Color) pair appears more than once
m = df.duplicated(subset=['Name', 'Color'], keep=False)
# cumulative count within each (Name, Color) group: 1, 2, ... for duplicates, 0 otherwise
n = m.groupby([df['Name'], df['Color']]).cumsum()
# append the number only where the row was flagged as a duplicate
df['New_Name'] = (df['Name'] + '_' + n.astype(str)).where(m, df['Name'])

End result:

 Name   Color New_Name
0  foo     red    foo_1
1  foo    blue      foo
2  foo     red    foo_2
3  bar   green    bar_1
4  bar   green    bar_2
5  bar    blue      bar
6  baz  yellow      baz
7  baz  orange      baz
8  baz     red      baz
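
If you would rather overwrite Name itself (to match the desired output exactly) instead of keeping a separate column, you could move the computed values over afterwards, for example:

df['Name'] = df.pop('New_Name')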
answered Nov 01 '25 by Triky


Another possible solution:

g = df.groupby(['Name', 'Color'], sort=False)
df.assign(
    Name=lambda d: d['Name'].mask(
        g.transform('size').gt(1),
        d['Name'] + '_' + g.cumcount().add(1).astype(str)
    )
)

The code groups rows by (Name, Color) with groupby(sort=False) to preserve the original order, then uses assign to rebuild Name. transform('size').gt(1) flags groups where the (Name, Color) pair occurs more than once. mask keeps the original value when the flag is false and, when it is true, replaces it by concatenating the base name, an underscore, and a 1-based index computed via cumcount().add(1).astype(str); the string cast ensures safe concatenation. This yields foo_1, foo_2, etc., while unique pairs remain unchanged.

Output:

    Name   Color
0  foo_1     red
1    foo    blue
2  foo_2     red
3  bar_1   green
4  bar_2   green
5    bar    blue
6    baz  yellow
7    baz  orange
8    baz     red
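
If the chained assign is hard to follow, the same logic can be written step by step (a sketch; the is_dup and position names are mine, and this version writes back into df rather than returning a new frame):

g = df.groupby(['Name', 'Color'], sort=False)
is_dup = g['Color'].transform('size').gt(1)     # True where the (Name, Color) pair repeats
position = g.cumcount().add(1).astype(str)      # 1-based position within each pair
df['Name'] = df['Name'].mask(is_dup, df['Name'] + '_' + position)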
answered Nov 01 '25 by PaulS