I have a dataframe, df
import pandas as pd
df = pd.DataFrame(
{
'Name':['foo','foo','foo','bar','bar','bar','baz','baz','baz'],
'Color':['red','blue','red','green','green','blue','yellow','orange','red']
}
)
Name Color
0 foo red
1 foo blue
2 foo red
3 bar green
4 bar green
5 bar blue
6 baz yellow
7 baz orange
8 baz red
I would like to add an enumerating suffix for each Name that has a duplicate Color
pd.DataFrame(
{
'Name':['foo_1','foo','foo_2','bar_1','bar_2','bar','baz','baz','baz'],
'Color':['red','blue','red','green','green','blue','yellow','orange','red']
}
)
Name Color
0 foo_1 red
1 foo blue
2 foo_2 red
3 bar_1 green
4 bar_2 green
5 bar blue
6 baz yellow
7 baz orange
8 baz red
As you can see, there is a suffix with an incremental count for each time a Name has a repeat Color. If a Name has a Color only one time, there is no suffix added.
I was thinking of taking a .groupby() with an aggregate of .value_counts() to get a total count, and somehow use that to assign the suffixes if necessary. Here is an idea I had that seems very inefficient:
# group by name aggregate color value counts
gb = df.groupby(['Name']).agg(Color_count=('Color','value_counts')).reset_index()
# keep only counts that are >1 ie need a suffix
gb = gb.loc[gb.Color_count > 1].copy()
# merge back to original df
df.merge(gb, on=['Name','Color'],how='left').fillna(0)
# from here, somehow start an incremental suffix for nonzero values of `Color_count`...
Check for duplicates based on your columns, then use the result to groupby using the same columns as groups and use cumsum to count the duplicates. Combine the values with your column and after that use a mask to keep the non duplicate values.
m = df.duplicated(subset=['Name', 'Color'], keep=False)
n = m.groupby([df['Name'], df['Color']]).cumsum()
df['New_Name'] = (df['Name'] + '_' + n.astype('str')).where(m , df['Name'])
End result:
Name Color New_Name
0 foo red foo_1
1 foo blue foo
2 foo red foo_2
3 bar green bar_1
4 bar green bar_2
5 bar blue bar
6 baz yellow baz
7 baz orange baz
8 baz red baz
Another possible solution:
g = df.groupby(['Name', 'Color'], sort=False)
df.assign(
Name = lambda d: d['Name'].mask(
g.transform('size').gt(1),
d['Name'] + '_' + (g.cumcount().add(1).astype(str))
)
)
The code groups rows by (Name, Color) with groupby(sort=False) to preserve original order, then uses assign to rebuild Name: transform('size').gt(1) flags groups where the (Name, Color) pair occurs more than once; mask keeps the original value when the flag is false and, when true, replaces it by concatenating the base name, an underscore, and a 1-based index computed via cumcount().add(1).astype(str); the string cast ensures safe concatenation, yielding foo_1, foo_2, etc., while unique pairs remain unchanged.
Output:
Name Color
0 foo_1 red
1 foo blue
2 foo_2 red
3 bar_1 green
4 bar_2 green
5 bar blue
6 baz yellow
7 baz orange
8 baz red
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With