How to assign a unique ID to detect repeated rows in a pandas dataframe?

I am working with a large pandas dataframe, with several columns pretty much like this:

A      B         C    D   

John   Tom       0    1
Homer  Bart      2    3
Tom    Maggie    1    4 
Lisa   John      5    0
Homer  Bart      2    3
Lisa   John      5    0
Homer  Bart      2    3
Homer  Bart      2    3
Tom    Maggie    1    4

How can I assign a unique id to each group of repeated rows? For example:

A      B         C    D    new_id

John   Tom       0    1    1
Homer  Bart      2    3    2
Tom    Maggie    1    4    3
Lisa   John      5    0    4
Homer  Bart      2    3    2
Lisa   John      5    0    4
Homer  Bart      2    3    2
Homer  Bart      2    3    2
Tom    Maggie    1    4    3

I know that I can use duplicated() to detect the duplicated rows, but I cannot see where those rows are repeated. I tried:

df.assign(id=(df.columns).astype('category').cat.codes)
df

However, it is not working. How can I get a unique id for detecting groups of duplicated rows?
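
For reference, here is a minimal sketch that rebuilds the example frame above (values copied from the first table):

import pandas as pd

# Rebuild the example frame shown above
df = pd.DataFrame({
    'A': ['John', 'Homer', 'Tom', 'Lisa', 'Homer', 'Lisa', 'Homer', 'Homer', 'Tom'],
    'B': ['Tom', 'Bart', 'Maggie', 'John', 'Bart', 'John', 'Bart', 'Bart', 'Maggie'],
    'C': [0, 2, 1, 5, 2, 5, 2, 2, 1],
    'D': [1, 3, 4, 0, 3, 0, 3, 3, 4],
})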

asked Jun 29 '18 by anon


2 Answers

For small dataframes, you can convert your rows to tuples, which can be hashed, and then use pd.factorize.

df['new_id'] = pd.factorize(df.apply(tuple, axis=1))[0] + 1

groupby is more efficient for larger dataframes:

df['new_id'] = df.groupby(df.columns.tolist(), sort=False).ngroup() + 1
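
As a quick sanity check on a fresh copy of the example frame (a sketch, assuming the df built in the question, before any new_id column has been added), both lines produce the same labels in first-appearance order:

ids_factorize = pd.factorize(df.apply(tuple, axis=1))[0] + 1
ids_groupby = df.groupby(df.columns.tolist(), sort=False).ngroup() + 1

print(ids_factorize.tolist())                            # [1, 2, 3, 4, 2, 4, 2, 2, 3]
print((ids_factorize == ids_groupby.to_numpy()).all())   # True: identical labels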
answered Nov 21 '22 by jpp


Group by the columns you are trying to find duplicates over and use ngroup:

df['new_id'] = df.groupby(['A','B','C','D']).ngroup()
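
One thing to watch: by default ngroup numbers the groups 0, 1, 2, ... following the sorted order of the group keys. If you want ids that start at 1 and follow the order of first appearance instead, something like this should work on the example frame:

# sort=False keeps first-appearance order; + 1 makes the ids start at 1
df['new_id'] = df.groupby(['A', 'B', 'C', 'D'], sort=False).ngroup() + 1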
answered Nov 21 '22 by it's-yer-boy-chet