How to assign a unique ID to detect repeated rows in a pandas dataframe?

I am working with a large pandas dataframe, with several columns pretty much like this:

A      B         C    D   

John   Tom       0    1
Homer  Bart      2    3
Tom    Maggie    1    4 
Lisa   John      5    0
Homer  Bart      2    3
Lisa   John      5    0
Homer  Bart      2    3
Homer  Bart      2    3
Tom    Maggie    1    4

How can I assign a unique id to each group of repeated rows? For example:

A      B         C    D    new_id

John   Tom       0    1    1
Homer  Bart      2    3    2
Tom    Maggie    1    4    3
Lisa   John      5    0    4
Homer  Bart      2    3    2
Lisa   John      5    0    4
Homer  Bart      2    3    2
Homer  Bart      2    3    2
Tom    Maggie    1    4    3

I know that I can use duplicated() to detect the duplicated rows, but I cannot see where those rows are repeated. I tried:

df.assign(id=(df.columns).astype('category').cat.codes)
df

However, it is not working. How can I get a unique id for detecting groups of duplicated rows?
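
For reference, here is a minimal sketch that rebuilds the example frame above (values copied from the first table):

import pandas as pd

# Rebuild the example frame shown above
df = pd.DataFrame({
    'A': ['John', 'Homer', 'Tom', 'Lisa', 'Homer', 'Lisa', 'Homer', 'Homer', 'Tom'],
    'B': ['Tom', 'Bart', 'Maggie', 'John', 'Bart', 'John', 'Bart', 'Bart', 'Maggie'],
    'C': [0, 2, 1, 5, 2, 5, 2, 2, 1],
    'D': [1, 3, 4, 0, 3, 0, 3, 3, 4],
})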

asked Jun 29 '18 by anon


2 Answers

For small dataframes, you can convert your rows to tuples, which can be hashed, and then use pd.factorize.

df['new_id'] = pd.factorize(df.apply(tuple, axis=1))[0] + 1

groupby is more efficient for larger dataframes:

df['new_id'] = df.groupby(df.columns.tolist(), sort=False).ngroup() + 1
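
As a quick sanity check on a fresh copy of the example frame (a sketch, assuming the df built in the question, before any new_id column has been added), both lines produce the same labels in first-appearance order:

ids_factorize = pd.factorize(df.apply(tuple, axis=1))[0] + 1
ids_groupby = df.groupby(df.columns.tolist(), sort=False).ngroup() + 1

print(ids_factorize.tolist())                            # [1, 2, 3, 4, 2, 4, 2, 2, 3]
print((ids_factorize == ids_groupby.to_numpy()).all())   # True: identical labels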
answered Nov 21 '22 by jpp


Group by the columns you are trying to find duplicates over and use ngroup:

df['new_id'] = df.groupby(['A','B','C','D']).ngroup()
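
One thing to watch: by default ngroup numbers the groups 0, 1, 2, ... following the sorted order of the group keys. If you want ids that start at 1 and follow the order of first appearance instead, something like this should work on the example frame:

# sort=False keeps first-appearance order; + 1 makes the ids start at 1
df['new_id'] = df.groupby(['A', 'B', 'C', 'D'], sort=False).ngroup() + 1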
answered Nov 21 '22 by it's-yer-boy-chet