
How to identify the first occurrence of duplicate rows in a Python pandas DataFrame

I have a pandas DataFrame with duplicate values for a set of columns. For example:

df = pd.DataFrame({'Column1': {0: 1, 1: 2, 2: 3}, 'Column2': {0: 'ABC', 1: 'XYZ', 2: 'ABC'}, 'Column3': {0: 'DEF', 1: 'DEF', 2: 'DEF'}, 'Column4': {0: 10, 1: 40, 2: 10}})

In [2]: df
Out[2]: 
   Column1 Column2 Column3  Column4
0        1     ABC     DEF       10
1        2     XYZ     DEF       40
2        3     ABC     DEF       10

Rows (1) and (3) are the same. Essentially, row (3) is a duplicate of row (1).

I am looking for the following output:

Is_Duplicate, containing whether the row is a duplicate or not [can be accomplished by using the "duplicated" method on the DataFrame columns (Column2, Column3 and Column4)]

Dup_Index, containing the original index of the first occurrence of the duplicate row.

In [3]: df
Out[3]: 
   Column1 Column2 Column3  Column4  Is_Duplicate  Dup_Index
0        1     ABC     DEF       10         False          0
1        2     XYZ     DEF       40         False          1
2        3     ABC     DEF       10          True          0
asked Feb 19 '13 by user1652054


2 Answers

There is a DataFrame method duplicated for the first part:

In [11]: df.duplicated(['Column2', 'Column3', 'Column4'])
Out[11]: 
0    False
1    False
2     True

In [12]: df['is_duplicated'] = df.duplicated(['Column2', 'Column3', 'Column4'])

To do the second you could try something like this:

In [13]: g = df.groupby(['Column2', 'Column3', 'Column4'])

In [14]: df1 = df.set_index(['Column2', 'Column3', 'Column4'])

In [15]: df1.index.map(lambda ind: g.indices[ind][0])
Out[15]: array([0, 1, 0])

In [16]: df['dup_index'] = df1.index.map(lambda ind: g.indices[ind][0])

In [17]: df
Out[17]: 
   Column1 Column2 Column3  Column4 is_duplicated  dup_index
0        1     ABC     DEF       10         False          0
1        2     XYZ     DEF       40         False          1
2        3     ABC     DEF       10          True          0
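On newer pandas versions, the same dup_index can be computed without the set_index step, by broadcasting each group's first index back over the group with a transform (a minimal sketch, assuming the example DataFrame from the question):

```python
import pandas as pd

df = pd.DataFrame({'Column1': [1, 2, 3],
                   'Column2': ['ABC', 'XYZ', 'ABC'],
                   'Column3': ['DEF', 'DEF', 'DEF'],
                   'Column4': [10, 40, 10]})

keys = ['Column2', 'Column3', 'Column4']
df['is_duplicated'] = df.duplicated(keys)  # False for first occurrences (keep='first' default)
# For each group of identical key rows, broadcast the index of the first row:
df['dup_index'] = df.groupby(keys)['Column1'].transform(lambda s: s.index[0])
# → dup_index is [0, 1, 0], matching the output above
```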
answered Oct 07 '22 by Andy Hayden


Let's say your dataframe is stored in df.

You can use groupby to get the non-duplicated rows of your dataframe. Here we have to ignore Column1, which is not part of the grouping key:

df_nodup = df.groupby(by=['Column2', 'Column3', 'Column4']).first()

You can then merge this new dataframe with the original one by using the merge function:

df = df.merge(df_nodup, left_on=['Column2', 'Column3', 'Column4'], right_index=True, suffixes=('', '_dupindex'))

You can then use the Column1_dupindex column produced by the merge to do the simple math needed to add the requested columns:

df['Is_Duplicate'] = df['Column1']!=df['Column1_dupindex']
df['Dup_Index'] = None
df['Dup_Index'] = df['Dup_Index'].where(df['Column1_dupindex']==df['Column1'], df['Column1_dupindex'])
del df['Column1_dupindex']
answered Oct 07 '22 by Zeugma