find duplicate rows in a pandas dataframe

Tags: python, pandas, dataframe, duplicates

I am trying to find duplicate rows in a pandas DataFrame.

import pandas as pd

df = pd.DataFrame(data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]], columns=['col1', 'col2'])

df
Out[15]: 
   col1  col2
0     1     2
1     3     4
2     1     2
3     1     4
4     1     2

duplicate_bool = df.duplicated(subset=['col1', 'col2'], keep='first')
duplicate = df.loc[duplicate_bool]

duplicate
Out[16]: 
   col1  col2
2     1     2
4     1     2

Is there a way to add a column holding the index of the first occurrence (the one kept)? Desired output:

duplicate
Out[16]: 
   col1  col2  index_original
2     1     2               0
4     1     2               0

Note: df could be very large in my case.

asked Nov 08 '17 13:11

gabboshow

1 Answer

Use groupby, create a new column of indexes, and then call duplicated:

df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin')    
df[df.duplicated(subset=['col1','col2'], keep='first')]

   col1  col2  index_original
2     1     2               0
4     1     2               0

Details

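I group by the first two columns, then call transform with idxmin to get the first index of each group. Because col1 is itself a grouping key, its values are identical within a group, so idxmin returns the label of the group's first row.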

df.groupby(['col1', 'col2']).col1.transform('idxmin') 

0    0
1    1
2    0
3    3
4    0
Name: col1, dtype: int64
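
The same lookup can also be written without relying on how idxmin breaks ties. This is only a sketch of one possible alternative, not part of the original answer: it groups the index itself and takes the first label per group. The names idx and first_idx are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
                  columns=['col1', 'col2'])

# Turn the index into a Series so it can be grouped like a column.
idx = df.index.to_series()

# For every row, look up the first index label in its (col1, col2) group.
first_idx = idx.groupby([df['col1'], df['col2']]).transform('first')

print(first_idx.tolist())
```

Assigning first_idx to df['index_original'] should give the same column as the idxmin version on this data.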

duplicated gives me a boolean mask of the rows I want to keep, that is, every repeat after the first occurrence:

df.duplicated(subset=['col1','col2'], keep='first')

0    False
1    False
2     True
3    False
4     True
dtype: bool
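
For comparison, here is a small sketch of how the keep argument changes the mask on the same data. keep='last' and keep=False are standard options of DataFrame.duplicated, but they are not used in this answer; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
                  columns=['col1', 'col2'])
cols = ['col1', 'col2']

# keep='first': the first occurrence is treated as the original, later repeats are flagged.
mask_first = df.duplicated(subset=cols, keep='first')

# keep='last': the last occurrence is treated as the original, earlier repeats are flagged.
mask_last = df.duplicated(subset=cols, keep='last')

# keep=False: every row that has a twin is flagged, including the first one.
mask_all = df.duplicated(subset=cols, keep=False)

print(mask_first.tolist())
print(mask_last.tolist())
print(mask_all.tolist())
```

With keep='first', the flagged rows are exactly the ones whose index_original points at an earlier row.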

The rest is just boolean indexing.
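
Putting the steps together, a minimal sketch might look like the following. The function name duplicates_with_original_index is hypothetical and not part of pandas, and it works on a copy so the caller's frame is left unchanged.

```python
import pandas as pd


def duplicates_with_original_index(df, cols):
    """Return rows repeating an earlier row on `cols`, with the first occurrence's index."""
    out = df.copy()
    # All values of cols[0] are equal within a group, so idxmin yields the first label.
    out['index_original'] = out.groupby(cols)[cols[0]].transform('idxmin')
    # Keep only the repeats; the first occurrence of each group is dropped.
    return out[out.duplicated(subset=cols, keep='first')]


df = pd.DataFrame(data=[[1, 2], [3, 4], [1, 2], [1, 4], [1, 2]],
                  columns=['col1', 'col2'])
result = duplicates_with_original_index(df, ['col1', 'col2'])
print(result)
```

The copy costs extra memory, so for a very large frame it may be preferable to assign the column in place, as the answer does.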

answered Sep 21 '22 03:09

cs95
