How to efficiently remove all duplicates from a DataFrame or CSV file in Python?

Tags: python, pandas

I have the following table contained in mytest.csv:

timestamp   val1    val2    user_id  val3  val4    val5    val6
01/01/2011  1   100 3    5     100     3       5
01/02/2013  20  8        6     12      15      3
01/07/2012      19  57   10    9       6       6        
01/11/2014  3100    49  6        12    15      3
21/12/2012          240  30    240     30       
01/12/2013          63                  
01/12/2013  3200    51  63       50

The table above was obtained with the following code, in which I tried to remove all duplicates (based on 'timestamp' and 'user_id'), but unfortunately some remained:

import pandas as pd

newnames = ['timestamp', 'val1', 'val2', 'val3', 'val4', 'val5', 'val6', 'user_id']
# header=None (rather than header=False) tells read_csv the file has no header row
df = pd.read_csv('mytest.csv', names=newnames, header=None, parse_dates=True, dayfirst=True)
df['timestamp'] = pd.to_datetime(df['timestamp'], dayfirst=True)
df = df.loc[:, ['timestamp', 'user_id', 'val1', 'val2', 'val3', 'val4', 'val5', 'val6']]
df_clean = df.drop_duplicates().fillna(0)

Also, I would like to know how I can efficiently remove all duplicates from the data (as a pre-processing step), and whether I should do this before reading it into a DataFrame. For example, the last two rows are considered duplicates, and only the one whose val1 is not empty (val1 = 3200) should remain in the DataFrame.

Thanks in advance for your help.

asked Apr 04 '14 by Space
People also ask

How do I remove all duplicates from a DataFrame in Python?

You can set keep=False in the drop_duplicates() function to remove all the duplicate rows, e.g. df.drop_duplicates(keep=False) (see the short sketch after this section).

What is a correct method to remove duplicates from a Pandas DataFrame?

The pandas drop_duplicates() method removes duplicate rows from a pandas DataFrame in Python.

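For illustration, here is a minimal, self-contained sketch (not from the page above; the data is made up for the example) of the difference between the default behaviour of drop_duplicates(), which keeps the first occurrence, and keep=False, which drops every row that has a duplicate:

import pandas as pd

# Toy data: the first two rows share the same (timestamp, user_id).
df = pd.DataFrame({'timestamp': ['01/12/2013', '01/12/2013', '01/11/2014'],
                   'user_id':   [63, 63, 6]})

# Default: keeps the first row of each duplicate group.
print(df.drop_duplicates(subset=['timestamp', 'user_id']))

# keep=False: drops every row that has a duplicate, leaving only unique rows.
print(df.drop_duplicates(subset=['timestamp', 'user_id'], keep=False))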

1 Answer

If you want to drop duplicates based on specific columns, you can use the subset argument (older pandas versions: cols) in drop_duplicates:

df_clean = df.drop_duplicates(subset=['timestamp', 'user_id'])
answered by joris
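To additionally make sure that the surviving row is the one whose val1 is not empty (as required for the last two rows of the question's example), one possible approach, not part of the answer above, is to sort so that rows with a missing val1 come first and then keep the last occurrence of each (timestamp, user_id) group. A minimal sketch, with data reproducing only those two rows:

import pandas as pd

# Minimal data reproducing the last two rows of the question.
df = pd.DataFrame({'timestamp': pd.to_datetime(['01/12/2013', '01/12/2013'], dayfirst=True),
                   'user_id':   [63, 63],
                   'val1':      [None, 3200],
                   'val2':      [None, 51]})

# Rows with a missing val1 are sorted to the front, so keep='last' retains
# the duplicate whose val1 is filled in (val1 = 3200).
df_clean = (df.sort_values('val1', na_position='first')
              .drop_duplicates(subset=['timestamp', 'user_id'], keep='last'))
print(df_clean)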