I have the table below, stored in mytest.csv:
timestamp val1 val2 user_id val3 val4 val5 val6
01/01/2011 1 100 3 5 100 3 5
01/02/2013 20 8 6 12 15 3
01/07/2012 19 57 10 9 6 6
01/11/2014 3100 49 6 12 15 3
21/12/2012 240 30 240 30
01/12/2013 63
01/12/2013 3200 51 63 50
The above was obtained using the following code, in which I tried to remove all duplicates (based on 'timestamp' and 'user_id'), but unfortunately some remained:
import pandas as pd
newnames = ['timestamp', 'val1', 'val2','val3', 'val4','val5', 'val6','user_id']
df = pd.read_csv('mytest.csv', names = newnames, header = 0, parse_dates=True, dayfirst=True)
df['timestamp'] = pd.to_datetime(df['timestamp'], dayfirst=True)
df = df.loc[:,['timestamp', 'user_id', 'val1', 'val2','val3', 'val4','val5', 'val6']]
df_clean = df.drop_duplicates().fillna(0)
Also, I would like to know how I can efficiently remove all duplicates from the data (pre-processing), and whether I should do this before reading it into a dataframe. For example, the last two rows are considered duplicates, and only the last one, which does not have an empty val1 (val1 = 3200), should remain in the dataframe.
Thanks in advance for your help.
Remove all duplicate rows from a pandas DataFrame: pass keep=False to drop_duplicates() to drop every row that has a duplicate, including the first occurrence. For example, df.drop_duplicates(keep=False).
The pandas drop_duplicates() method removes duplicate rows from a DataFrame; by default it keeps the first occurrence of each duplicated row.
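To make the difference between the default and keep=False concrete, here is a minimal sketch. The toy data is hypothetical (it is not the asker's file); only the two key columns are included so that the last two rows are exact duplicates.

```python
import pandas as pd

# Hypothetical toy data: the last two rows are exact duplicates of each other.
df = pd.DataFrame({
    "timestamp": ["01/01/2011", "01/12/2013", "01/12/2013"],
    "user_id": [3, 63, 63],
})

# Default (keep="first"): one row per duplicate group survives.
first_kept = df.drop_duplicates()

# keep=False: every row that belongs to a duplicate group is dropped.
none_kept = df.drop_duplicates(keep=False)

print(len(first_kept), len(none_kept))
```

Note that keep=False is probably not what the question needs, since it would discard both rows for user 63 rather than keeping one of them.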
If you want to drop duplicates based on specific columns, you can use the subset argument (older pandas versions: cols) in drop_duplicates:
df_clean = df.drop_duplicates(subset=['timestamp', 'user_id'])
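With subset alone, the first row of each (timestamp, user_id) group is kept, which may be the one with the empty val1. One possible way to keep the most complete row instead is to sort by the number of missing values before dropping duplicates, and to call fillna(0) only afterwards. The sketch below assumes hypothetical data modelled on the last two rows of the question; the helper column n_missing is an illustrative name, not part of the original code.

```python
import numpy as np
import pandas as pd

# Hypothetical data modelled on the question: two rows share the same
# timestamp and user_id, and only the second one has val1 filled in.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["01/01/2011", "01/12/2013", "01/12/2013"], dayfirst=True
    ),
    "user_id": [3, 63, 63],
    "val1": [1.0, np.nan, 3200.0],
    "val2": [100.0, np.nan, 51.0],
})

key = ["timestamp", "user_id"]

df_clean = (
    # Count missing values per row (n_missing is a temporary helper column).
    df.assign(n_missing=df.isna().sum(axis=1))
      # Within each key, put the row with the fewest missing values first.
      .sort_values(key + ["n_missing"])
      # Keep that first, most complete row for each key.
      .drop_duplicates(subset=key, keep="first")
      .drop(columns="n_missing")
      # Fill remaining gaps only after duplicates are gone.
      .fillna(0)
      .reset_index(drop=True)
)

print(df_clean)
```

If the rows in the file are already ordered so that the complete row always comes last, df.drop_duplicates(subset=key, keep='last') would likely be enough without the sort.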