I have a csv file like this :
column1 column2
john kerry
adam stephenson
ashley hudson
john kerry
etc..
I want to remove duplicates from this file, to get only :
column1 column2
john kerry
adam stephenson
ashley hudson
I wrote this script that removes duplicates based on lastnames, but I need to remove duplicates based on lastnames AND firstname.
import csv
reader=csv.reader(open('myfilewithduplicates.csv', 'r'), delimiter=',')
writer=csv.writer(open('myfilewithoutduplicates.csv', 'w'), delimiter=',')
lastnames = set()
for row in reader:
if row[1] not in lastnames:
writer.writerow(row)
lastnames.add( row[1] )
Often you may want to remove duplicate rows based on two columns in Excel. Fortunately this is easy to do using the Remove Duplicates function within the Data tab.
Navigate to the "Home" option and select duplicate values in the toolbar. Next, navigate to Conditional Formatting in Excel Option. A new window will appear on the screen with options to select "Duplicate" and "Unique" values. You can compare the two columns with matching values or unique values.
You can now use the .drop_duplicates method in pandas. I would do the following:
import pandas as pd
toclean = pd.read_csv('myfilewithduplicates.csv')
deduped = toclean.drop_duplicates([col1,col2])
deduped.to_csv('myfilewithoutduplicates.csv')
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With