
How to remove duplicates in a csv file based on two columns?

Tags:

python

I have a CSV file like this:

column1    column2

john       kerry
adam       stephenson
ashley     hudson
john       kerry
etc..

I want to remove the duplicates from this file so that only the following remains:

column1    column2

john       kerry
adam       stephenson
ashley     hudson

I wrote this script, which removes duplicates based on last names only, but I need to remove duplicates based on both last name and first name.

import csv

reader=csv.reader(open('myfilewithduplicates.csv', 'r'), delimiter=',')
writer=csv.writer(open('myfilewithoutduplicates.csv', 'w'), delimiter=',')

lastnames = set()
for row in reader:
    if row[1] not in lastnames:
        writer.writerow(row)
        lastnames.add( row[1] )
Reveclair asked Oct 12 '12 01:10




1 Answer

You can now use the .drop_duplicates method in pandas. I would do the following:

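import pandas as pd

# Read the CSV, drop rows that repeat the same (column1, column2) pair, and save.
toclean = pd.read_csv('myfilewithduplicates.csv')
deduped = toclean.drop_duplicates(subset=['column1', 'column2'])
deduped.to_csv('myfilewithoutduplicates.csv', index=False)  # index=False omits the row index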
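
If pandas is not an option, the script from the question can probably be adapted by tracking (first name, last name) pairs in the set instead of last names alone. The following is only a minimal sketch under a few assumptions: Python 3, a comma-delimited file, and at least two columns in every non-blank row. The helper names `dedupe_rows` and `dedupe_csv` are hypothetical and not part of any library.

```python
import csv

def dedupe_rows(rows, key_columns=(0, 1)):
    """Yield each row whose values in key_columns have not been seen before."""
    seen = set()
    for row in rows:
        if not row:
            continue  # csv.reader yields [] for blank lines; skip them
        # Assumption: every non-blank row has all the key columns.
        key = tuple(row[i] for i in key_columns)
        if key not in seen:
            seen.add(key)
            yield row

def dedupe_csv(src_path, dst_path, key_columns=(0, 1)):
    """Copy src_path to dst_path, keeping the first occurrence of each key."""
    # newline='' is what the csv module documentation recommends in Python 3.
    with open(src_path, newline='') as src, open(dst_path, 'w', newline='') as dst:
        writer = csv.writer(dst)
        writer.writerows(dedupe_rows(csv.reader(src), key_columns))
```

With the file names from the question, this would be called as dedupe_csv('myfilewithduplicates.csv', 'myfilewithoutduplicates.csv'). The header row is written once like any other row, and the first occurrence of each pair is kept in its original order.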
Bradley answered Oct 03 '22 12:10
