Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Python: Removing duplicate CSV entries

Tags:

python

csv

I have a CSV file with multiple entries. Example csv:

user, phone, email
joe, 123, [email protected]
mary, 456, [email protected]
ed, 123, [email protected]

I'm trying to remove the duplicates by a specific column in the CSV however with the code below I'm getting an "list index out of range". I thought by comparing row[1] with newrows[1] I would find all duplicates and only rewrite the unique entries in file2.csv. This doesn't work though and I can't understand why.

f1 = csv.reader(open('file1.csv', 'rb'))
    newrows = []
    for row in f1:
        if row[1] not in newrows[1]:
            newrows.append(row)
    writer = csv.writer(open("file2.csv", "wb"))
    writer.writerows(newrows)

My end result is to have a list that maintains the sequence of the file (set won't work...right?) which should look like this:

user, phone, email
joe, 123, [email protected]
mary, 456, [email protected]
like image 364
serk Avatar asked Oct 07 '11 03:10

serk


People also ask

How do I find duplicates in a csv file in Python?

Method 1: Read the csv file and pass it into the data frame. Then, identify the duplicate rows using the duplicated() function. Finally, use the print statement to display the duplicate rows.

How do you remove duplicate records in Python?

You can set 'keep=False' in the drop_duplicates() function to remove all the duplicate rows. For E.x, df. drop_duplicates(keep=False) .

How do I find duplicates in CSV?

Find Duplicates To find duplicate values in a column, click the column header and select Histogram. This will count how many many times each value appears in the dataset. You can search the Histogram for values that show up more than once.

How do you remove duplicate rows in Excel using Python?

We use drop_duplicates() function to remove duplicate records from a data frame in Python scripts.


1 Answers

row[1] refers to the second column in the current row (phone). That's all well in good.

However, you newrows.append(row) add the entire row to the list.

When you check row[1] in newrows you are checking the individual phone number against a list of complete rows. But that's not what you want to do. You need to check against a list or set of just phone numbers. For that, you probably want to keep track of the rows and a set of the observed phone numbers.

Something like:

f1 = csv.reader(open('file1.csv', 'rb'))
writer = csv.writer(open("file2.csv", "wb"))
phone_numbers = set()
for row in f1:
    if row[1] not in phone_numbers:
        writer.writerow(row)
        phone_numbers.add( row[1] )
like image 125
Winston Ewert Avatar answered Oct 13 '22 16:10

Winston Ewert