I have a CSV file with multiple entries. Example csv:
user, phone, email
joe, 123, [email protected]
mary, 456, [email protected]
ed, 123, [email protected]
I'm trying to remove the duplicates by a specific column in the CSV however with the code below I'm getting an "list index out of range". I thought by comparing row[1]
with newrows[1]
I would find all duplicates and only rewrite the unique entries in file2.csv
. This doesn't work though and I can't understand why.
f1 = csv.reader(open('file1.csv', 'rb'))
newrows = []
for row in f1:
if row[1] not in newrows[1]:
newrows.append(row)
writer = csv.writer(open("file2.csv", "wb"))
writer.writerows(newrows)
My end result is to have a list that maintains the sequence of the file (set
won't work...right?) which should look like this:
user, phone, email
joe, 123, [email protected]
mary, 456, [email protected]
Method 1: Read the csv file and pass it into the data frame. Then, identify the duplicate rows using the duplicated() function. Finally, use the print statement to display the duplicate rows.
You can set 'keep=False' in the drop_duplicates() function to remove all the duplicate rows. For E.x, df. drop_duplicates(keep=False) .
Find Duplicates To find duplicate values in a column, click the column header and select Histogram. This will count how many many times each value appears in the dataset. You can search the Histogram for values that show up more than once.
We use drop_duplicates() function to remove duplicate records from a data frame in Python scripts.
row[1]
refers to the second column in the current row (phone). That's all well in good.
However, you newrows.append(row)
add the entire row to the list.
When you check row[1] in newrows
you are checking the individual phone number against a list of complete rows. But that's not what you want to do. You need to check against a list or set of just phone numbers. For that, you probably want to keep track of the rows and a set of the observed phone numbers.
Something like:
f1 = csv.reader(open('file1.csv', 'rb'))
writer = csv.writer(open("file2.csv", "wb"))
phone_numbers = set()
for row in f1:
if row[1] not in phone_numbers:
writer.writerow(row)
phone_numbers.add( row[1] )
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With