
Count duplicate rows in a csv using python

Tags:

python

csv

I imagine this is an easy one for a decent Python dev; I'm still learning! Given a CSV with duplicate emails, I would like to iterate over it and write out how many times each email occurs, e.g.:

infile.csv

COLUMN 0
[email protected]
[email protected]
[email protected]
[email protected]

outfile.csv

COLUMN 0                 COLUMN 1
[email protected]           2
[email protected]      1
[email protected]        1

So far I can remove duplicates with

import csv

f = csv.reader(open('infile.csv','rb'))
writer = csv.writer(open('outfile.csv','wb'))
emails = set()


for row in f:
    if row[0] not in emails:
        writer.writerow(row)
        emails.add( row[0] )

but I am having trouble writing the count to a new column.

enkdr asked Mar 09 '26 02:03



1 Answer

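Use defaultdict from the collections module, which is available in Python 2.6 (it was added in 2.5). Reusing f and writer from the question: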

from collections import defaultdict

# count all the emails before we write anything out
emails = defaultdict(int)
for row in f:
    emails[row[0]] += 1

# now write one row per distinct email: (email, count)
for row in emails.items():
    writer.writerow(row)
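
For readers on Python 3, the same count-then-write idea can be sketched as a self-contained function. This is an illustration rather than part of the original answer: it swaps defaultdict(int) for collections.Counter, opens the files in text mode with newline='' instead of 'rb'/'wb', and the function name count_duplicates and its column parameter are made up for this example. It assumes the file has no header row; if it does, the header is counted like any other value.

```python
import csv
from collections import Counter


def count_duplicates(in_path, out_path, column=0):
    """Write each distinct value found in `column` next to its occurrence count.

    Hypothetical helper for illustration; not part of the original answer.
    """
    # Python 3: open csv files in text mode with newline='' (not 'rb'/'wb').
    with open(in_path, newline='') as src:
        # csv.reader yields [] for blank lines, so skip those rows.
        counts = Counter(row[column] for row in csv.reader(src) if row)

    with open(out_path, 'w', newline='') as dst:
        # Each (value, count) pair becomes one two-column row.
        # On Python 3.7+ the rows come out in first-seen order.
        csv.writer(dst).writerows(counts.items())

    return counts
```

Calling count_duplicates('infile.csv', 'outfile.csv') would produce the two-column layout from the question. If the rows should be sorted by frequency instead, counts.most_common() can be passed to writerows in place of counts.items().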
John La Rooy answered Mar 11 '26 15:03



