I have a question about removing duplicates in Python. I've read a bunch of posts but have not yet been able to solve it. I have the following csv file:
EDIT
Input:
ID, Source, 1.A, 1.B, 1.C, 1.D
1, ESPN, 5,7,,,M
1, NY Times,,10,12,W
1, ESPN, 10,,Q,,M
Output should be:
ID, Source, 1.A, 1.B, 1.C, 1.D, duplicate_flag
1, ESPN, 5,7,,,M, duplicate
1, NY Times,,10,12,W, duplicate
1, ESPN, 10,,Q,,M, duplicate
1, NY Times, 5 (or 10 doesn't matter which one),7, 10, 12, W, not_duplicate
In words: if the ID is the same, take values from the row with source "NY Times". If the "NY Times" row has a blank value and the duplicate row from the "ESPN" source has a value for that cell, take the value from the "ESPN" row. For output, flag the original lines as duplicates and add a new merged line.
To clarify a bit further, since I need to run this script on many different csv files with different column headers, I can't do something like:
def main():
    with open(input_csv, "rb") as infile:
        input_fields = ("ID", "Source", "1.A", "1.B", "1.C", "1.D")
        reader = csv.DictReader(infile, fieldnames = input_fields)
        with open(output_csv, "wb") as outfile:
            output_fields = ("ID", "Source", "1.A", "1.B", "1.C", "1.D", "d_flag")
            writer = csv.DictWriter(outfile, fieldnames = output_fields)
            writer.writerow(dict((h,h) for h in output_fields))
            next(reader)
            first_row = next(reader)
            for next_row in reader:
                #stuff
Because I want the program to run on the first two columns independently of whatever other columns are in the table. In other words, "ID" and "Source" will be in every input file, but the rest of the columns will change depending on the file.
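One way to avoid hard-coding the field names is to let csv.DictReader take them from the file's first line. The following is a minimal sketch of that idea, not a full solution: the helper name read_rows and the sample data are hypothetical, and it assumes only that "ID" and "Source" appear in the header.

```python
import csv
import io

def read_rows(csvfile):
    """Return (fieldnames, rows); only 'ID' and 'Source' are assumed to exist."""
    # skipinitialspace=True drops the blank that follows each comma
    reader = csv.DictReader(csvfile, skipinitialspace=True)
    # With no fieldnames argument, DictReader reads them from the first line
    fieldnames = reader.fieldnames or []
    for required in ("ID", "Source"):
        if required not in fieldnames:
            raise ValueError("missing column: " + required)
    return fieldnames, list(reader)

# Hypothetical sample data standing in for one of the input files
sample = io.StringIO("ID, Source, 1.A, 1.B\n1, ESPN, 5,7\n1, NY Times,,10\n")
fieldnames, rows = read_rows(sample)

# Every column other than the two fixed ones, whatever the file contains
other_columns = [name for name in fieldnames if name not in ("ID", "Source")]
```

With a real file, the same function could be called as read_rows(open(path, newline="")), so the remaining columns never have to be listed by hand.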
Would greatly appreciate any help you can provide! FYI, "Source" can only be: NY Times, ESPN, or Wall Street Journal and the order of priority for duplicates is: take NY Times if available, otherwise take ESPN, otherwise take Wall Street Journal. This holds for every input file.
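The priority rule can be sketched on its own, independent of file handling. This is an illustration under stated assumptions rather than a definitive implementation: the names PRIORITY and merge_rows and the sample rows are hypothetical, every row is assumed to have the same number of fields, and "Source" is assumed to be one of the three names listed above.

```python
# Highest-priority source first, as described in the question
PRIORITY = ["NY Times", "ESPN", "Wall Street Journal"]

def merge_rows(rows, source_index):
    """Merge rows that share one ID, preferring higher-priority sources."""
    # sorted() is stable, so rows from the same source keep their input order.
    # PRIORITY.index raises ValueError for a source outside the three allowed.
    ordered = sorted(rows, key=lambda row: PRIORITY.index(row[source_index]))
    merged = list(ordered[0])  # copy, so the input rows stay untouched
    for row in ordered[1:]:
        for i, value in enumerate(row):
            # Only fill cells that are still blank
            if merged[i] == "" and value != "":
                merged[i] = value
    return merged

# Hypothetical, well-formed sample rows for a single ID
rows = [
    ["1", "ESPN", "5", "7", "", ""],
    ["1", "NY Times", "", "10", "12", "W"],
]
print(merge_rows(rows, source_index=1))
# -> ['1', 'NY Times', '5', '10', '12', 'W']
```

The "NY Times" row wins wherever it has a value, and the "ESPN" row only supplies the cell that "NY Times" left blank.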
The code below reads all of the records into a dictionary keyed by ID. Each value is itself a dictionary that maps a source name to that source's row, plus an "all_rows" entry holding the original rows for that ID. It then iterates through the dictionary and prints the output you asked for.
import csv

header = None
idfld = None
sourcefld = None
record_table = {}

# Python 3: open in text mode with newline='' so the csv module handles line endings
# Assumes every data row has the same number of fields as the header
with open('input.csv', newline='') as csvfile:
    reader = csv.reader(csvfile)
    for row in reader:
        row = [x.strip() for x in row]
        if header is None:
            # First line: remember where the ID and Source columns are
            header = row
            for i, fld in enumerate(header):
                if fld == 'ID':
                    idfld = i
                elif fld == 'Source':
                    sourcefld = i
            continue
        key = row[idfld]
        sourcename = row[sourcefld]
        if key not in record_table:
            # Store a copy per source so merging never alters the original row
            record_table[key] = {sourcename: list(row), "all_rows": [row]}
        else:
            if sourcename in record_table[key]:
                # Same ID and source seen again: fill blanks in the stored row
                cur_row = record_table[key][sourcename]
                for i, fld in enumerate(row):
                    if cur_row[i] == '':
                        cur_row[i] = fld
            else:
                record_table[key][sourcename] = list(row)
            record_table[key]["all_rows"].append(row)

print(', '.join(header) + ', duplicate_flag')
for recordid in record_table:
    rowdict = record_table[recordid]
    final_row = [''] * len(header)
    # Count the input rows; len(rowdict) would also count the "all_rows" key
    rowcount = len(rowdict["all_rows"])
    # Visit sources in priority order; earlier sources win, later ones fill blanks
    for sourcetype in ['NY Times', 'ESPN', 'Wall Street Journal']:
        if sourcetype in rowdict:
            row = rowdict[sourcetype]
            for i, fld in enumerate(row):
                if final_row[i] != '':
                    continue
                if fld != '':
                    final_row[i] = fld
    if rowcount > 1:
        for row in rowdict["all_rows"]:
            print(', '.join(row) + ', duplicate')
    print(', '.join(final_row) + ', not_duplicate')
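Joining fields with ', ' works for the sample data, but it would not quote a field that itself contains a comma. If the result needs to be a proper csv file, the output step could use csv.writer instead. This is a sketch only: the function name write_flagged, its groups argument (a list of (original_rows, merged_row) pairs, one per ID) and the sample values are hypothetical.

```python
import csv
import io

def write_flagged(outfile, header, groups):
    """Write original rows and merged rows with a trailing duplicate_flag column."""
    writer = csv.writer(outfile)
    writer.writerow(header + ["duplicate_flag"])
    for original_rows, merged_row in groups:
        # Originals are only flagged when the ID occurred more than once
        if len(original_rows) > 1:
            for row in original_rows:
                writer.writerow(row + ["duplicate"])
        writer.writerow(merged_row + ["not_duplicate"])

# Hypothetical data for one ID; io.StringIO stands in for a real output file
out = io.StringIO()
write_flagged(
    out,
    ["ID", "Source", "1.A"],
    [([["1", "ESPN", "5"], ["1", "NY Times", ""]], ["1", "NY Times", "5"])],
)
print(out.getvalue())
```

For a real file, open it with open(output_csv, "w", newline="") and pass that handle in place of the StringIO object.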