Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compare two CSV files and search for similar items

So I've got two CSV files that I'm trying to compare and get the results of the similar items. The first file, hosts.csv is shown below:

Path    Filename    Size    Signature
C:\     a.txt       14kb    012345
D:\     b.txt       99kb    678910
C:\     c.txt       44kb    111213

The second file, masterlist.csv is shown below:

Filename    Signature
b.txt       678910
x.txt       111213
b.txt       777777
c.txt       999999

As you can see the rows do not match up and the masterlist.csv is always larger than the hosts.csv file. The only portion that I'd like to search for is the Signature portion. I know this would look something like:

hosts[3] == masterlist[1]

I am looking for a solution that will give me something like the following (basically the hosts.csv file with a new RESULTS column):

Path    Filename    Size    Signature    RESULTS
C:\     a.txt       14kb    012345       NOT FOUND in masterlist
D:\     b.txt       99kb    678910       FOUND in masterlist (row 1)
C:\     c.txt       44kb    111213       FOUND in masterlist (row 2)

I've searched the posts and found something similar to this here but I don't quite understand it as I'm still learning python.

Edit Using Python 2.6

like image 400
serk Avatar asked Mar 11 '11 04:03

serk


2 Answers

The answer by srgerg is terribly inefficient, as it operates in quadratic time; here is a linear time solution instead, using Python 2.6-compatible syntax:

import csv

with open('masterlist.csv', 'rb') as master:
    master_indices = dict((r[1], i) for i, r in enumerate(csv.reader(master)))

with open('hosts.csv', 'rb') as hosts:
    with open('results.csv', 'wb') as results:    
        reader = csv.reader(hosts)
        writer = csv.writer(results)

        writer.writerow(next(reader, []) + ['RESULTS'])

        for row in reader:
            index = master_indices.get(row[3])
            if index is not None:
                message = 'FOUND in master list (row {})'.format(index)
            else:
                message = 'NOT FOUND in master list'
            writer.writerow(row + [message])

This produces a dictionary, mapping signatures from masterlist.csv to a line number first. Lookups in a dictionary take constant time, making the second loop over hosts.csv rows independant from the number of rows in masterlist.csv. Not to mention code that's a lot simpler.

For those using Python 3, the above only needs to have the open() calls adjusted to open in text mode (remove the b from the file mode), and you want to add new line='' so the CSV reader can take control of line separators. You may want to state the encoding to use explicitly rather than rely on your system default (use encoding=...). The master_indices mapping can be built with a dictionary comprehension ({r[1]: i for i, r in enumerate(csv.reader(master))}).

like image 198
Martijn Pieters Avatar answered Sep 20 '22 02:09

Martijn Pieters


Edit: While my solution works correctly, check out Martijn's answer below for a more efficient solution.

You can find the documentation for the python CSV module here.

What you're looking for is something like this:

import csv

f1 = file('hosts.csv', 'r')
f2 = file('masterlist.csv', 'r')
f3 = file('results.csv', 'w')

c1 = csv.reader(f1)
c2 = csv.reader(f2)
c3 = csv.writer(f3)

masterlist = list(c2)

for hosts_row in c1:
    row = 1
    found = False
    for master_row in masterlist:
        results_row = hosts_row
        if hosts_row[3] == master_row[1]:
            results_row.append('FOUND in master list (row ' + str(row) + ')')
            found = True
            break
        row = row + 1
    if not found:
        results_row.append('NOT FOUND in master list')
    c3.writerow(results_row)

f1.close()
f2.close()
f3.close()
like image 43
srgerg Avatar answered Sep 19 '22 02:09

srgerg