 

Remove duplicate rows from a large file in Python

I have a CSV file that I want to remove duplicate rows from, but it's too large to fit into memory. I found a way to get it done, but my guess is that it's not the best way.

Each row contains 15 fields and several hundred characters, and all fields are needed to determine uniqueness. Instead of comparing entire rows to find a duplicate, I compare hash(row-as-a-string) to save memory. I set a filter that splits the data into partitions with roughly equal numbers of rows (e.g. by day of the week), each small enough that a lookup table of hash values for that partition fits in memory. I pass through the file once for each partition, checking for unique rows and writing them out to a second file (pseudo code):

import csv

# field names abbreviated here; the real file has 15 columns
fieldnames = ['DayOfWeek', 'a', 'b']
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']

outs = csv.DictWriter(open(r'c:\dedupedFile.csv', 'wb'), fieldnames)
# DictWriter in Python 2.5 has no writeheader(); write the header row by hand
outs.writerow(dict(zip(fieldnames, fieldnames)))

for day in days:
    htable = {}                 # hashes of rows already seen in this partition
    ins = csv.DictReader(open(r'c:\bigfile.csv', 'rb'), fieldnames)
    for line in ins:
        if line['DayOfWeek'] == day:
            # hash the concatenated fields instead of storing the whole row
            hvalue = hash(reduce(lambda x, y: x + y, line.itervalues()))
            if hvalue not in htable:
                htable[hvalue] = None
                outs.writerow(line)

One way I was thinking of speeding this up is to find a better filter that reduces the number of passes needed. Assuming the length of the rows is roughly uniformly distributed, maybe instead of

for day in days: 

and

if line['DayOfWeek']==day:

we have

for i in range(n):

and

if len(reduce(lambda x,y:x+y,line.itervalues()))%n==i:

where 'n' is as small as memory will allow. But this is still the same basic method.
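Spelled out, that variant would look something like this (same abbreviated field layout as the pseudo code above; n = 4 is just a placeholder value):

import csv

fieldnames = ['DayOfWeek', 'a', 'b']    # abbreviated, as above
n = 4                                   # number of partitions; as small as memory allows

outs = csv.DictWriter(open(r'c:\dedupedFile.csv', 'wb'), fieldnames)
outs.writerow(dict(zip(fieldnames, fieldnames)))

for i in range(n):
    htable = {}
    ins = csv.DictReader(open(r'c:\bigfile.csv', 'rb'), fieldnames)
    for line in ins:
        row_as_string = reduce(lambda x, y: x + y, line.itervalues())
        # partition on total row length instead of DayOfWeek
        if len(row_as_string) % n == i:
            hvalue = hash(row_as_string)
            if hvalue not in htable:
                htable[hvalue] = None
                outs.writerow(line)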

Wayne Werner provided a good practical solution below; I was curious whether there was a better/faster/simpler way to do this from an algorithm perspective.

P.S. I'm limited to Python 2.5.

asked Aug 10 '10 by JonC


2 Answers

If you want a really simple way to do this, just create a sqlite database:

import csv
import sqlite3
conn = sqlite3.connect('single.db')
cur = conn.cursor()
cur.execute("""create table test(
f1 text,
f2 text,
f3 text,
f4 text,
f5 text,
f6 text,
f7 text,
f8 text,
f9 text,
f10 text,
f11 text,
f12 text,
f13 text,
f14 text,
f15 text,
primary key(f1,  f2,  f3,  f4,  f5,  f6,  f7,  
            f8,  f9,  f10,  f11,  f12,  f13,  f14,  f15))
"""
conn.commit()

# simplified code: feed every row of the original file into the table
reader = csv.reader(open(r'c:\bigfile.csv', 'rb'))
for row in reader:
    # row is a list of the 15 field values
    try:
        cur.execute('''insert into test values(?, ?, ?, ?, ?, ?, ?, 
                       ?, ?, ?, ?, ?, ?, ?, ?)''', row)
        conn.commit()
    except sqlite3.IntegrityError:
        # duplicate primary key -> this exact row was already inserted; skip it
        pass

conn.commit()
cur.execute('select * from test')

writer = csv.writer(open(r'c:\dedupedFile.csv', 'wb'))
for row in cur:
    writer.writerow(row)

Then you wouldn't have to worry about any of the comparison logic yourself - just let sqlite take care of it for you. It probably won't be much faster than hashing the strings, but it's probably a lot easier. You can change the column types if you want to, and since you're already converting each row to a string you could store just one field instead of fifteen. Plenty of options here.
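For what it's worth, a minimal sketch of that single-field variant (not from the answer above; the table name is made up, and joining on ',' assumes no field contains an embedded comma):

import csv
import sqlite3

conn = sqlite3.connect('single.db')
cur = conn.cursor()
# one column holding the whole row; the primary key enforces uniqueness
cur.execute('create table if not exists lines (line text primary key)')

reader = csv.reader(open(r'c:\bigfile.csv', 'rb'))
for row in reader:
    # INSERT OR IGNORE silently drops rows whose key is already present
    cur.execute('insert or ignore into lines values (?)', (','.join(row),))
conn.commit()

writer = csv.writer(open(r'c:\dedupedFile.csv', 'wb'))
for (line,) in cur.execute('select line from lines'):
    writer.writerow(line.split(','))

Since the whole row is the key, sqlite does all of the duplicate detection and no per-field schema is needed.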

answered by Wayne Werner


You are basically doing a merge sort and removing duplicate entries.

Breaking the input into memory-sized pieces, sorting each piece, then merging the pieces while removing duplicates is a sound idea in general.

Actually, for anything up to a couple of gigs I would let the virtual memory system handle it and just write:

import itertools

# sorted() reads all lines into memory and sorts them; groupby() then
# collapses each run of identical lines so only one copy is written.
infile = open(infilename, 'rb')
outfile = open(outfilename, 'wb')

for line, group in itertools.groupby(sorted(infile)):
    outfile.write(line)
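If the file really is too big for that, the same merge-sort idea can be made explicit: sort fixed-size chunks into temporary files, then do a k-way merge that writes each distinct line once. The following is only a sketch of that approach (not from the answer): dedupe_large and the chunk size are placeholder names, every line is assumed to end with a newline, and it sticks to modules available in Python 2.5 (heapq.merge only arrived in 2.6).

import heapq
import itertools
import tempfile

def dedupe_large(infilename, outfilename, chunk_size=500000):
    # Pass 1: sort manageable chunks of the input and spill each to a temp file.
    chunks = []
    infile = open(infilename, 'rb')
    while True:
        lines = list(itertools.islice(infile, chunk_size))
        if not lines:
            break
        lines.sort()
        tmp = tempfile.TemporaryFile()
        tmp.writelines(lines)
        tmp.seek(0)
        chunks.append(tmp)
    infile.close()

    # Pass 2: k-way merge of the sorted chunks via a heap of (line, chunk index).
    # Equal lines come out of the merge consecutively, so a line is written only
    # when it differs from the previous line written.
    outfile = open(outfilename, 'wb')
    heap = [(f.readline(), i) for i, f in enumerate(chunks)]
    heapq.heapify(heap)
    last = None
    while heap:
        line, i = heapq.heappop(heap)
        if line != last:
            outfile.write(line)
            last = line
        nxt = chunks[i].readline()
        if nxt:
            heapq.heappush(heap, (nxt, i))
    outfile.close()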
answered by Joe Koberg