I'm working with a 13.9 GB CSV file that contains around 16 million rows and 85 columns. I know there are potentially a few hundred thousand rows that are duplicates. I ran this code to remove them:
import pandas

concatDf = pandas.read_csv("C:\\OUT\\Concat EPC3.csv", low_memory=False)
nodupl = concatDf.drop_duplicates()
nodupl.to_csv("C:\\OUT\\Concat EPC3- NoDupl.csv", index=False)
However, this runs into a MemoryError. My RAM is 16 GB and can't go any higher. Is there a more efficient way of removing the duplicates, perhaps one that works in chunks, without me having to break the CSV up into smaller files?
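For reference, one way to do this in chunks with pandas itself is to combine read_csv's chunksize option with a set of row hashes, in the spirit of the answers below. This is only a sketch: the 1M-row chunk size, the dtype=str read, and the repr-based row key are assumptions, not anything from the thread.

import hashlib
import pandas as pd

in_path = "C:\\OUT\\Concat EPC3.csv"
out_path = "C:\\OUT\\Concat EPC3- NoDupl.csv"

seen = set()        # MD5 digests of rows already written
first_chunk = True
# Read 1M rows at a time so only one chunk is ever held in memory;
# dtype=str keeps values textual so identical rows hash identically across chunks
for chunk in pd.read_csv(in_path, chunksize=1_000_000, dtype=str):
    keep = []
    for row in chunk.itertuples(index=False, name=None):
        digest = hashlib.md5(repr(row).encode()).digest()  # 16-byte key per row
        keep.append(digest not in seen)
        seen.add(digest)
    # Append surviving rows; write the header only once
    chunk[keep].to_csv(out_path, mode="w" if first_chunk else "a",
                       header=first_chunk, index=False)
    first_chunk = False

Only the current chunk and the digest set stay in memory, so peak usage is far below the full 13.9 GB.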
You should probably remove them. Duplicates are an extreme case of nonrandom sampling, and they bias your fitted model. Including them will essentially lead to the model overfitting this subset of points.
Essentially the same idea as zwer's answer, but checking rows with the same hash for actual equality (instead of automatically discarding any row whose hash has been seen before).
file_in = "C:\\OUT\\Concat EPC3.csv"
file_out = "C:\\OUT\\Concat EPC3- NoDupl.csv"

with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
    # Skip header
    next(f_in)
    # First pass: find hashes that occur more than once
    hashes = set()
    hashes_dup = {}
    for row in f_in:
        h = hash(row)
        if h in hashes:
            hashes_dup[h] = set()
        else:
            hashes.add(h)
    del hashes
    # Rewind file
    f_in.seek(0)
    # Copy header
    f_out.write(next(f_in))
    # Second pass: copy each line once; for duplicated hashes,
    # compare the full row text to rule out hash collisions
    for row in f_in:
        h = hash(row)
        if h in hashes_dup:
            dups = hashes_dup[h]
            if row in dups:
                continue
            dups.add(row)
        f_out.write(row)
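Two details worth noting about this version: Python's built-in hash() is randomized per process for strings, but it is stable within a single run, which is all the two passes need since both happen in the same process. And because full row text is stored only for the hashes that actually repeat, a genuine hash collision between two different rows is harmless - both rows get written.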
The simplest solution would be to create a hash table for each line in the file - storing 16M hashes in working memory shouldn't be a problem (it depends on the hash size, though) - then iterate over your file again and make sure you write down only one occurrence of each hash. You don't even need to parse the CSV, nor do you need pandas.
import hashlib

with open("input.csv", "r") as f_in, \
        open("output.csv", "w") as f_out:
    seen = set()  # a set to hold our 'visited' lines
    for line in f_in:  # iterate over the input file line by line
        line_hash = hashlib.md5(line.encode()).digest()  # hash the value
        if line_hash not in seen:  # we're seeing this line for the first time
            seen.add(line_hash)  # add it to the hash table
            f_out.write(line)  # write the line to the output
This uses MD5 as the hash, so it takes about 16 B plus set overhead per line - still far less than storing everything in memory. You can expect roughly 500 MB of memory usage for a 16M-line CSV file.
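To sanity-check the result afterwards, a quick line count of the input against the output shows how many rows were dropped (a trivial sketch; the file names are the ones from this answer):

def count_lines(path):
    # Stream the file so the count itself stays memory-cheap
    with open(path, "r") as f:
        return sum(1 for _ in f)

print(count_lines("input.csv") - count_lines("output.csv"), "duplicate lines removed")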