
Removing duplicates on very large datasets

I'm working with a 13.9 GB CSV file that contains around 16 million rows and 85 columns. I know that potentially a few hundred thousand of those rows are duplicates. I ran this code to remove them:

import pandas

concatDf = pandas.read_csv("C:\\OUT\\Concat EPC3.csv", low_memory=False)
nodupl = concatDf.drop_duplicates()
nodupl.to_csv("C:\\OUT\\Concat EPC3- NoDupl.csv", index=False)

However, this runs into a MemoryError. My RAM is 16 GB and can't go any higher. Is there a more efficient way of removing duplicates, perhaps one that works in chunks, without me having to break the CSV file up into smaller files?
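A chunked pandas pass along those lines could look roughly like the sketch below; the chunk size of one million rows, the MD5 row digests, and the "\x1f" join separator are assumptions rather than anything from the question, and the paths are the ones above.

import hashlib
import pandas

in_path = "C:\\OUT\\Concat EPC3.csv"
out_path = "C:\\OUT\\Concat EPC3- NoDupl.csv"

seen = set()          # MD5 digests of rows written so far
first_chunk = True
for chunk in pandas.read_csv(in_path, chunksize=1_000_000, dtype=str):
    # Drop duplicates that live inside this chunk
    chunk = chunk.drop_duplicates()
    # Hash each row so duplicates that span different chunks are caught too
    row_hashes = chunk.apply(
        lambda r: hashlib.md5("\x1f".join(r.fillna("")).encode()).digest(), axis=1)
    chunk = chunk[~row_hashes.isin(seen)]
    seen.update(row_hashes)
    chunk.to_csv(out_path, mode="w" if first_chunk else "a",
                 header=first_chunk, index=False)
    first_chunk = False

Note that rewriting through pandas may normalise quoting and number formatting, unlike the purely line-based approaches in the answers below.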

asked Sep 19 '18 by Vlad




2 Answers

Essentially the same idea as zwer's, but checking rows with the same hash for actual equality (instead of automatically discarding every repeated hash).

file_in = "C:\\OUT\\Concat EPC3.csv"
file_out = "C:\\OUT\\Concat EPC3- NoDupl.csv"

with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
    # Skip header
    next(f_in)
    # Find duplicated hashes
    hashes = set()
    hashes_dup = {}
    for row in f_in:
        h = hash(row)
        if h in hashes:
            hashes_dup[h] = set()
        else:
            hashes.add(h)
    del hashes
    # Rewind file
    f_in.seek(0)
    # Copy header
    f_out.write(next(f_in))
    # Copy non repeated lines
    for row in f_in:
        h = hash(row)
        if h in hashes_dup:
            dups = hashes_dup[h]
            if row in dups:
                continue
            dups.add(row)
        f_out.write(row)
answered Oct 02 '22 by jdehesa


The simplest solution would be to create a hash of each line in the file (storing 16M hashes in working memory shouldn't be a problem, though it depends on the hash size), then iterate over the file and write out only the first occurrence of each hash. You don't even need to parse the CSV, nor do you need pandas.

import hashlib

with open("input.csv", "r") as f_in, \
        open("output.csv", "w") as f_out:
    seen = set()  # a set to hold our 'visited' lines
    for line in f_in:  # iterate over the input file line by line
        line_hash = hashlib.md5(line.encode()).digest()  # hash the value
        if line_hash not in seen:  # we're seeing this line for the first time
            seen.add(line_hash)  # add it to the hash table
            f_out.write(line)  # write the line to the output

This uses MD5 as the hash, so it takes about 16 bytes plus set overhead per line, which is still far less than holding everything in memory - you can expect roughly 500 MB of memory usage for a 16-million-line CSV file.
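If you want to see what the set actually uses on your own run rather than rely on the back-of-the-envelope figure, a rough measurement could look like the sketch below (sys.getsizeof ignores allocator overhead, so treat the result as a lower bound):

import sys

def approx_seen_bytes(seen):
    # The set's own hash table plus each stored 16-byte digest object
    return sys.getsizeof(seen) + sum(sys.getsizeof(digest) for digest in seen)

# For example, after the loop above:
# print(f"seen holds {len(seen)} digests, roughly {approx_seen_bytes(seen) / 1e6:.0f} MB")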

answered Oct 02 '22 by zwer