
Removing duplicates on very large datasets

I'm working with a 13.9 GB CSV file that contains around 16 million rows and 85 columns. I know that potentially a few hundred thousand of those rows are duplicates. I ran this code to remove them:

import pandas

concatDf = pandas.read_csv("C:\\OUT\\Concat EPC3.csv", low_memory=False)
nodupl = concatDf.drop_duplicates()
nodupl.to_csv("C:\\OUT\\Concat EPC3- NoDupl.csv", index=False)

However, this runs into a MemoryError. My RAM is 16 GB and can't go any higher. Is there a more efficient way of removing duplicates, perhaps one that works in chunks, without me having to break the CSV file up into smaller files?
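A chunked pandas pass along those lines could look roughly like the sketch below; the chunk size of one million rows, the MD5 row digests, and the "\x1f" join separator are assumptions rather than anything from the question, and the paths are the ones above.

import hashlib
import pandas

in_path = "C:\\OUT\\Concat EPC3.csv"
out_path = "C:\\OUT\\Concat EPC3- NoDupl.csv"

seen = set()          # MD5 digests of rows written so far
first_chunk = True
for chunk in pandas.read_csv(in_path, chunksize=1_000_000, dtype=str):
    # Drop duplicates that live inside this chunk
    chunk = chunk.drop_duplicates()
    # Hash each row so duplicates that span different chunks are caught too
    row_hashes = chunk.apply(
        lambda r: hashlib.md5("\x1f".join(r.fillna("")).encode()).digest(), axis=1)
    chunk = chunk[~row_hashes.isin(seen)]
    seen.update(row_hashes)
    chunk.to_csv(out_path, mode="w" if first_chunk else "a",
                 header=first_chunk, index=False)
    first_chunk = False

Note that rewriting through pandas may normalise quoting and number formatting, unlike the purely line-based approaches in the answers below.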

asked Sep 19 '18 by Vlad




2 Answers

Essentially the same idea as zwer's, but checking rows with the same hash for actual equality (instead of automatically discarding every repeated hash).

file_in = "C:\\OUT\\Concat EPC3.csv"
file_out = "C:\\OUT\\Concat EPC3- NoDupl.csv"

with open(file_in, 'r') as f_in, open(file_out, 'w') as f_out:
    # Skip header
    next(f_in)
    # Find duplicated hashes
    hashes = set()
    hashes_dup = {}
    for row in f_in:
        h = hash(row)
        if h in hashes:
            hashes_dup[h] = set()
        else:
            hashes.add(h)
    del hashes
    # Rewind file
    f_in.seek(0)
    # Copy header
    f_out.write(next(f_in))
    # Copy non repeated lines
    for row in f_in:
        h = hash(row)
        if h in hashes_dup:
            dups = hashes_dup[h]
            if row in dups:
                continue
            dups.add(row)
        f_out.write(row)
answered Oct 02 '22 by jdehesa


The simplest solution would be to create a hash of each line in the file (storing 16M hashes in working memory shouldn't be a problem, though it depends on the hash size), then iterate over the file and write out only the first occurrence of each hash. You don't even need to parse the CSV, nor do you need pandas.

import hashlib

with open("input.csv", "r") as f_in, \
        open("output.csv", "w") as f_out:
    seen = set()  # a set to hold our 'visited' lines
    for line in f_in:  # iterate over the input file line by line
        line_hash = hashlib.md5(line.encode()).digest()  # hash the value
        if line_hash not in seen:  # we're seeing this line for the first time
            seen.add(line_hash)  # add it to the hash table
            f_out.write(line)  # write the line to the output

This uses MD5 as the hash, so it takes about 16 bytes plus set overhead per line, which is still far less than holding everything in memory - you can expect roughly 500 MB of memory usage for a 16-million-line CSV file.
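If you want to see what the set actually uses on your own run rather than rely on the back-of-the-envelope figure, a rough measurement could look like the sketch below (sys.getsizeof ignores allocator overhead, so treat the result as a lower bound):

import sys

def approx_seen_bytes(seen):
    # The set's own hash table plus each stored 16-byte digest object
    return sys.getsizeof(seen) + sum(sys.getsizeof(digest) for digest in seen)

# For example, after the loop above:
# print(f"seen holds {len(seen)} digests, roughly {approx_seen_bytes(seen) / 1e6:.0f} MB")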

answered Oct 02 '22 by zwer