Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

remove known exact row in huge csv

Tags:

python

r

csv

I have a ~220 million row, 7 column csv file. I need to remove row 2636759. This file is 7.7GB, more than will fit in memory. I'm most familiar with R, but could also do this in python or bash.

I can't read or write this file in one operation. What is the best way to build this file incrementally on disk, instead of trying to do this all in memory?

I've tried to find this on SO but have only been able to find how to do this with files that are small enough to read/write in memory, or with rows that are at the beginning of the file.

like image 442
hedgedandlevered Avatar asked Apr 21 '16 20:04

hedgedandlevered


1 Answers

A python solution:

import os
with open('tmp.csv','w') as tmp:

    with open('file.csv','r') as infile:
        for linenumber, line in enumerate(infile):
            if linenumber != 10234:
                tmp.write(line)

# copy back to original file. You can skip this if you don't
# mind (or prefer) having both files lying around           
with open('tmp.csv','r') as tmp:
    with open('file.csv','w') as out:
        for line in tmp:
            out.write(line)

os.remove('tmp.csv') # remove the temporary file

This duplicates the data, which may not be optimal if disk space is an issue. An in place write will be more complicated without loading the whole file into RAM first


The key is that python naturally supports handling files as iterables. This means it can be lazily evaluated, and you will never need to hold the entire thing in memory at one time


I like this solution, if your primary concern isn't raw speed, because you can replace the line linenumber != VALUE with any conditional test, for example, filtering out lines that include a particular date

test = lambda line : 'NOVEMBER' in line
with open('tmp.csv','w') as tmp:
    ...
    if test(line):
    ...

In-place read-writes and memory mapped file objects (which may be considerably faster) are going to require considerably more book keeping

like image 126
en_Knight Avatar answered Sep 18 '22 23:09

en_Knight