I have a ~220 million row, 7 column csv file. I need to remove row 2636759. This file is 7.7GB, more than will fit in memory. I'm most familiar with R, but could also do this in python or bash.
I can't read or write this file in one operation. What is the best way to build this file incrementally on disk, instead of trying to do this all in memory?
I've tried to find this on SO, but have only found answers for files that are small enough to read and write in memory, or for removing rows near the beginning of the file.
A Python solution:

import os

with open('tmp.csv', 'w') as tmp:
    with open('file.csv', 'r') as infile:
        for linenumber, line in enumerate(infile):
            if linenumber != 2636758:  # enumerate() is 0-based, so this is row 2636759 counting from 1
                tmp.write(line)

# copy back to original file. You can skip this if you don't
# mind (or prefer) having both files lying around
with open('tmp.csv', 'r') as tmp:
    with open('file.csv', 'w') as out:
        for line in tmp:
            out.write(line)

os.remove('tmp.csv')  # remove the temporary file
This duplicates the data, which may not be ideal if disk space is tight. An in-place rewrite is more complicated unless you load the whole file into RAM first.

The key is that Python naturally treats files as iterables, so they are consumed lazily: you never need to hold the entire file in memory at once.
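That same lazy iteration lets you peek at the offending row before deleting anything. A minimal sketch, assuming the same file.csv name as above and that rows are counted from 1:

from itertools import islice

# read just row 2636759 (index 2636758, since islice is 0-based) and nothing else
with open('file.csv', 'r') as infile:
    row = next(islice(infile, 2636758, 2636759), None)

print(row)  # eyeball the row before committing to its removal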
If raw speed isn't your primary concern, I like this solution because you can replace the test linenumber != VALUE with any conditional you like, for example filtering out lines that include a particular date:
test = lambda line: 'NOVEMBER' in line

with open('tmp.csv', 'w') as tmp:
    ...
    if not test(line):
        ...
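Spelled out in full, that variant looks like the sketch below. It assumes the same file.csv and tmp.csv names as above, and it swaps the filtered copy back over the original with os.replace rather than copying it line by line (which requires both files to be on the same filesystem):

import os

def keep(line):
    # drop any line containing the marker; 'NOVEMBER' is just a stand-in
    return 'NOVEMBER' not in line

with open('tmp.csv', 'w') as tmp:
    with open('file.csv', 'r') as infile:
        for line in infile:
            if keep(line):
                tmp.write(line)

os.replace('tmp.csv', 'file.csv')  # replace the original with the filtered copy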
In-place read/writes and memory-mapped file objects (which may be considerably faster) require a lot more bookkeeping.
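To give a sense of that bookkeeping, here is a minimal sketch of an in-place rewrite under the same assumptions as above (file.csv, 0-based index 2636758). It opens the file twice: one handle streams lines forward while the other writes each kept line back at a lagging offset, and the file is truncated at the end. No extra disk space is needed, but if the process dies partway through, the file is left half-shifted, which the temporary-file approach avoids:

SKIP = 2636758  # 0-based index of the line to drop

with open('file.csv', 'rb') as src, open('file.csv', 'r+b') as dst:
    offset = 0        # byte offset where the next kept line should start
    shifting = False  # becomes True once the dropped line has been passed
    for linenumber, line in enumerate(src):
        if linenumber == SKIP:
            dst.seek(offset)  # start writing where the deleted line began
            shifting = True
            continue
        if shifting:
            dst.write(line)   # shift this line towards the front of the file
        offset += len(line)
    dst.truncate(offset)      # chop off the now-duplicated tail

A memory-mapped version could do the same shift with the mmap object's move() method before truncating, but it needs a 64-bit Python to map a 7.7 GB file and the bookkeeping is much the same.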