How do I remove lines from a big file in Python in a limited environment?

Tags: python, file, lines

Say I have an Ubuntu VPS in the USA with a 10GB HDD (and I live somewhere else), and I have a 9GB text file on that drive. I have 512MB of RAM, and about the same amount of swap.

Given that I cannot add more HDD space and cannot move the file elsewhere to process it, is there an efficient method to remove some lines from the file using Python (preferably, but any other language will be acceptable)?

James Lin asked Dec 17 '10 10:12


2 Answers

How about this? It edits the file in place. I've tested it on some small text files (in Python 2.6.1), but I'm not sure how well it will perform on massive files, given all the seeking back and forth.

I've used an infinite while loop with a manual EOF check, because for line in f: didn't work correctly (presumably all the seeking interferes with the normal iteration). There may be a better way to check this, but I'm relatively new to Python, so someone please let me know if there is.
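On the EOF question, one common idiom (offered here as a sketch, not as part of the original answer) is the two-argument form of iter(), which keeps calling f.readline until it returns the sentinel value. Because it calls readline() directly rather than using the file's own iterator, tell() and seek() stay usable inside the loop. The helper name line_offsets is hypothetical and exists only to demonstrate the loop.

```python
def line_offsets(path):
    """Return (start_offset, line) pairs for every line in a file.

    Hypothetical helper: it only demonstrates looping with
    iter(f.readline, b"") while still calling f.tell().
    """
    result = []
    with open(path, "rb") as f:
        start = 0
        # readline() returns b"" only at EOF; a blank line is b"\n",
        # so it does not end the loop early.
        for line in iter(f.readline, b""):
            result.append((start, line))
            start = f.tell()
    return result
```

The same loop header can stand in for the while True / if line == "" pair whenever the file is opened in binary mode.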

Also, you'll need to define the function isRequired(line).

writeLoc = 0
readLoc = 0
#binary mode keeps tell()/seek() offsets as plain byte counts,
#so isRequired(line) receives a bytes object
with open( "filename" , "r+b" ) as f:
    while True:
        line = f.readline()

        #manual EOF check: readline() returns an empty string only
        #at EOF (a blank line still contains its newline)
        if line == b"":
            break

        #save how far we've read
        readLoc = f.tell()

        #if we need this line, write it at the write location,
        #update the write location, then resume reading
        if isRequired(line):
            f.seek( writeLoc )
            f.write( line )
            writeLoc = f.tell()
            f.seek( readLoc )

    #finally, chop off the rest of file that's no longer needed
    f.truncate( writeLoc )
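To make the snippet above easier to try, here is a minimal sketch that wraps the same read-pointer / write-pointer idea in a function and supplies one possible predicate. The names remove_lines_in_place and is_required, and the rule "drop blank lines and lines starting with #", are assumptions for illustration only; substitute whatever test your data needs.

```python
def remove_lines_in_place(path, is_required):
    """Drop every line for which is_required(line) is false, in place."""
    write_loc = 0
    # Binary mode: offsets are byte counts and lines are bytes objects.
    with open(path, "r+b") as f:
        while True:
            line = f.readline()
            if not line:  # empty bytes means EOF
                break
            read_loc = f.tell()
            if is_required(line):
                # Only rewrite the line if an earlier removal shifted it.
                if write_loc != read_loc - len(line):
                    f.seek(write_loc)
                    f.write(line)
                    f.seek(read_loc)
                write_loc += len(line)
        # Cut off the stale tail left behind the last kept line.
        f.truncate(write_loc)


def is_required(line):
    """Example predicate (an assumption): keep non-blank, non-comment lines."""
    stripped = line.strip()
    return bool(stripped) and not stripped.startswith(b"#")
```

Calling remove_lines_in_place("filename", is_required) needs no extra disk space, but the file is in an intermediate state until truncate() runs, so an interruption midway leaves it partially rewritten.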
DMA57361 answered Sep 28 '22 08:09


Try this:

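currentReadPos = 0
removedLinesLength = 0
# binary mode so that len(line) matches the byte offsets from tell()
with open("filename", "r+b") as file:
    # use readline() rather than "for line in file": file iteration
    # does not mix reliably with tell()
    while True:
        line = file.readline()
        if not line:
            break
        currentReadPos = file.tell()
        if remove(line):
            removedLinesLength += len(line)
        elif removedLinesLength:
            # move the kept line back over the bytes removed so far
            file.seek(currentReadPos - len(line) - removedLinesLength)
            file.write(line)  # line already ends with its newline
            file.seek(currentReadPos)
    # drop the leftover tail
    file.truncate(currentReadPos - removedLinesLength)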

I have not run this, but the idea is to modify the file in place by overwriting the lines you want to remove with the lines you want to keep. I am not sure how seeking and writing interact with iterating over the file.
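Since both answers worry about the cost of seeking once per line, here is a hedged sketch of the same overwrite-in-place idea done in batches: read roughly chunk_bytes worth of whole lines, filter them in memory, and write the survivors back in one call. The function name remove_lines_chunked, the keep callback, and the 1 MiB default are assumptions, not something from either answer; the batch size should stay well below the available RAM.

```python
def remove_lines_chunked(path, keep, chunk_bytes=1 << 20):
    """Remove lines in place, processing about chunk_bytes of lines per pass."""
    read_pos = 0
    write_pos = 0
    with open(path, "r+b") as f:
        while True:
            f.seek(read_pos)
            # readlines(hint) returns whole lines and stops once their
            # total size reaches roughly the hint.
            lines = f.readlines(chunk_bytes)
            if not lines:
                break
            read_pos = f.tell()
            kept = b"".join(line for line in lines if keep(line))
            # The kept data is never longer than what was just read, so
            # this write cannot run past read_pos into unread lines.
            f.seek(write_pos)
            f.write(kept)
            write_pos = f.tell()
        f.truncate(write_pos)
```

For example, remove_lines_chunked("filename", lambda line: b"ERROR" in line) would keep only the lines containing ERROR. Whether this is actually faster than the per-line version on a given VPS is something to measure rather than assume.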

Björn Pollex answered Sep 28 '22 08:09