
Skip a long line when reading a big file to avoid MemoryError?

I need to scan two large txt files (each about 100GB, 1 billion rows, several columns) and extract a certain column, writing it to new files. The files look like this:

ID*DATE*provider
1111*201101*1234
1234*201402*5678
3214*201003*9012
...

My Python script is:

N100 = 10000000   ## 1% of 1 billion rows
with open("myFile.txt") as f:
    with open("myFile_c2.txt", "a") as f2:
        perc = 0
        for ind, line in enumerate(f):   ## <== MemoryError
            c0, c1, c2  = line.split("*")
            f2.write(c2+"\n")
            if ind%N100 == 0: 
                print(perc, "%")
                perc+=1

Now, the above script runs well for one file but gets stuck on the other one at 62%. The error message says MemoryError, raised at the line for ind, line in enumerate(f):. I tried several times on different servers with different amounts of RAM; the error is the same, always at 62%. I watched the RAM for hours and it ballooned to 28GB (out of 32GB total) at 62%. So I guess that file contains a line that is somehow too long (maybe not terminated with \n?) and Python gets stuck trying to read it into RAM.

So my question is: before I go back to my data provider, what can I do to detect the offending line and somehow work around or skip reading it as one huge line? I'd appreciate any suggestions!
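One way to locate the offending line without ever holding it in memory is a chunked binary scan that tracks the longest run of bytes containing no newline, plus the byte offset where that run starts. This is a minimal sketch (the function name and chunk size are illustrative, not from the original post):

```python
def find_longest_gap(path, chunk_size=1 << 20):
    """Scan `path` in binary chunks, returning (length, byte_offset)
    of the longest stretch of bytes that contains no b'\\n'."""
    longest = longest_at = 0      # best run seen so far
    current = current_start = 0   # run currently being measured
    offset = 0                    # bytes consumed so far
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            pos = 0
            while True:
                nl = chunk.find(b"\n", pos)
                if nl == -1:
                    # No newline in the rest of this chunk.
                    current += len(chunk) - pos
                    break
                current += nl - pos
                if current > longest:
                    longest, longest_at = current, current_start
                current = 0
                current_start = offset + nl + 1
                pos = nl + 1
            offset += len(chunk)
    if current > longest:          # file may end without a newline
        longest, longest_at = current, current_start
    return longest, longest_at
```

Running this on the broken file should report a gap of many gigabytes starting somewhere near the 62% mark, which would confirm (or refute) the missing-\n theory before contacting the data provider.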

EDIT:

Starting from the 'error line', the file might have all its rows run together with a different line separator instead of \n. If that's the case, can I detect the line separator and continue extracting the columns I want, rather than throwing them away? Thanks!
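If the tail of the file really does use a different separator, a crude way to identify it is to read one binary chunk from around the failure point and count candidate terminators. The candidate list, chunk size, and function name below are assumptions; adjust them to the data:

```python
def guess_line_separator(path, offset=0, chunk_size=1_000_000):
    """Count candidate line separators in one binary chunk of `path`,
    starting at byte `offset`, and return the most frequent one."""
    candidates = [b"\r\n", b"\n", b"\r", b"\x00"]
    with open(path, "rb") as f:
        f.seek(offset)
        chunk = f.read(chunk_size)
    counts = {sep: chunk.count(sep) for sep in candidates}
    # A b"\r\n" pair would otherwise also be counted as a \r and a \n.
    counts[b"\n"] -= counts[b"\r\n"]
    counts[b"\r"] -= counts[b"\r\n"]
    return max(counts, key=counts.get)
```

Once the separator is known, the tail of the file can be re-read in binary and split on it, so the columns after the 'error line' need not be thrown away.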

Asked Mar 19 '26 by Jason Lou


1 Answer

This (untested) code might solve your problem. It limits its input to 1,000,000 characters per read, which caps its maximum memory consumption.

Note that this code returns the first million characters from each line. There are other possibilities for how to deal with a long line:

  • return the first million characters
  • return the last million characters
  • skip the line entirely, optionally logging that, or
  • raise an exception.
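As an illustration, the "skip the line entirely" variant could look like the following untested sketch, built on the same readline-with-a-size-limit idea (the function name and `limit` default are mine, not from the post):

```python
# UNTESTED -- a sketch of the "skip the line entirely" option.
def skip_long_line(fp, limit=int(1e6)):
    """Return the next line, '' at end of file, or None for a line
    longer than `limit` characters (the long line is consumed)."""
    line = fp.readline(limit)
    if not line or line[-1] == '\n':
        return line               # normal line, or end of file
    # Too long: drain the rest of the line, then report it skipped.
    tmp = line
    while tmp and tmp[-1] != '\n':
        tmp = fp.readline(limit)
    return None
```

The caller then writes the column only when the returned value is neither None nor ''.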

 

#UNTESTED
def read_start_of_line(fp):
    """Return the next line, truncated to its first million characters.

    Any excess, up to and including the newline, is read and discarded
    so that the file position advances to the start of the next line.
    """
    n = int(1e6)
    tmp = result = fp.readline(n)
    while tmp and tmp[-1] != '\n':
        tmp = fp.readline(n)
    return result

N100 = 10000000   ## 1% of 1 billion rows
with open("myFile.txt") as f:
    with open("myFile_c2.txt", "a") as f2:
        perc = 0
        ## iter() with the sentinel '' stops cleanly at end of file.
        for ind, line in enumerate(iter(lambda: read_start_of_line(f), '')):
            ## rstrip so the newline is not duplicated in the output
            c0, c1, c2 = line.rstrip("\n").split("*")
            f2.write(c2 + "\n")
            if ind % N100 == 0:
                print(perc, "%")
                perc += 1
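To sanity-check the helper on synthetic data, the limit can be exposed as a parameter and lowered so that truncation is visible on short strings (n=8 here is purely for the demo):

```python
import io

def read_start_of_line(fp, n=int(1e6)):
    """Same helper as above, with the limit exposed for testing."""
    tmp = result = fp.readline(n)
    while tmp and tmp[-1] != '\n':
        tmp = fp.readline(n)
    return result

# The third "line" is pathologically long; every line is cut at 8 chars.
data = "1111*201101*1234\n2222*201102*5678\n" + "x" * 40 + "\n3333*201103*9012\n"
f = io.StringIO(data)
lines = list(iter(lambda: read_start_of_line(f, n=8), ''))
# lines == ['1111*201', '2222*201', 'xxxxxxxx', '3333*201']
```

Each truncated line still starts at a real line boundary, which is what lets the column extraction continue past the bad line.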
Answered Mar 20 '26 by Robᵩ


