I need to scan two large text files (each about 100 GB, 1 billion rows, several columns) and extract one column, writing it to a new file. The files look like this:
ID*DATE*provider
1111*201101*1234
1234*201402*5678
3214*201003*9012
...
My Python script is
N100 = 10000000 ## 1% of 1 billion rows
with open("myFile.txt") as f:
    with open("myFile_c2.txt", "a") as f2:
        perc = 0
        for ind, line in enumerate(f): ## <== MemoryError
            c0, c1, c2 = line.split("*")
            f2.write(c2+"\n")
            if ind%N100 == 0:
                print(perc, "%")
                perc+=1
The script above runs fine for one file but gets stuck on the other at 62%. The error is a MemoryError raised at for ind, line in enumerate(f):. I tried several times on different servers with different amounts of RAM, and the error is always the same, always at 62%. I monitored the RAM for hours and saw usage grow to 28 GB (out of 32 GB total) at 62%. So I guess that file contains a line that is somehow far too long (maybe not terminated by \n?), and Python gets stuck trying to read it into RAM.
So my question is: before I go back to my data provider, what can I do to detect the offending line and somehow get around or skip reading it as one huge line? I'd appreciate any suggestions!
EDIT:
From the 'error line' onward, the file might be all run together using some line separator other than \n. If that's the case, can I detect the line separator and continue extracting the columns I want, rather than throwing them away? Thanks!
This (untested) code might solve your problem. It limits each read to 1,000,000 characters, which bounds its maximum memory consumption.
Note that this code returns only the first million characters of each line and discards the rest. Other ways of dealing with an over-long line are possible.
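#UNTESTED
def read_start_of_line(fp):
    """Return at most the first n characters of the next line, discarding the rest."""
    n = int(1e6)
    tmp = result = fp.readline(n)
    # Keep reading (and throwing away) until the end of this line or EOF.
    while tmp and tmp[-1] != '\n':
        tmp = fp.readline(n)
    return result

N100 = 10000000 ## 1% of 1 billion rows
with open("myFile.txt") as f:
    with open("myFile_c2.txt", "a") as f2:
        perc = 0
        # iter() with a sentinel stops at the empty string returned at EOF.
        for ind, line in enumerate(iter(lambda: read_start_of_line(f), '')):
            fields = line.rstrip("\n").split("*")
            if len(fields) == 3:  # skip malformed (e.g. truncated) lines
                f2.write(fields[2] + "\n")
            if ind % N100 == 0:
                print(perc, "%")
                perc += 1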
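Regarding the EDIT: if the rest of the file really does use a different separator, one option is to read in fixed-size binary chunks instead of relying on \n. The following is only a rough sketch under assumptions, not a tested solution for your file. The function names (find_first_long_line, count_candidates, iter_records), the candidate separator list, and the size limits are all made up for illustration. The first function reports the byte offset where the first over-long line starts, the second counts a few guessed separators in a sample taken from that offset, and the third yields records split on whatever separator you settle on.

```python
import io

def find_first_long_line(fp, limit=1_000_000, chunk_size=1 << 20):
    """Return the byte offset where the first line longer than `limit`
    bytes starts, or None if no such line exists. `fp` must be binary."""
    line_start = 0  # offset where the current line begins
    pos = 0         # offset where the current chunk begins
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            return None
        search_from = 0
        while True:
            nl = chunk.find(b"\n", search_from)
            if nl == -1:
                break
            if pos + nl - line_start > limit:
                return line_start
            line_start = pos + nl + 1
            search_from = nl + 1
        pos += len(chunk)
        # The unterminated tail is already too long: no need to read on.
        if pos - line_start > limit:
            return line_start

def count_candidates(fp, offset, sample_size=1 << 16,
                     candidates=(b"\r", b"\x00", b"\x1e", b";")):
    """Count guessed separators in a sample starting at `offset`.
    The candidate list is an assumption; adjust it after inspecting the data."""
    fp.seek(offset)
    sample = fp.read(sample_size)
    return {c: sample.count(c) for c in candidates}

def iter_records(fp, sep=b"\n", chunk_size=1 << 20):
    """Yield records split on `sep`, reading `chunk_size` bytes at a time.
    Memory stays bounded only if `sep` actually occurs regularly."""
    buf = b""
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        # Everything before the last separator is complete; keep the tail.
        *records, buf = buf.split(sep)
        yield from records
    if buf:
        yield buf
```

In practice you would open the file with open("myFile.txt", "rb"), call find_first_long_line to get the offset, look at the counts (or print a slice of the sample) to decide on the separator, then seek to that offset and loop over iter_records, splitting each record on b"*" as before.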