I am trying to find a string near the end of a text file. The problem is that the text file can vary greatly in size, from 3 MB to 4 GB, and every time I try to run a script to find this string in a text file that is around 3 GB, my computer runs out of memory. So I was wondering if there is any way for Python to find the size of the file and then read only the last megabyte of it.
The code I am currently using is as follows, but as I said, I do not seem to have enough memory to read such large files this way.
find_str = "ERROR"
file = open(file_directory)
last_few_lines = file.readlines()[-20:]
error = False
for line in last_few_lines:
    if find_str in line:
        error = True
Use file.seek():
import os

find_str = b"ERROR"  # bytes pattern, since the file is opened in binary mode
error = False
# Open file with 'b' to specify binary mode
with open(file_directory, 'rb') as file:
    file.seek(-1024 * 1024, os.SEEK_END)  # Note minus sign
    if find_str in file.read():
        error = True
You must specify binary mode when you open the file or you will get 'undefined behavior.' Under Python 2 it might work anyway (it did for me), but under Python 3 seek() will raise an io.UnsupportedOperation exception if the file was opened in the default text mode. The Python 3 docs are here. Though it isn't clear from those docs, the SEEK_* constants are still in the os module.
Update: Using a with statement for safer resource management, as suggested by Chris Betti.
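One caveat: seeking a full megabyte back from the end raises an OSError when the file is smaller than that, which matters here since the files start at 3 MB but could in general be tiny. A defensive variant that clamps the offset (the helper name read_tail_bytes is mine, not from the answer) might look like:

```python
import os

def read_tail_bytes(path, max_bytes=1024 * 1024):
    """Return up to the last max_bytes of the file, as bytes."""
    with open(path, 'rb') as f:
        size = f.seek(0, os.SEEK_END)      # seek() returns the new position, i.e. the file size
        f.seek(max(size - max_bytes, 0))   # clamp so files smaller than the window still work
        return f.read()
```

With this helper the check from the question becomes a one-liner: error = b"ERROR" in read_tail_bytes(file_directory).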
You can use the tail recipe with a deque to get the last n lines of a large file:
from collections import deque

def tail(fn, n):
    with open(fn) as fin:
        return list(deque(fin, n))
Now test this.
First create a big file:
>>> with open('/tmp/lines.txt', 'w') as f:
...     for i in range(1, 10000000 + 1):
...         print >> f, 'Line {}'.format(i)  # Python 3: print('Line {}'.format(i), file=f)
# about 128 MB on my machine
Then test:
print tail('/tmp/lines.txt', 20)  # Python 3: print(tail('/tmp/lines.txt', 20))
# ['Line 9999981\n', 'Line 9999982\n', 'Line 9999983\n', 'Line 9999984\n', 'Line 9999985\n', 'Line 9999986\n', 'Line 9999987\n', 'Line 9999988\n', 'Line 9999989\n', 'Line 9999990\n', 'Line 9999991\n', 'Line 9999992\n', 'Line 9999993\n', 'Line 9999994\n', 'Line 9999995\n', 'Line 9999996\n', 'Line 9999997\n', 'Line 9999998\n', 'Line 9999999\n', 'Line 10000000\n']
This will return the last n lines rather than the last X bytes of a file. The memory used is proportional to the size of those n lines, not the size of the file: the file object fin is used as an iterator over the lines of the file, so the entire file is never resident in memory at once.
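Putting the recipe together with the search from the question (the wrapper name has_error is mine, added for illustration):

```python
from collections import deque

def tail(fn, n):
    # deque with maxlen=n keeps only the last n lines as the file is iterated
    with open(fn) as fin:
        return list(deque(fin, n))

def has_error(fn, find_str="ERROR", n=20):
    # True if any of the last n lines contains find_str
    return any(find_str in line for line in tail(fn, n))
```

This mirrors the original script's intent (search the last 20 lines) while keeping memory bounded by the length of those lines.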