Possible Duplicate:
Get last n lines of a file with Python, similar to tail
Read a file in reverse order using python
I have a file that's about 15GB in size. It's a log file whose output I'm supposed to analyze. I already did basic parsing of a similar but GREATLY smaller file, with just a few lines of logging. Parsing strings is not the issue. The issue is the huge file and the amount of redundant data it contains.
Basically, I'm attempting to make a Python script that I could tell, for example, to give me the last 5000 lines of the file. Again, that's basic argument handling and all that, nothing special there; I can do that.
But how do I tell the file reader to ONLY read the number of lines I specified from the end of the file? I'm trying to skip the huuuuuuge number of lines at the beginning of the file since I'm not interested in those and, to be honest, reading about 15GB of lines from a txt file takes too long. Is there a way to, err, start reading from the end of the file? Does that even make sense?
It all boils down to this: reading a 15GB file line by line takes too long. So I want to skip the already redundant data (redundant to me at least) at the beginning and only read the number of lines I want from the end of the file.
The obvious answer is to manually copy N lines from the file to another file, but is there a way to do this semi-auto-magically and read just the last N lines of the file with Python?
Reading Large Text Files in Python
We can use the file object as an iterator. The iterator will return each line one by one, which can be processed. This will not read the whole file into memory and it's suitable to read large files in Python.
We can read data from a text file using read_table() in pandas. This function reads a general delimited file into a DataFrame object. It is essentially the same as read_csv(), except that the default delimiter is a tab ('\t') instead of a comma.
To read large text files in Python, we can use the file object as an iterator to iterate over the file and perform the required task. Since the iterator just steps through the file line by line and does not require any additional data structure to store the data, comparatively little memory is consumed.
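As a minimal sketch of the iterator approach described above, a bounded collections.deque can keep only the most recent n lines while the file object streams by. The function name tail_by_iteration is made up for illustration. Note that this keeps memory use low but still reads the whole file from the start, so on its own it does not avoid the time cost of scanning a 15GB file.

```python
from collections import deque


def tail_by_iteration(path, n):
    """Return the last n lines of a text file by streaming through it once."""
    with open(path, "r") as f:
        # The file object yields one line at a time; deque(maxlen=n)
        # discards older lines, so at most n lines are held in memory.
        return list(deque(f, maxlen=n))
```

For example, tail_by_iteration("app.log", 5000) would return a list of up to 5000 strings, each still ending in its newline ("app.log" is a placeholder path).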
No. As per the docs, open() wraps a system call and returns a file object; the file contents are not loaded into RAM (unless you invoke, e.g., readlines()).
Farm this out to Unix:
    import os
    os.popen('tail -n 1000 filepath').read()
Use subprocess.Popen instead of os.popen if you need to be able to access stderr (and some other features).
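A hedged sketch of the subprocess route: it uses subprocess.run, the convenience wrapper around Popen in Python 3.7+, rather than Popen directly, and it assumes a Unix-like system with a tail command on the PATH. The function name tail_via_subprocess is invented for this example.

```python
import subprocess


def tail_via_subprocess(path, n):
    """Return the last n lines of path as a list, using the external tail command."""
    result = subprocess.run(
        ["tail", "-n", str(n), path],  # argument list, so no shell quoting is involved
        capture_output=True,           # collect both stdout and stderr
        text=True,                     # decode the output to str
        check=True,                    # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.splitlines(keepends=True)
```

If tail fails (for example because the file does not exist), the raised CalledProcessError carries the error text in its stderr attribute.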
You need to seek to the end of the file, then read chunks in blocks from the end, counting lines, until you've found enough newlines to read your n lines.
Basically, you are re-implementing a simple form of tail.
Here's some lightly tested code that does just that:
    import errno
    import io
    import os

    def lastlines(hugefile, n, bsize=2048):
        # get newlines type; text mode uses universal newlines to find it
        with open(hugefile, 'r') as hfile:
            if not hfile.readline():
                return  # empty, no point
            sep = hfile.newlines  # After reading a line, python gives us this
        assert isinstance(sep, str), 'multiple newline types found, aborting'
        sep = sep.encode('ascii')  # the blocks below are read as bytes

        # find a suitable seek position in binary mode
        with open(hugefile, 'rb') as hfile:
            hfile.seek(0, os.SEEK_END)
            linecount = 0
            pos = 0

            while linecount <= n + 1:
                # read at least n lines + 1 more; we need to skip a partial line later on
                try:
                    hfile.seek(-bsize, os.SEEK_CUR)            # go backwards
                    linecount += hfile.read(bsize).count(sep)  # count newlines
                    hfile.seek(-bsize, os.SEEK_CUR)            # go back again
                except OSError as e:
                    if e.errno == errno.EINVAL:
                        # Attempted to seek past the start, can't go further
                        bsize = hfile.tell()
                        hfile.seek(0, os.SEEK_SET)
                        pos = 0
                        linecount += hfile.read(bsize).count(sep)
                        break
                    raise  # Some other I/O exception, re-raise
                pos = hfile.tell()

        # Re-open and decode as text, starting at our file position from above
        with open(hugefile, 'rb') as raw:
            raw.seek(pos, os.SEEK_SET)
            # pos may fall mid-line or mid-character; errors='replace' avoids a
            # decode error there, and that partial line is skipped below anyway
            hfile = io.TextIOWrapper(raw, errors='replace')

            for line in hfile:
                # We've located n lines *or more*, so skip if needed
                if linecount > n:
                    linecount -= 1
                    continue
                # The rest we yield
                yield line
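For comparison, here is a more compact sketch of the same seek-from-the-end idea that stays in binary mode throughout. It is not the code above, just one possible variant: the name tail_bytes and the 4096-byte default block size are arbitrary choices for illustration, and it assumes lines end in "\n" or "\r\n". It returns bytes, so decoding is left to the caller, and it also copes with a final line that has no trailing newline.

```python
import os


def tail_bytes(path, n, bsize=4096):
    """Return the last n lines of path as a list of bytes objects."""
    if n <= 0:
        return []
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        data = b""
        # Read blocks backwards until the buffer holds more than n newlines
        # (the extra one guarantees the first kept line is complete) or
        # until the start of the file is reached.
        while pos > 0 and data.count(b"\n") <= n:
            step = min(bsize, pos)
            pos -= step
            f.seek(pos, os.SEEK_SET)
            data = f.read(step) + data
    # The first element may be a partial line; slicing keeps only the last n.
    return data.splitlines(keepends=True)[-n:]
```

Only the blocks near the end of the file are read, so the running time depends on the size of the requested tail rather than on the 15GB total.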