
Reading only the end of huge text file [duplicate]

Tags:

python

file

Possible Duplicate:
Get last n lines of a file with Python, similar to tail
Read a file in reverse order using python

I have a log file that's about 15GB in size that I'm supposed to analyze. I've already done basic parsing of a similar but much smaller file with just a few lines of logging. Parsing the strings is not the issue; the issue is the huge file and the amount of redundant data it contains.

Basically I'm attempting to write a Python script that I can tell, for example, "give me the last 5000 lines of the file". Handling the arguments and all that is basic, nothing special there, I can do that.

But how do I tell the file reader to ONLY read the number of lines I specified from the end of the file? I'm trying to skip the huge number of lines at the beginning of the file, since I'm not interested in those, and to be honest, reading 15GB of lines from a text file takes too long. Is there a way to, err... start reading from the end of the file? Does that even make sense?

It all boils down to this: reading a 15GB file line by line takes too long. So I want to skip the data at the beginning that's redundant (to me, at least) and only read the number of lines I want from the end of the file.

The obvious answer is to manually copy the last N lines of the file into another file, but is there a way to do this semi-auto-magically, just reading the last N lines of the file with Python?

asked Sep 06 '12 by Mike


People also ask

How do I read a 100gb file in Python?

Reading Large Text Files in Python: we can use the file object as an iterator. The iterator returns each line one by one, and each can be processed as it arrives. This does not read the whole file into memory, so it's suitable for reading large files in Python.
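A minimal sketch of the iterator approach described above; the function name `count_error_lines` and the `'ERROR'` marker are hypothetical choices for illustration:

```python
def count_error_lines(path):
    """Count lines containing 'ERROR' while holding only one line in memory."""
    count = 0
    with open(path, 'r') as f:
        for line in f:          # the file object is its own line iterator
            if 'ERROR' in line:
                count += 1
    return count
```

Because the loop never calls `read()` or `readlines()`, memory use stays flat no matter how large the file is.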

How do I read large text files in pandas?

We can read data from a text file using read_table() in pandas. This function reads a general delimited file into a DataFrame object. It is essentially the same as the read_csv() function, but with delimiter='\t' instead of a comma by default.

How do I open and read a large text file in Python?

To read large text files in Python, we can use the file object as an iterator to iterate over the file and perform the required task. Since the iterator just walks over the file and needs no additional data structure for storage, comparatively little memory is consumed.

Does Python open read file into memory?

No. As per the docs, open() wraps a system call and returns a file object; the file contents are not loaded into RAM unless you invoke, e.g., readlines().


2 Answers

Farm this out to unix:

import os
os.popen('tail -n 1000 filepath').read()

Use subprocess.Popen instead of os.popen if you need access to stderr (and some other features).
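A sketch of the subprocess route (the function name `tail_lines` is hypothetical; this assumes a Unix-like system with `tail` on the PATH):

```python
import subprocess

def tail_lines(path, n=1000):
    """Return the last n lines of *path* by shelling out to `tail`.

    check=True raises CalledProcessError on a non-zero exit status,
    e.g. when the file does not exist.
    """
    result = subprocess.run(
        ['tail', '-n', str(n), path],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()
```

Unlike os.popen, this surfaces the exit status and stderr, so a missing file fails loudly instead of silently returning an empty string.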

answered Oct 22 '22 by user1479095


You need to seek to the end of the file, then read blocks backwards, counting newlines, until you've found enough of them to cover your n lines.

Basically, you are re-implementing a simple form of tail.

Here's some lightly tested code that does just that:

import os, errno

def lastlines(hugefile, n, bsize=2048):
    # Determine the newline convention: after reading a line in text
    # mode, Python exposes it via the file object's .newlines attribute.
    with open(hugefile, 'r') as hfile:
        if not hfile.readline():
            return  # empty file, nothing to yield
        sep = hfile.newlines
    assert isinstance(sep, str), 'multiple newline types found, aborting'
    sep = sep.encode()  # we count newlines in binary chunks below

    # Find a suitable seek position, scanning backwards in binary mode
    with open(hugefile, 'rb') as hfile:
        hfile.seek(0, os.SEEK_END)
        linecount = 0
        pos = 0

        while linecount <= n + 1:
            # read at least n lines + 1 more; we need to skip a partial line later on
            try:
                hfile.seek(-bsize, os.SEEK_CUR)           # go backwards
                linecount += hfile.read(bsize).count(sep) # count newlines
                hfile.seek(-bsize, os.SEEK_CUR)           # go back again
            except OSError as e:
                if e.errno == errno.EINVAL:
                    # Attempted to seek past the start; read what remains
                    bsize = hfile.tell()
                    hfile.seek(0, os.SEEK_SET)
                    pos = 0
                    linecount += hfile.read(bsize).count(sep)
                    break
                raise  # some other I/O error, re-raise
            pos = hfile.tell()

    # Re-open in text mode and yield from the computed position.
    # Note: seeking a text file to a byte offset assumes the offset lands
    # on a valid character boundary (true for ASCII and for the newline
    # bytes of UTF-8).
    with open(hugefile, 'r') as hfile:
        hfile.seek(pos, os.SEEK_SET)  # our file position from above

        for line in hfile:
            # We've located n lines *or more*, so skip the surplus
            if linecount > n:
                linecount -= 1
                continue
            # The rest we yield
            yield line
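If reading the whole file once is acceptable, a much simpler alternative is collections.deque with maxlen (a sketch; the function name `lastlines_simple` is hypothetical). It still streams every byte, so it is slower than the seeking approach for a 15GB file, but it holds only n lines in memory at any time:

```python
from collections import deque

def lastlines_simple(path, n):
    """Return the last n lines by streaming the whole file once.

    deque discards lines from the front once maxlen is reached, so
    memory use is bounded by n regardless of the file size.
    """
    with open(path, 'r') as f:
        return deque(f, maxlen=n)
```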
answered Oct 22 '22 by Martijn Pieters