Reading only the end of huge text file [duplicate]

Tags:

file

Possible Duplicate:
Get last n lines of a file with Python, similar to tail
Read a file in reverse order using python

I have a log file of about 15GB whose output I'm supposed to analyze. I have already done basic parsing of a similar but much smaller file, with just a few lines of logging. Parsing strings is not the issue. The issue is the huge file and the amount of redundant data it contains.

Basically, I'm trying to write a Python script that I could tell, for example, to give me the last 5000 lines of the file. Handling the arguments and so on is basic; nothing special there, I can do that.

But how do I tell the file reader to read ONLY the number of lines I specified from the end of the file? I'm trying to skip the huge number of lines at the beginning of the file, since I'm not interested in those and, to be honest, reading about 15GB of lines from a text file takes too long. Is there a way to, err, start reading from the end of the file? Does that even make sense?

It all boils down to this: reading a 15GB file line by line takes too long. So I want to skip the data at the beginning, which is redundant (to me at least), and read only the number of lines I want from the end of the file.

The obvious answer is to manually copy N lines from the file to another file, but is there a way to do this semi-automagically and read just the last N lines of the file with Python?

440

asked Sep 06 '12 06:09

Mike

2 Answers

Farm this out to Unix:

import os
os.popen('tail -n 1000 filepath').read()

Use subprocess.Popen instead of os.popen if you need access to stderr (and some other features).
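The subprocess route could look like the minimal sketch below. It uses subprocess.run, a convenience wrapper around subprocess.Popen, and assumes a Unix-like system with tail on the PATH and Python 3.7 or later. The function name tail_lines and its parameters are hypothetical, chosen only for illustration.

```python
import subprocess

def tail_lines(path, n=1000):
    """Return the last n lines of path as a list of strings, using tail."""
    # Passing the arguments as a list (no shell) avoids quoting problems
    # with unusual file names; "--" stops tail from treating a path that
    # starts with "-" as an option.
    result = subprocess.run(
        ["tail", "-n", str(n), "--", path],
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode the output to str
        check=True,           # raise CalledProcessError on a non-zero exit
    )
    return result.stdout.splitlines()
```

If tail fails, for example because the file does not exist, the raised CalledProcessError carries tail's error message in its stderr attribute.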

179

answered Oct 22 '22 19:10

user1479095

You need to seek to the end of the file, then read blocks backwards from the end, counting newlines, until you've found enough of them to read your n lines.

Basically, you are re-implementing a simple form of tail.

Here's some lightly tested code that does just that:

import os

def lastlines(hugefile, n, bsize=2048, encoding='utf-8'):
    # Python 3 port. Everything is done in binary mode: b'\n' ends a line
    # in both '\n' and '\r\n' files (bare '\r' line endings are not handled),
    # and byte offsets are always valid seek targets.
    with open(hugefile, 'rb') as hfile:
        pos = hfile.seek(0, os.SEEK_END)
        if pos == 0:
            return  # empty, no point

        # An unterminated last line has no newline of its own, so count it here
        hfile.seek(-1, os.SEEK_END)
        linecount = 0 if hfile.read(1) == b'\n' else 1

        while pos > 0 and linecount <= n + 1:
            # read at least n lines + 1 more; we need to skip a partial line later on
            step = min(bsize, pos)                      # never seek past the start
            pos -= step
            hfile.seek(pos, os.SEEK_SET)                # go backwards
            linecount += hfile.read(step).count(b'\n')  # count newlines

        # Read forwards from the position found above
        hfile.seek(pos, os.SEEK_SET)
        for line in hfile:
            # We've located n lines *or more*, so skip if needed
            if linecount > n:
                linecount -= 1
                continue
            # The rest we yield, decoded to text (UTF-8 is an assumed default)
            yield line.decode(encoding)
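
For Python 3, the same idea (seek to the end, read blocks backwards, count newlines) can also be written so that the blocks are kept in memory and split afterwards. The following is a minimal sketch, not the answer's original code. The name last_lines, the block_size and encoding parameters, and the UTF-8 default are assumptions for illustration, and it assumes the requested lines fit comfortably in memory.

```python
import os

def last_lines(path, n, block_size=4096, encoding="utf-8"):
    """Return the last n lines of path as a list of strings (no line endings)."""
    if n <= 0:
        return []
    chunks = []    # blocks read so far, last block of the file first
    newlines = 0
    with open(path, "rb") as f:
        pos = f.seek(0, os.SEEK_END)  # seek() returns the new position
        # Read blocks backwards until more than n newlines are buffered
        # (the extra one guarantees the first kept line is complete)
        # or the start of the file is reached.
        while pos > 0 and newlines <= n:
            step = min(block_size, pos)
            pos -= step
            f.seek(pos)
            chunk = f.read(step)
            newlines += chunk.count(b"\n")
            chunks.append(chunk)
    data = b"".join(reversed(chunks))
    # Split on byte line endings first, then decode only the kept lines,
    # so a block boundary inside a multi-byte character cannot break decoding.
    return [line.decode(encoding) for line in data.splitlines()[-n:]]
```

Only the tail of the file is ever read, so the cost depends on n and the line lengths rather than on the 15GB total.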

answered Oct 22 '22 21:10

Martijn Pieters
