Read head / tail of gzip file without reading it into memory [duplicate]

Possible Duplicate:
How can I tail a zipped file without reading its entire contents?

I have a 7GB gzip syslog file that extracts to over 25GB. I need to retrieve only the first and last lines of the file without reading the whole file into memory at once.

GzipFile() in Python 2.7 supports the with statement, and iterating lazily means I don't have to read the whole file to get the head (islice() pulls only the first line):

>>> from itertools import islice
>>> from gzip import GzipFile
>>> with GzipFile('firewall.4.gz') as file:
...     head = list(islice(file, 1))
>>> head
['Oct  2 07:35:14 192.0.2.1 %ASA-6-305011: Built dynamic TCP translation from INSIDE:192.0.2.40/51807 to OUTSIDE:10.18.61.38/2985\n']

A Python 2.6 version needs a workaround for AttributeError: GzipFile instance has no attribute '__exit__' (in 2.6, GzipFile() doesn't support the with statement):

>>> from itertools import islice
>>> from gzip import GzipFile
>>> class GzipFileHack(GzipFile):
...     def __enter__(self):
...         return self
...     def __exit__(self, type, value, tb):
...         self.close()
>>> with GzipFileHack('firewall.4.gz') as file:
...     head = list(islice(file, 1))

The problem with this is that I have no way to retrieve the tail: islice() doesn't accept negative indexes, and I can't find a way to retrieve the last line without iterating through all 25GB (which takes far too long).

What is the most efficient way to read the tail of a gzip text file without reading the whole file into memory or iterating over all the lines? If this can't be done, please explain why.

asked Oct 06 '12 by Mike Pennington




2 Answers

The deflate format used by gzip compresses in part by finding a matching string somewhere in the immediately preceding 32K of the data and using a reference to the string with an offset and a length. So at any point the ability to decompress from that point depends on the last 32K, which itself depends on the 32K preceding it, and so on back to the beginning. Therefore to decompress the data at any point x in the stream, you need to have decompressed everything from 0 to x-1 first.
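
To make that concrete, here is a small demonstration (the filename and the 1 MiB offset are placeholders): inflating from an arbitrary byte offset almost always fails immediately, because deflate blocks are not byte-aligned and the 32K history is missing.

import zlib

# Jump into the middle of the compressed stream and try to inflate.
with open('firewall.4.gz', 'rb') as f:
    f.seek(1 << 20)                    # arbitrary offset: 1 MiB in
    blob = f.read(64 * 1024)

inflater = zlib.decompressobj(-15)     # raw deflate, no gzip header
try:
    inflater.decompress(blob)          # typically raises zlib.error:
except zlib.error as err:              # no valid block boundary here,
    print('cannot resume mid-stream: %s' % err)  # and no 32K history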

There are a few ways to mitigate this situation. First, if you will be frequently accessing a gzip file randomly, then you may be willing to go through the work of scanning the entire gzip file once and building an index. The index would have within it the previous 32K saved at each of some number of entry points, where the density of those entry points determines the speed of the random access. In the zlib source distribution you can see an example of this in examples/zran.c.

If you are in control of the generation of the gzip file, you can use the Z_FULL_FLUSH flush option periodically to erase the 32K history at those points and allow random-access entry. You would then save the locations of those points as the index, which would not need the 32K of history at each entry point. If those points are infrequent enough, there would be a vanishingly small impact on compression.
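
A minimal sketch of that idea with Python's zlib module (the chunks source and the offsets bookkeeping are illustrative assumptions, and a raw zlib stream is used rather than a gzip file for brevity):

import zlib

compressor = zlib.compressobj()
offsets = []                              # safe re-entry points
with open('indexed.z', 'wb') as out:
    for chunk in chunks:                  # chunks: your log data source
        out.write(compressor.compress(chunk))
        # Z_FULL_FLUSH byte-aligns the output and erases the 32K
        # history, so inflation can restart cold at this offset.
        out.write(compressor.flush(zlib.Z_FULL_FLUSH))
        offsets.append(out.tell())
    out.write(compressor.flush())         # Z_FINISH ends the stream

# To resume later: seek to any offsets[i] and feed the remainder of the
# file to zlib.decompressobj(-15) (raw inflate; no history needed there).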

With just the ability to write gzip output, you can do something similar to Z_FULL_FLUSH with a smidge more overhead by simply writing concatenated gzip streams. gunzip will accept and decode gzip streams that are put together with the cat command, and will write out a single stream of uncompressed data. You can build up a large gzip log in this way, remembering somewhere the offsets of the start of each gzip piece.
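
A hedged sketch of the concatenated-members approach (append_member, read_tail, and the offsets list are hypothetical names, not an API from the answer):

import gzip
import os

def append_member(path, lines, offsets):
    # Append lines as a standalone gzip member and record where it starts.
    with open(path, 'ab') as raw:
        raw.seek(0, os.SEEK_END)
        offsets.append(raw.tell())                 # start of new member
        gz = gzip.GzipFile(fileobj=raw, mode='wb') # fresh gzip stream
        gz.writelines(lines)
        gz.close()

def read_tail(path, offsets):
    # Decompress only the final member to recover the last lines.
    with open(path, 'rb') as raw:
        raw.seek(offsets[-1])                      # skip earlier members
        return gzip.GzipFile(fileobj=raw).readlines()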

If you are only interested in the tail, then you can do what you suggest in one of your comments, which is to simply maintain a cache elsewhere of the tail of the large gzip file.
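
For example, a tiny sketch of that cache (the .tail side file is an assumption, not something from the answer): whenever a process appends to or rotates the big gzip log, it also writes the final line to a small companion file that can be read instantly.

def update_tail_cache(last_line, path='firewall.4.gz'):
    with open(path + '.tail', 'w') as cache:
        cache.write(last_line)

def read_tail_cache(path='firewall.4.gz'):
    with open(path + '.tail') as cache:
        return cache.readline()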

I don't know if you are making the log file or not. If you are, you may want to look at the example of efficiently appending short log messages to a large gzip file, found again in the zlib source distribution (examples/gzlog.h and examples/gzlog.c).

answered Oct 03 '22 by Mark Adler


A gzip file is a stream, so you'll have to read through it to get to the last line:

from gzip import GzipFile
from collections import deque

# A deque with maxlen=1 keeps only the most recent item, so extending it
# with the file iterator retains just the final line while the rest of
# the 25GB is decompressed and discarded line by line, never held in memory.
dq = deque(maxlen=1)
with GzipFile('firewall.4.gz') as file:
    head = next(file)   # first line
    dq.extend(file)     # iterate the remainder, keeping only the last line
tail = dq[0]
answered Oct 03 '22 by John La Rooy