Read head / tail of gzip file without reading it into memory [duplicate]

Possible Duplicate:
How can I tail a zipped file without reading its entire contents?

I have a 7GB gzip syslog file that extracts to over 25GB. I need to retrieve only the first and last lines of the file without reading the whole file into memory at once.

GzipFile() in Python 2.7 supports the with statement, and iterating lazily means I don't have to read the whole file to get the head (islice() pulls only the first line):

>>> from itertools import islice
>>> from gzip import GzipFile
>>> with GzipFile('firewall.4.gz') as file:
...     head = list(islice(file, 1))
>>> head
['Oct  2 07:35:14 192.0.2.1 %ASA-6-305011: Built dynamic TCP translation from INSIDE:192.0.2.40/51807 to OUTSIDE:10.18.61.38/2985\n']

A Python 2.6 version needs a workaround for AttributeError: GzipFile instance has no attribute '__exit__' (in 2.6, GzipFile() doesn't support the with statement):

>>> from itertools import islice
>>> from gzip import GzipFile
>>> class GzipFileHack(GzipFile):
...     def __enter__(self):
...         return self
...     def __exit__(self, type, value, tb):
...         self.close()
>>> with GzipFileHack('firewall.4.gz') as file:
...     head = list(islice(file, 1))

The problem with this is that I have no way to retrieve the tail: islice() doesn't accept negative indexes, and I can't find a way to retrieve the last line without iterating through all 25GB (which takes far too long).

What is the most efficient way to read the tail of a gzip text file without reading the whole file into memory or iterating over all the lines? If this can't be done, please explain why.

asked Oct 06 '12 by Mike Pennington




2 Answers

The deflate format used by gzip compresses in part by finding a matching string somewhere in the immediately preceding 32K of the data and using a reference to the string with an offset and a length. So at any point the ability to decompress from that point depends on the last 32K, which itself depends on the 32K preceding it, and so on back to the beginning. Therefore to decompress the data at any point x in the stream, you need to have decompressed everything from 0 to x-1 first.
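
To make that concrete, here is a small demonstration (the filename and the 1 MiB offset are placeholders): inflating from an arbitrary byte offset almost always fails immediately, because deflate blocks are not byte-aligned and the 32K history is missing.

import zlib

# Jump into the middle of the compressed stream and try to inflate.
with open('firewall.4.gz', 'rb') as f:
    f.seek(1 << 20)                    # arbitrary offset: 1 MiB in
    blob = f.read(64 * 1024)

inflater = zlib.decompressobj(-15)     # raw deflate, no gzip header
try:
    inflater.decompress(blob)          # typically raises zlib.error:
except zlib.error as err:              # no valid block boundary here,
    print('cannot resume mid-stream: %s' % err)  # and no 32K history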

There are a few ways to mitigate this situation. First, if you will be frequently accessing a gzip file randomly, then you may be willing to go through the work of scanning the entire gzip file once and building an index. The index would have within it the previous 32K saved at each of some number of entry points, where the density of those entry points determines the speed of the random access. In the zlib source distribution you can see an example of this in examples/zran.c.

If you are in control of the generation of the gzip file, you can use the Z_FULL_FLUSH flush option periodically to erase the 32K history at those points and allow random-access entry. You would then save the locations of those points as the index, which would not need the 32K of history at each entry point. If those points are infrequent enough, there would be a vanishingly small impact on compression.
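
A minimal sketch of that idea with Python's zlib module (the chunks source and the offsets bookkeeping are illustrative assumptions, and a raw zlib stream is used rather than a gzip file for brevity):

import zlib

compressor = zlib.compressobj()
offsets = []                              # safe re-entry points
with open('indexed.z', 'wb') as out:
    for chunk in chunks:                  # chunks: your log data source
        out.write(compressor.compress(chunk))
        # Z_FULL_FLUSH byte-aligns the output and erases the 32K
        # history, so inflation can restart cold at this offset.
        out.write(compressor.flush(zlib.Z_FULL_FLUSH))
        offsets.append(out.tell())
    out.write(compressor.flush())         # Z_FINISH ends the stream

# To resume later: seek to any offsets[i] and feed the remainder of the
# file to zlib.decompressobj(-15) (raw inflate; no history needed there).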

With just the ability to write gzip output, you can do something similar to Z_FULL_FLUSH with a smidge more overhead by simply writing concatenated gzip streams. gunzip will accept and decode gzip streams that are put together with the cat command, and will write out a single stream of uncompressed data. You can build up a large gzip log in this way, remembering somewhere the offsets of the start of each gzip piece.
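
A hedged sketch of the concatenated-members approach (append_member, read_tail, and the offsets list are hypothetical names, not an API from the answer):

import gzip
import os

def append_member(path, lines, offsets):
    # Append lines as a standalone gzip member and record where it starts.
    with open(path, 'ab') as raw:
        raw.seek(0, os.SEEK_END)
        offsets.append(raw.tell())                 # start of new member
        gz = gzip.GzipFile(fileobj=raw, mode='wb') # fresh gzip stream
        gz.writelines(lines)
        gz.close()

def read_tail(path, offsets):
    # Decompress only the final member to recover the last lines.
    with open(path, 'rb') as raw:
        raw.seek(offsets[-1])                      # skip earlier members
        return gzip.GzipFile(fileobj=raw).readlines()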

If you are only interested in the tail, then you can do what you suggest in one of your comments, which is to simply maintain a cache elsewhere of the tail of the large gzip file.
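
For example, a tiny sketch of that cache (the .tail side file is an assumption, not something from the answer): whenever a process appends to or rotates the big gzip log, it also writes the final line to a small companion file that can be read instantly.

def update_tail_cache(last_line, path='firewall.4.gz'):
    with open(path + '.tail', 'w') as cache:
        cache.write(last_line)

def read_tail_cache(path='firewall.4.gz'):
    with open(path + '.tail') as cache:
        return cache.readline()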

I don't know if you are making the log file or not. If you are, you may want to look at the example of efficiently appending short log messages to a large gzip file, found again in the zlib source distribution (examples/gzlog.h and examples/gzlog.c).

answered Oct 03 '22 by Mark Adler


A gzip file is a stream, so you'll have to read through it to get to the last line:

from gzip import GzipFile
from collections import deque

# A deque with maxlen=1 keeps only the most recent item, so extending it
# with the file iterator retains just the final line while the rest of
# the 25GB is decompressed and discarded line by line, never held in memory.
dq = deque(maxlen=1)
with GzipFile('firewall.4.gz') as file:
    head = next(file)   # first line
    dq.extend(file)     # iterate the remainder, keeping only the last line
tail = dq[0]
answered Oct 03 '22 by John La Rooy