Possible Duplicate:
How can I tail a zipped file without reading its entire contents?
I have a 7GB gzip syslog file that extracts to over 25GB. I need to retrieve only the first and last lines of the file without reading the whole file into memory at once.
GzipFile() in Python 2.7 supports the with statement, which lets me read the head (iterating lazily means I don't have to read the whole file):
>>> from itertools import islice
>>> from gzip import GzipFile
>>> with GzipFile('firewall.4.gz') as file:
... head = list(islice(file, 1))
>>> head
['Oct 2 07:35:14 192.0.2.1 %ASA-6-305011: Built dynamic TCP translation
from INSIDE:192.0.2.40/51807 to OUTSIDE:10.18.61.38/2985\n']
Python 2.6 version, to avoid AttributeError: GzipFile instance has no attribute '__exit__' (since GzipFile() doesn't support the with statement in 2.6)...
>>> from itertools import islice
>>> from gzip import GzipFile
>>> class GzipFileHack(GzipFile):
...     def __enter__(self):
...         return self
...     def __exit__(self, type, value, tb):
...         self.close()
>>> with GzipFileHack('firewall.4.gz') as file:
... head = list(islice(file, 1))
The problem with this is that I have no way to retrieve the tail: islice() doesn't support negative values, and I can't find a way to retrieve the last line without iterating through 25GB of text (which takes far too long).
What is the most efficient way to read the tail of a gzip text file without reading the whole file into memory or iterating over all the lines? If this can't be done, please explain why.
In Python you can work with gzip files directly using the gzip module, reading them line by line without decompressing to disk; the shutil module (which offers high-level operations on files such as copying and deletion) can likewise create a gzip file from a plain text file without reading it line by line.
On Linux you can view the contents of a compressed .gz file without uncompressing it (it is uncompressed on the fly, or in a temp directory), which makes perfect sense when dealing with large log files. This is done with the Z commands: zcat prints the contents, and with the -f option it works whether or not the file is actually gzipped, so it can be used on files without a .gz extension; zless and zmore page through compressed files the same way less and more do for plain ones.
The deflate format used by gzip compresses in part by finding a matching string somewhere in the immediately preceding 32K of the data and using a reference to the string with an offset and a length. So at any point the ability to decompress from that point depends on the last 32K, which itself depends on the 32K preceding it, and so on back to the beginning. Therefore to decompress the data at any point x in the stream, you need to have decompressed everything from 0 to x-1 first.
There are a few ways to mitigate this situation. First, if you want to frequently access a gzip file randomly, then you would be willing to go through the work of scanning the entire gzip file once and building an index. The index would have within it the previous 32K saved at each of some number of entry points, where the density of those entry points determines the speed of the random access. In the zlib source distribution you can see an example of this in examples/zran.c.
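The zran.c idea can be sketched in Python using zlib's ability to snapshot decompressor state with decompressobj.copy() (the snapshot carries the 32K window internally). This is a toy illustration, not zran.c itself: the stream is a small zlib-wrapped buffer rather than a gzip file, and the span, chunk size, and target offset are made up for the example.

```python
import zlib

# Illustrative data: a small zlib stream standing in for the big file.
raw = b"".join(b"log line %05d\n" % i for i in range(2000))
compressed = zlib.compress(raw)

# One full scan, snapshotting decompressor state (which holds the 32K
# window) roughly every `span` bytes of output.
span, chunk = 8192, 1024
index = [(0, 0, zlib.decompressobj())]   # (out_offset, in_offset, state)
d = zlib.decompressobj()
out_pos = in_pos = 0
next_mark = span
while in_pos < len(compressed):
    out_pos += len(d.decompress(compressed[in_pos:in_pos + chunk]))
    in_pos += chunk
    if out_pos >= next_mark:
        index.append((out_pos, in_pos, d.copy()))
        next_mark = out_pos + span

# Random access: resume from the nearest snapshot at or before the target
# uncompressed offset, decompressing only the input from that point on.
target = 15000
out_off, in_off, snap = max(e for e in index if e[0] <= target)
tail = snap.copy().decompress(compressed[in_off:])
```

The density of entry points (span) trades index size against random-access speed, exactly as with zran.c; a real index would persist the windows to disk instead of holding live decompressor objects.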
If you are in control of the generation of the gzip file, you can use the Z_FULL_FLUSH flush option periodically to erase the history of the last 32K at those points and allow random-access entry. You would then save the locations of those points as the index, which would not need the 32K blocks of history at each entry point. If those points are infrequent enough, there would be a vanishingly small impact on compression.
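A minimal sketch of the full-flush idea using Python's zlib (raw deflate via wbits=-15; the two chunks and the single index entry are illustrative):

```python
import zlib

# Compress two chunks with a full flush between them. The flush byte-aligns
# the stream and erases the 32K history, so decompression can start there.
co = zlib.compressobj(wbits=-15)            # raw deflate, no zlib header
part1 = co.compress(b"first chunk of log data\n")
part1 += co.flush(zlib.Z_FULL_FLUSH)        # random-access entry point
part2 = co.compress(b"second chunk of log data\n")
part2 += co.flush(zlib.Z_FINISH)

offset = len(part1)                         # index entry: where part2 starts
stream = part1 + part2

# Later: a fresh decompressor can start at the flush point, never
# touching part1, because no back-references cross the flush.
d = zlib.decompressobj(wbits=-15)
tail = d.decompress(stream[offset:])
```
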
With just the ability to write gzip output, you can do something similar to Z_FULL_FLUSH with a smidge more overhead by simply writing concatenated gzip streams. gunzip will accept and decode gzip streams that are put together with the cat command, and will write out a single stream of uncompressed data. You can build up a large gzip log in this way, remembering somewhere the offsets of the start of each gzip piece.
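That scheme can be sketched with Python's gzip module, which (like gunzip) decodes concatenated members as one stream; the BytesIO buffer and log contents here are illustrative stand-ins for a file on disk:

```python
import gzip
import io

buf = io.BytesIO()                      # stands in for the log file on disk
with gzip.GzipFile(fileobj=buf, mode='wb') as f:
    f.write(b"older log lines\n")
offset = buf.tell()                     # remember where the next member starts
with gzip.GzipFile(fileobj=buf, mode='ab') as f:   # append a second member
    f.write(b"newer log lines\n")

# The whole file still decodes as a single stream, like gunzip on the
# output of `cat a.gz b.gz`:
buf.seek(0)
whole = gzip.GzipFile(fileobj=buf, mode='rb').read()

# For the tail, seek straight to the last remembered offset and
# decompress only the final member:
buf.seek(offset)
tail = gzip.GzipFile(fileobj=buf, mode='rb').read()
```
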
If you are only interested in the tail, then you can do what you suggest in one of your comments, which is to simply maintain a cache elsewhere of the tail of the large gzip file.
I don't know if you are making the log file or not. If you are, you may want to look at the example of appending short log messages to a large gzip file efficiently, found again in the zlib source distribution.
A gzip file is a stream, so you'll have to read through it to get to the last line:
from gzip import GzipFile
from collections import deque

dq = deque(maxlen=1)              # keeps only the most recent line seen
with GzipFile('firewall.4.gz') as file:
    head = next(file)             # first line
    dq.extend(file)               # stream the rest; the deque retains the last line
tail = dq[0]