
How to check empty gzip file in Python

I don't want to use OS commands, as that makes the check OS-dependent.

This is available in tarfile, tarfile.is_tarfile(filename), to check if a file is a tar file or not.

I am not able to find any relevant commands in the gzip module.


EDIT: Why do I need this: I have a list of gzip files that vary in size (1-10 GB), and some are empty. Before reading a file (using pandas.read_csv), I want to check whether it is empty, because pandas.read_csv raises an error for empty files (an error like: Expected 15 columns and found -1).

Sample command with error:

import pandas as pd
pd.read_csv(r'C:\Users\...\File.txt.gz', compression='gzip', names={'a', 'b', 'c'}, header=False)
Too many columns specified: expected 3 and found -1

pandas version is 0.16.2

The file used for testing is just a gzip of an empty file.

Vipin asked Jun 17 '16 06:06


2 Answers

Unfortunately, any such attempt will likely have a fair bit of overhead; it would likely be cheaper to catch the exception, as users suggested in the comments. A gzip file defines a few fixed-size regions, as follows:

Fixed Regions

First, there are 2 bytes for the gzip magic number, 1 byte for the compression method, 1 byte for the flags, 4 bytes for MTIME (the modification time), 1 byte for extra flags, and 1 byte for the operating system, giving us a total of 10 bytes so far.

This looks as follows (diagram from the gzip file format specification, RFC 1952):

+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+
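As an illustration of the fixed layout in the diagram, here is a minimal sketch that unpacks those ten bytes with the struct module. The helper name parse_fixed_header and the in-memory sample are assumptions made for this example, not part of the gzip module's API.

```python
import gzip
import struct

# One byte each for ID1, ID2, CM, FLG, then a 4-byte little-endian MTIME,
# then one byte each for XFL and OS, matching the diagram.
FIXED_HEADER = struct.Struct("<BBBBIBB")


def parse_fixed_header(raw):
    """Hypothetical helper: unpack the fixed part of a gzip header into a dict."""
    fields = FIXED_HEADER.unpack(raw[:FIXED_HEADER.size])
    id1, id2, cm, flg, mtime, xfl, os_code = fields
    if (id1, id2) != (0x1F, 0x8B):
        raise ValueError("not a gzip stream: bad magic number")
    return {"cm": cm, "flg": flg, "mtime": mtime, "xfl": xfl, "os": os_code}


# Compress empty input in memory so there is a real header to look at.
sample = gzip.compress(b"")
header = parse_fixed_header(sample)

print(FIXED_HEADER.size)  # 10
print(header["cm"])       # 8, the code for deflate
```

Note that this only reads the fixed part; it says nothing yet about whether the compressed payload is empty.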

Variable Regions

However, this is where things get tricky (and impossible to check without using a gzip module or another DEFLATE decompressor).

If the FEXTRA flag is set, a 2-byte XLEN field follows, and after it comes a variable region of XLEN bytes, which looks as follows:

(if FLG.FEXTRA set)
+---+---+=================================+
| XLEN  |...XLEN bytes of "extra field"...| (more-->)
+---+---+=================================+

After this comes a variable-length region holding the original file name as a zero-terminated string (the gzip tool stores it by default):

(if FLG.FNAME set)
+=========================================+
|...original file name, zero-terminated...| (more-->)
+=========================================+

We then have comments:

(if FLG.FCOMMENT set)
+===================================+
|...file comment, zero-terminated...| (more-->)
+===================================+

And finally, if FLG.FHCRC is set, there is a CRC16 (a 2-byte cyclic redundancy check used to verify the header), all before we get into the variable-length compressed data.
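To make the variable part concrete, the following sketch walks the optional fields in the order described above and reports where the compressed data begins. The function names gzip_header_length and gzip_empty are hypothetical helpers for this example; the flag bit values are the ones defined in RFC 1952.

```python
import gzip
import io
import struct

# Bit masks for the FLG byte (values from RFC 1952).
FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10


def gzip_header_length(raw):
    """Hypothetical helper: number of header bytes before the compressed data."""
    if raw[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream: bad magic number")
    flg = raw[3]
    pos = 10  # the fixed part of the header
    if flg & FEXTRA:
        (xlen,) = struct.unpack_from("<H", raw, pos)
        pos += 2 + xlen  # XLEN field plus the extra field itself
    if flg & FNAME:
        pos = raw.index(b"\x00", pos) + 1  # skip the zero-terminated name
    if flg & FCOMMENT:
        pos = raw.index(b"\x00", pos) + 1  # skip the zero-terminated comment
    if flg & FHCRC:
        pos += 2  # header CRC16
    return pos


def gzip_empty(filename):
    """Hypothetical helper: gzip of empty input, recording `filename` in the header."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=filename, mode="wb", fileobj=buf):
        pass
    return buf.getvalue()


anonymous = gzip_empty("")       # no stored name, as when writing via a pipe
named = gzip_empty("File.txt")   # name stored in the header

print(gzip_header_length(anonymous))  # 10
print(gzip_header_length(named))      # 19
```

Both streams hold the same empty payload, yet their total sizes differ by the length of the stored name, which is why comparing the file size against a fixed constant is unreliable.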

Solution

So, any sort of fixed-size check will depend on whether the file name was stored (it is not when the data was written via a pipe, e.g. echo "Compress this data" | gzip -c > myfile.gz), and on the extra fields and comments, all of which can be present for null files. So, how do we get around this? Simple, use the gzip module:

import gzip

def check_null(path):
    '''
    Returns an empty bytes object for a null file, which is falsy,
    and returns a non-empty bytes object otherwise (which is truthy).
    '''

    with gzip.GzipFile(path, 'rb') as f:
        return f.read(1)

This will check if any data exists inside the compressed file, while only reading a small section of the data. However, doing this for every file adds up, and it's easier to ask for forgiveness than permission:

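import contextlib       # Python 3 only; use a try/except block on Python 2
import pandas as pd

# pd.parser.CParserError is the name in older pandas such as 0.16.2;
# newer releases expose parser exceptions under pandas.errors instead.
with contextlib.suppress(pd.parser.CParserError):
    df = pd.read_csv(path, compression='gzip', names=['a', 'b', 'c'], header=None)
    # do something here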
Alexander Huszagh answered Oct 09 '22 03:10


If you want to check whether a file is a valid gzip file, you can open it and read one byte from it. If that succeeds, the file is quite probably a gzip file, with one caveat: an empty file also passes this test.

Thus we get

import gzip

def is_gz_file(name):
    with gzip.open(name, 'rb') as f:
        try:
            f.read(1)  # forces the gzip header to be parsed
            return True
        except Exception:  # bad magic number, truncated or corrupt stream
            return False

However, as I stated earlier, a file which is empty (0 bytes) still passes this test, so you'd perhaps want to ensure that the file is not empty:

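import gzip
import os

def is_gz_file(name):
    # A zero-byte file cannot contain a gzip header at all.
    if os.stat(name).st_size == 0:
        return False

    with gzip.open(name, 'rb') as f:
        try:
            f.read(1)  # forces the gzip header to be parsed
            return True
        except Exception:  # bad magic number, truncated or corrupt stream
            return False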

EDIT:

Since the question has now been changed to ask for "a gzip file that doesn't have empty contents":

import gzip

def is_nonempty_gz_file(name):
    with gzip.open(name, 'rb') as f:
        try:
            file_content = f.read(1)
            return len(file_content) > 0
        except Exception:  # bad magic number, truncated or corrupt stream
            return False
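For a quick sanity check of the read-one-byte approach, here is a small sketch that builds four throwaway files in a temporary directory and runs the same test on each. The helper name has_gzip_payload and the file names are made up for this example, and the behaviour shown assumes Python 3.

```python
import gzip
import os
import tempfile
import zlib


def has_gzip_payload(path):
    """Hypothetical helper: True only for a gzip file holding at least one byte."""
    with gzip.open(path, "rb") as f:
        try:
            return len(f.read(1)) > 0
        except (OSError, EOFError, zlib.error):
            # Bad magic number, truncated stream, or corrupt deflate data.
            return False


with tempfile.TemporaryDirectory() as tmp:
    # 1. A file with no bytes at all.
    zero_bytes = os.path.join(tmp, "zero.gz")
    open(zero_bytes, "wb").close()

    # 2. A valid gzip stream wrapping empty input (the case from the question).
    empty_payload = os.path.join(tmp, "empty.gz")
    with gzip.open(empty_payload, "wb"):
        pass

    # 3. A valid gzip stream with some CSV data.
    with_data = os.path.join(tmp, "data.gz")
    with gzip.open(with_data, "wb") as f:
        f.write(b"a,b,c\n1,2,3\n")

    # 4. Plain text that merely carries a .gz extension.
    plain_text = os.path.join(tmp, "plain.gz")
    with open(plain_text, "wb") as f:
        f.write(b"a,b,c\n")

    results = {
        "zero_bytes": has_gzip_payload(zero_bytes),
        "empty_payload": has_gzip_payload(empty_payload),
        "with_data": has_gzip_payload(with_data),
        "plain_text": has_gzip_payload(plain_text),
    }

print(results)
# {'zero_bytes': False, 'empty_payload': False, 'with_data': True, 'plain_text': False}
```

Only the file with real data returns True, so the same check could be used to filter a list of paths before handing them to pandas.read_csv.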