I don't want to use OS commands, as that makes it OS-dependent.
This is available in tarfile: tarfile.is_tarfile(filename) checks whether a file is a tar file or not.
I am not able to find any relevant commands in the gzip module.
EDIT:
Why do I need this: I have a list of gzip files; these vary in size (1-10 GB) and some are empty. Before reading a file (using pandas.read_csv), I want to check whether the file is empty, because for empty files I get an error in pandas.read_csv. (Error like: Expected 15 columns and found -1)
Sample command with error:
import pandas as pd
pd.read_csv('C:\Users\...\File.txt.gz', compression='gzip', names={'a', 'b', 'c'}, header=False)
Too many columns specified: expected 3 and found -1
pandas version is 0.16.2
File used for testing: it is just a gzip of an empty file.
Unfortunately, any such attempt will likely have a fair bit of overhead; it would likely be cheaper to catch the exception, as users commented above. A gzip file defines a few fixed-size regions, as follows:
Fixed Regions
First, there are 2 bytes for the Gzip magic number, 1 byte for the compression method, 1 byte for the flags, then 4 more bytes for the MTIME (file modification time), 1 byte for extra flags, and 1 more byte for the operating system, giving us a total of 10 bytes so far.
This looks as follows (from the link above):
+---+---+---+---+---+---+---+---+---+---+
|ID1|ID2|CM |FLG| MTIME |XFL|OS | (more-->)
+---+---+---+---+---+---+---+---+---+---+
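To make the layout above concrete, here is a minimal sketch that unpacks those fixed bytes with the standard struct module. It assumes Python 3; parse_fixed_header is a hypothetical helper name, not part of the gzip module.

```python
import gzip
import struct

def parse_fixed_header(data):
    """Unpack the fixed-size fields at the start of a gzip member."""
    if len(data) < 10:
        raise ValueError("need at least 10 bytes for the fixed header")
    # '<' = little-endian, no padding: four single bytes (ID1, ID2, CM, FLG),
    # one 4-byte unsigned int (MTIME), then two single bytes (XFL, OS).
    id1, id2, cm, flg, mtime, xfl, os_code = struct.unpack("<BBBBIBB", data[:10])
    return {
        "magic": (id1, id2),
        "cm": cm,
        "flg": flg,
        "mtime": mtime,
        "xfl": xfl,
        "os": os_code,
    }

# Example input: the gzip stream produced for an empty payload.
header = parse_fixed_header(gzip.compress(b""))
print(header)
```

The magic number should come back as (0x1f, 0x8b) and the compression method as 8 (deflate); the remaining fields depend on how and when the stream was written.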
Variable Regions
However, this is where things get tricky (and impossible to check without using a gzip module or another deflator).
If extra fields were set, there is a variable region of XLEN bytes set afterwards, which looks as follows:
(if FLG.FEXTRA set)
+---+---+=================================+
| XLEN |...XLEN bytes of "extra field"...| (more-->)
+---+---+=================================+
After this, there is then a region of N bytes, with a zero-terminated string for the file name (which is, by default, stored):
(if FLG.FNAME set)
+=========================================+
|...original file name, zero-terminated...| (more-->)
+=========================================+
We then have comments:
(if FLG.FCOMMENT set)
+===================================+
|...file comment, zero-terminated...| (more-->)
+===================================+
And finally, an optional CRC16 (present if FLG.FHCRC is set), a cyclic redundancy check used to verify the header, all before we get into the variable-length compressed data.
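As an illustration of why the header has no fixed size, the following sketch walks the optional fields in the order described above and reports where the compressed data would begin. It assumes Python 3 and the flag bit values from the gzip specification (RFC 1952); header_length and empty_gzip are hypothetical helper names introduced only for this example.

```python
import gzip
import io
import struct

# Flag bits in the FLG byte, per RFC 1952.
FHCRC, FEXTRA, FNAME, FCOMMENT = 0x02, 0x04, 0x08, 0x10

def header_length(data):
    """Return the number of header bytes before the compressed data starts."""
    flg = data[3]
    pos = 10  # fixed-size part of the header
    if flg & FEXTRA:
        (xlen,) = struct.unpack_from("<H", data, pos)
        pos += 2 + xlen
    if flg & FNAME:
        pos = data.index(b"\x00", pos) + 1  # skip zero-terminated name
    if flg & FCOMMENT:
        pos = data.index(b"\x00", pos) + 1  # skip zero-terminated comment
    if flg & FHCRC:
        pos += 2  # header CRC16
    return pos

def empty_gzip(filename=""):
    """Build an in-memory gzip stream with no payload."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=filename, mode="wb", fileobj=buf):
        pass
    return buf.getvalue()

anonymous = empty_gzip()          # no file name stored
named = empty_gzip("data.txt")    # file name stored in the header

print(header_length(anonymous), len(anonymous))
print(header_length(named), len(named))
```

Both streams decompress to nothing, yet their header lengths and total sizes differ, which is why comparing the file size against a constant is not a reliable emptiness test.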
Solution
So, any sort of fixed-size check will depend on whether the filename was stored or the file was written via a pipe (echo "Compress this data" | gzip -c > myfile.gz), and on other fields and comments, all of which can be defined for null files. So, how do we get around this? Simple: use the gzip module:
import gzip

def check_null(path):
    '''
    Returns an empty string for a null file, which is falsy,
    and a non-empty string otherwise (which is truthy).
    '''
    with gzip.GzipFile(path, 'rb') as f:
        return f.read(1)
This will check whether any data exists inside the file while reading only a small section of it. However, it still adds an extra open and read for every file, and it's easier to ask for forgiveness than permission:
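import contextlib  # Python 3 only; use a try/except block for Py2
import pandas as pd

# pd.parser.CParserError is the name used with pandas 0.16.2 (the question's
# version); newer releases expose parser errors under pandas.errors instead.
with contextlib.suppress(pd.parser.CParserError):
    df = pd.read_csv(path, compression='gzip', names=['a', 'b', 'c'], header=None)
    # do something here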
If you want to check whether a file is a valid Gzip file, you can open it and read one byte from it. If that succeeds, the file is quite probably a gzip file, with one caveat: an empty file also passes this test.
Thus we get
import gzip

def is_gz_file(name):
    with gzip.open(name, 'rb') as f:
        try:
            file_content = f.read(1)
            return True
        except Exception:
            # reading fails if the data is not valid gzip
            return False
However, as I stated earlier, a file which is empty (0 bytes) still passes this test, so you'd perhaps want to ensure that the file is not empty:
import gzip
import os

def is_gz_file(name):
    # a 0-byte file is not a gzip file at all
    if os.stat(name).st_size == 0:
        return False
    with gzip.open(name, 'rb') as f:
        try:
            file_content = f.read(1)
            return True
        except Exception:
            return False
EDIT:
As the question has now been changed to "a gzip file that doesn't have empty contents":
import gzip

def is_nonempty_gz_file(name):
    with gzip.open(name, 'rb') as f:
        try:
            file_content = f.read(1)
            # an empty payload yields zero bytes without raising
            return len(file_content) > 0
        except Exception:
            return False
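To see how this behaves on the different kinds of input discussed here, the sketch below builds four throwaway files in a temporary directory and runs the check on each. It assumes Python 3; the file names are made up for the example, and the function is repeated (with the same logic) only so the snippet runs on its own.

```python
import gzip
import os
import tempfile

def is_nonempty_gz_file(name):
    # Same logic as above, restated so this sketch is self-contained.
    try:
        with gzip.open(name, "rb") as f:
            return len(f.read(1)) > 0
    except Exception:
        return False

with tempfile.TemporaryDirectory() as tmp:
    names = ("zero_bytes.gz", "empty_payload.gz", "has_data.gz", "not_gzip.gz")
    paths = {n: os.path.join(tmp, n) for n in names}

    # 1. A 0-byte file (not even a gzip header).
    open(paths["zero_bytes.gz"], "wb").close()
    # 2. A valid gzip file whose payload is empty.
    with gzip.open(paths["empty_payload.gz"], "wb"):
        pass
    # 3. A valid gzip file with some CSV-like content.
    with gzip.open(paths["has_data.gz"], "wb") as f:
        f.write(b"a,b,c\n1,2,3\n")
    # 4. Plain text with a misleading .gz extension.
    with open(paths["not_gzip.gz"], "wb") as f:
        f.write(b"plain text, not gzip")

    results = {n: is_nonempty_gz_file(p) for n, p in paths.items()}
    sizes = {n: os.path.getsize(p) for n, p in paths.items()}

print(results)
print(sizes)
```

Only the file with real content should be reported as non-empty. Note that the gzip file with an empty payload still has a non-zero size on disk, so a plain size check would not have caught it.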