I am trying to parse a large file, line by line, for relevant information. I may be receiving either an uncompressed or gzipped file (I may have to edit for zip file at a later stage).
I am using the following code but I feel that, because I am not inside the with
statement, I am not parsing the file line by line and am in fact loading the entire file file_content
into memory.
if ".gz" in FILE_LIST['INPUT_FILE']:
with gzip.open(FILE_LIST['INPUT_FILE']) as input_file:
file_content = input_file.readlines()
else:
with open(FILE_LIST['INPUT_FILE']) as input_file:
file_content = input_file.readlines()
for line in file_content:
# do stuff
Any suggestions for how I should handle this? I would prefer not to unzip the file outside the code block, as this needs to be generic, and I would have to tidy up multiple files.
open() This function opens a gzip-compressed file in binary or text mode and returns a file like object, which may be physical file, a string or byte object. By default, the file is opened in 'rb' mode i.e. reading binary data, however, the mode parameter to this function can take other modes as listed below.
To create your own compressed ZIP files, you must open the ZipFile object in write mode by passing 'w' as the second argument. When you pass a path to the write() method of a ZipFile object, Python will compress the file at that path and add it into the ZIP file.
zlib is a library and Python module that provides code for working with Deflate compression and decompression format which is used by zip , gzip and many others. So, by using this Python module, you're essentially using gzip compatible compression algorithm without the convenient wrapper.
readlines
reads the file fully. So it's a no-go for big files.
Doing 2 context blocks like you're doing and then using the input_file
handle outside them doesn't work (operation on closed file).
To get best of both worlds, I would use a ternary conditional for the context block (which determines if open
or gzip.open
must be used), then iterate on the lines.
open_function = gzip.open if ".gz" in FILE_LIST['INPUT_FILE'] else open
with open_function(FILE_LIST['INPUT_FILE'],"r") as input_file:
for line in input_file:
note that I have added the "r" mode to make sure to work on text not on binary (gzip.open
defaults to binary)
Alternative: open_function
can be made generic so it doesn't depend on FILE_LIST['INPUT_FILE']
:
open_function = lambda f: gzip.open(f,"r") if ".gz" in f else open(f)
once defined, you can reuse it at will
with open_function(FILE_LIST['INPUT_FILE']) as input_file:
for line in input_file:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With