
Parsing large, possibly compressed, files in Python

I am trying to parse a large file, line by line, for relevant information. I may receive either an uncompressed or a gzipped file (I may have to extend this to zip files at a later stage).

I am using the following code, but because the loop is outside the with statement, I suspect I am not parsing the file line by line and am in fact loading the entire file into memory as file_content.

import gzip

if ".gz" in FILE_LIST['INPUT_FILE']:
    with gzip.open(FILE_LIST['INPUT_FILE']) as input_file:
        file_content = input_file.readlines()
else:
    with open(FILE_LIST['INPUT_FILE']) as input_file:
        file_content = input_file.readlines()

for line in file_content:
    pass  # do stuff

Any suggestions for how I should handle this? I would prefer not to unzip the file outside the code block, as this needs to be generic, and I would have to tidy up multiple files.

asked Aug 21 '17 at 13:08 by AllynH




1 Answer

readlines() reads the whole file into a list, so it's a no-go for big files.

Iterating over the input_file handle after the two with blocks wouldn't work either (operation on a closed file), so the loop has to live inside the context block.

To get the best of both worlds, I would use a ternary conditional to choose the opener (open or gzip.open), then iterate over the lines inside a single with block.

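import gzip

open_function = gzip.open if FILE_LIST['INPUT_FILE'].endswith(".gz") else open
with open_function(FILE_LIST['INPUT_FILE'], "rt") as input_file:
    for line in input_file:
        pass  # do stuff with line

Note that I have passed the "rt" mode to make sure we work on text, not binary: gzip.open defaults to binary ("rb"), and a plain "r" is still binary there. The built-in open accepts "rt" too, where it means the same as "r".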
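A quick way to check which mode yields text is to read the same gzip file twice. This is a minimal sketch, assuming Python 3; the temporary file name and its contents are made up for illustration.

```python
import gzip
import os
import tempfile

# Write a tiny gzip file to a temporary location (illustrative data only).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.txt.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write("first line\nsecond line\n")

# For gzip.open, "r" is the same as "rb": each line is a bytes object.
with gzip.open(path, "r") as f:
    binary_lines = list(f)

# "rt" wraps the stream in a text decoder: each line is a str.
with gzip.open(path, "rt", encoding="utf-8") as f:
    text_lines = list(f)

print(type(binary_lines[0]).__name__, type(text_lines[0]).__name__)
```

With bytes lines, a test such as "error" in line raises a TypeError, which is usually the first symptom of having opened the file in the wrong mode.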

Alternative: open_function can be made generic so it doesn't depend on FILE_LIST['INPUT_FILE']:

open_function = lambda f: gzip.open(f, "rt") if f.endswith(".gz") else open(f)

Once defined, you can reuse it at will:

with open_function(FILE_LIST['INPUT_FILE']) as input_file:
    for line in input_file:
        pass  # do stuff with line
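If the file name cannot be trusted to end in ".gz", one option is to look at the first two bytes instead, since every gzip stream starts with 0x1f 0x8b. The sketch below is only an illustration of that idea: open_text and count_matching_lines are hypothetical helper names, and the sample data is invented.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_text(path, encoding="utf-8"):
    """Open a plain or gzip-compressed file for reading text (hypothetical helper)."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rt", encoding=encoding)
    return open(path, "r", encoding=encoding)


def count_matching_lines(path, needle):
    """Stream the file line by line; the whole file is never loaded at once."""
    count = 0
    with open_text(path) as handle:
        for line in handle:
            if needle in line:
                count += 1
    return count


# Demo on two temporary files with the same content.
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "log.txt")
gz_path = os.path.join(tmp_dir, "log.compressed")  # no ".gz" suffix on purpose
content = "ok\nerror: disk\nok\nerror: net\n"
with open(plain_path, "w", encoding="utf-8") as f:
    f.write(content)
with gzip.open(gz_path, "wt", encoding="utf-8") as f:
    f.write(content)

plain_count = count_matching_lines(plain_path, "error")
gz_count = count_matching_lines(gz_path, "error")
print(plain_count, gz_count)
```

Because the loop runs inside the with block and iterates over the handle directly, memory use stays roughly constant regardless of file size.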

answered Oct 08 '22 at 15:10 by Jean-François Fabre