Parsing large, possibly compressed, files in Python

Tags: python, gzip, python-2.7

I am trying to parse a large file, line by line, for relevant information. The file may be either uncompressed or gzipped (I may have to add zip support at a later stage).

I am using the following code, but I suspect that, because the loop is outside the with statement, I am not parsing the file line by line and am in fact loading the entire file into memory as file_content.

if ".gz" in FILE_LIST['INPUT_FILE']:
    with gzip.open(FILE_LIST['INPUT_FILE']) as input_file:
        file_content = input_file.readlines()
else:
    with open(FILE_LIST['INPUT_FILE']) as input_file:
        file_content = input_file.readlines()

for line in file_content:
    pass  # do stuff

Any suggestions for how I should handle this? I would prefer not to unzip the file outside the code block, as this needs to be generic, and I would have to tidy up multiple files.

asked Aug 21 '17 13:08

AllynH

1 Answer

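readlines() reads the whole file into a list of lines, so it's a no-go for big files.

Using two context blocks as you do and then reading from the input_file handle outside them would not work either: the file is already closed by then (ValueError: I/O operation on closed file). Your loop only runs because file_content already holds every line in memory.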
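To make both points concrete, here is a small sketch, assuming Python 3. The temporary file and the names sample_path, count_lines_lazily and read_after_close are made up for illustration. Iterating over the handle inside the with block yields one line at a time, while touching the handle after the block raises ValueError.

```python
import os
import tempfile


def count_lines_lazily(path):
    """Count lines by iterating the handle, holding one line at a time."""
    count = 0
    with open(path) as handle:
        for _line in handle:
            count += 1
    return count


def read_after_close(path):
    """Return the error message raised when reading a closed handle."""
    with open(path) as handle:
        pass  # the handle is closed as soon as the with block exits
    try:
        handle.readline()
    except ValueError as exc:
        return str(exc)
    return None


# Hypothetical sample file, created only for this demonstration.
fd, sample_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "w") as out:
    out.write("a\nb\nc\n")

line_count = count_lines_lazily(sample_path)
closed_error = read_after_close(sample_path)
os.remove(sample_path)
```

The same iteration pattern works on the object returned by gzip.open, which is why a single with block can serve both cases.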

To get the best of both worlds, I would use a conditional expression to choose between open and gzip.open, use the result in a single with block, and iterate over the lines inside it.

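import gzip

open_function = gzip.open if ".gz" in FILE_LIST['INPUT_FILE'] else open
with open_function(FILE_LIST['INPUT_FILE'], "rt") as input_file:
    for line in input_file:
        pass  # do stuff with line

Note that I pass the "rt" mode so that both functions return text. gzip.open defaults to binary, and in Python 3 a plain "r" still means binary there, so the "t" is needed. In Python 2.7, open and gzip.open both yield str lines, so "r" is enough.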
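The mode difference is easy to check. The following is a minimal sketch, assuming Python 3, with a throwaway gzipped file (the names tmp_dir and gz_path are illustrative only). It reads the same file once with "r" and once with "rt".

```python
import gzip
import os
import tempfile

# Hypothetical sample file, created only for this demonstration.
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "sample.txt.gz")
with gzip.open(gz_path, "wb") as out:
    out.write(b"first\nsecond\n")

# "r" is binary for gzip.open: each line is a bytes object.
with gzip.open(gz_path, "r") as handle:
    binary_lines = [line for line in handle]

# "rt" is text: each line is a str, as with the built-in open.
with gzip.open(gz_path, "rt") as handle:
    text_lines = [line for line in handle]

os.remove(gz_path)
os.rmdir(tmp_dir)
```

With "r", code that compares lines against str values (for example "ERROR" in line) fails with a TypeError in Python 3, which is why the text mode matters when the same loop has to handle both kinds of file.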

Alternative: open_function can be made generic so it doesn't depend on FILE_LIST['INPUT_FILE']:

open_function = lambda f: gzip.open(f, "rt") if ".gz" in f else open(f)

Once defined, you can reuse it at will:

with open_function(FILE_LIST['INPUT_FILE']) as input_file:
    for line in input_file:
        pass  # do stuff with line
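Testing for ".gz" anywhere in the name can misfire on paths such as notes.gz.txt, and it misses gzipped files that carry another extension. One alternative, which is not part of the approach above, is to look at the first two bytes of the file: every gzip stream starts with 0x1f 0x8b. The sketch below assumes Python 3; open_text, read_lines and the sample file names are hypothetical.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_text(path):
    """Return a text-mode handle for path, whether gzipped or plain."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    return opener(path, "rt")


def read_lines(path):
    """Collect stripped lines; used here only to check the helper."""
    with open_text(path) as handle:
        return [line.rstrip("\n") for line in handle]


# Hypothetical sample files with deliberately misleading names.
tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "notes.gz.txt")  # plain text despite ".gz"
packed_path = os.path.join(tmp_dir, "notes.dat")    # gzipped despite no ".gz"
with open(plain_path, "wb") as out:
    out.write(b"alpha\nbeta\n")
with gzip.open(packed_path, "wb") as out:
    out.write(b"alpha\nbeta\n")

plain_lines = read_lines(plain_path)
packed_lines = read_lines(packed_path)

os.remove(plain_path)
os.remove(packed_path)
os.rmdir(tmp_dir)
```

The cost is one extra open per file to read two bytes. The call site stays the same as before: with open_text(path) as input_file, then loop over the lines.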

answered Oct 08 '22 15:10

Jean-François Fabre
