
Read a large zipped text file line by line in python

Tags: python, stream, zip

I am trying to use the zipfile module to read a file in an archive. The uncompressed file is ~3 GB and the compressed file is 200 MB. I don't want either of them held in memory; I want to process the compressed file line by line. So far I have noticed excessive memory use with the following code:

import zipfile
f = open(...)
z = zipfile.ZipFile(f)
for line in z.open(...).readlines():
    print line

I did it in C# using the SharpZipLib:

var fStream = File.OpenRead("...");
var unzipper = new ICSharpCode.SharpZipLib.Zip.ZipFile(fStream);
var dataStream =  unzipper.GetInputStream(0);

dataStream is uncompressed. I can't seem to find a way to do it in Python. Help will be appreciated.

Sonia asked Jul 14 '12 08:07



1 Answer

Python file objects are iterators that yield one line at a time. file.readlines() reads every line and returns them all as a list, which means the whole file has to be held in memory. The better approach (which should always be preferred over readlines()) is to loop over the file object itself, e.g.:

import zipfile
with zipfile.ZipFile(...) as z:
    with z.open(...) as f:
        for line in f:
            print line

Note my use of the with statement - file objects are context managers, and the with statement lets us easily write readable code that ensures files are closed when the block is exited (even upon exceptions). This, again, should always be used when dealing with files.
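For Python 3, where z.open() yields bytes rather than text, a minimal sketch of the same idea might look like the following. The helper name iter_zip_lines, the throwaway archive demo.zip, its member data.txt, and the UTF-8 encoding are all assumptions made for illustration; none of them come from the question.

```python
import io
import os
import tempfile
import zipfile


def iter_zip_lines(zip_path, member, encoding="utf-8"):
    """Yield decoded lines from one archive member, one at a time.

    Hypothetical helper: the name and the default encoding are assumptions.
    """
    with zipfile.ZipFile(zip_path) as z:
        # z.open() returns a binary file-like object that decompresses as it is read.
        with z.open(member) as raw:
            # TextIOWrapper decodes bytes to str incrementally, so only a
            # small buffer is held in memory rather than the whole member.
            for line in io.TextIOWrapper(raw, encoding=encoding):
                yield line.rstrip("\n")


# Build a small throwaway archive so the sketch runs on its own.
tmp_dir = tempfile.mkdtemp()
demo_zip = os.path.join(tmp_dir, "demo.zip")
with zipfile.ZipFile(demo_zip, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("data.txt", "first\nsecond\nthird\n")

for line in iter_zip_lines(demo_zip, "data.txt"):
    print(line)
```

Wrapping the loop in a generator keeps both with blocks open only while lines are being consumed, so the caller still gets one line at a time without calling readlines().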

Gareth Latty answered Oct 14 '22 16:10