 

Incrementally Read Large Multipart Zipped Text File in Python

I have a very large zip file that has been split into multiple parts (a split archive), with a single file inside the archive. I do not have enough resources to combine the parts or to extract them (the raw text file is nearly 1 TB).

I would like to parse the text file line by line, ideally using something like this:

import zipfile

# filenames: list of paths to the zip parts
for zipfilename in filenames:
    with zipfile.ZipFile(zipfilename) as z:
        with z.open(...) as f:
            # iterate lazily, one line at a time
            for line in f:
                print(line)

Is this possible? If so, how can I read the text file:

  1. Without using too much memory (loading the whole file into memory is obviously out of the question)
  2. Without extracting any of the zip files
  3. (Ideally) Without combining the zip files

Thank you in advance for your help.

Jon asked Mar 13 '13 at 18:03





1 Answer

I'll take a stab.

If your zip files are true "split archives" as defined by the Zip file format, you won't be able to read them with either Python's zipfile library or the unzip terminal command.

If, on the other hand, you are dealing with a single zip archive that has been cut into pieces with the split command or a similar byte-splitting tool, you might be able to extract and read its contents on the fly in Python.

You will have to write a custom "file-like" class that implements the seek() and read() methods (and possibly others, such as tell()) and performs them on the split chunks.

seek() will need to work out which chunk contains the requested offset, open that chunk (if it is not already the one currently open) and perform a seek() on it, using the requested offset minus the offset at which that chunk starts.
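To make the offset arithmetic concrete, here is a minimal sketch. The helper name locate and the chunk sizes are made up for illustration; they do not come from the question.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(chunk_sizes, offset):
    """Map an offset in the combined file to (chunk index, offset inside that chunk)."""
    # starts[i] is the global offset at which chunk i begins;
    # the last entry is the total size of all chunks.
    starts = [0] + list(accumulate(chunk_sizes))
    if not 0 <= offset < starts[-1]:
        raise ValueError("offset outside the combined file")
    index = bisect_right(starts, offset) - 1
    return index, offset - starts[index]


# Three hypothetical chunks of 100, 100 and 50 bytes:
print(locate([100, 100, 50], 230))  # (2, 30)
```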

read() will read from the chunk that is currently open. When it hits the end of that chunk, it will open the next chunk and complete the read from it.

After you write and test this class, it will just be a matter of calling the ZipFile constructor passing an instance of your class as the "virtual zip" file object to open.
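The steps above could be sketched roughly as follows. This is only an illustration under some assumptions: the parts are plain byte slices of one zip archive (for example produced by split), they are passed in the right order, and the names SplitFile and iter_lines are invented here. Subclassing io.RawIOBase is one convenient way to get read() and the other file methods for free from readinto(); it is not the only way to do it.

```python
import io
import os
import zipfile
from bisect import bisect_right


class SplitFile(io.RawIOBase):
    """Read-only, seekable view of several byte-split parts as one file."""

    def __init__(self, paths):
        super().__init__()
        self._paths = list(paths)
        # Global start offset of every part, plus the total size at the end.
        self._starts = [0]
        for path in self._paths:
            self._starts.append(self._starts[-1] + os.path.getsize(path))
        self._size = self._starts[-1]
        self._pos = 0        # current position in the combined file
        self._index = None   # index of the part that is currently open
        self._part = None    # file object of the part that is currently open

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        # Only record the position; the right part is opened lazily on read.
        if whence == io.SEEK_CUR:
            offset += self._pos
        elif whence == io.SEEK_END:
            offset += self._size
        if offset < 0:
            raise ValueError("negative seek position")
        self._pos = offset
        return self._pos

    def readinto(self, buffer):
        view = memoryview(buffer).cast("B")
        done = 0
        while done < len(view) and self._pos < self._size:
            # Find the part holding the current position and switch to it.
            index = bisect_right(self._starts, self._pos) - 1
            if index != self._index:
                if self._part is not None:
                    self._part.close()
                self._part = open(self._paths[index], "rb")
                self._index = index
            self._part.seek(self._pos - self._starts[index])
            n = self._part.readinto(view[done:])
            if not n:  # part is shorter than its recorded size; stop
                break
            done += n
            self._pos += n
        return done

    def close(self):
        if self._part is not None:
            self._part.close()
            self._part = None
            self._index = None
        super().close()


def iter_lines(part_paths, member_name, encoding="utf-8"):
    """Yield decoded lines of one archive member without extracting it."""
    with SplitFile(part_paths) as raw, zipfile.ZipFile(raw) as archive:
        with archive.open(member_name) as member:
            for line in io.TextIOWrapper(member, encoding=encoding):
                yield line
```

Something like `for line in iter_lines(sorted(parts), "data.txt"): ...` (with your own part paths and member name) would then stream the text, keeping only one part open and a small buffer in memory at a time.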

Tobia answered Nov 14 '22 at 22:11
