
How to read from a zip file within zip file in Python? [duplicate]

I have a file that I want to read that is itself zipped within a zip archive. For example, parent.zip contains child.zip, which contains child.txt. I am having trouble reading child.zip. Can anyone correct my code?

I assume that I need to create child.zip as a file-like object and then open it with a second instance of zipfile, but, being new to Python, my zipfile.ZipFile(zfile.open(name)) attempt is probably naive. It raises zipfile.BadZipfile: "File is not a zip file" on child.zip, which I have validated independently.

    import logging
    import re
    import zipfile

    with zipfile.ZipFile("parent.zip", "r") as zfile:
        for name in zfile.namelist():
            if re.search(r'\.zip$', name) is not None:
                # We have a zip within a zip
                with zipfile.ZipFile(zfile.open(name)) as zfile2:  # <-- raises BadZipfile
                    for name2 in zfile2.namelist():
                        # Now we can extract
                        logging.info("Found internal internal file: " + name2)
                        print("Processing code goes here")
Michael Collinson asked Aug 19 '12 09:08




1 Answer

When you call .open() on a ZipFile instance you do get an open file handle. However, to read a zip archive, the ZipFile class needs more than that. It must be able to seek on the file, because the central directory that lists the entries sits at the end of the archive, and the object returned by .open() is not seekable in your case. Only Python 3 (3.7 and up) produces a ZipExtFile object that supports seeking (provided the underlying file handle for the outer zip file is seekable, and nothing is trying to write to the ZipFile object).
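Whether the handle returned by .open() can seek depends on the interpreter version, so one way to find out on a given installation is to ask the handle directly. The following is only a minimal sketch, assuming Python 3; the in-memory archive and the entry name entry.txt are made up for illustration.

```python
import io
import zipfile

# Build a small throwaway archive in memory (hypothetical entry name).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("entry.txt", "some text")

# Open one entry and ask the returned handle whether it supports seek().
with zipfile.ZipFile(buf) as zf:
    with zf.open("entry.txt") as handle:
        handle_seekable = handle.seekable()

print("entry handle seekable:", handle_seekable)
```

If this prints False, passing the handle straight to ZipFile cannot work and the entry has to be buffered first.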

The workaround is to read the whole zip entry into memory using .read(), store it in a BytesIO object (an in-memory file that is seekable) and feed that to ZipFile:

    from io import BytesIO

    # ...
            zfiledata = BytesIO(zfile.read(name))
            with zipfile.ZipFile(zfiledata) as zfile2:

or, in the context of your example:

    import logging
    import re
    import zipfile
    from io import BytesIO

    with zipfile.ZipFile("parent.zip", "r") as zfile:
        for name in zfile.namelist():
            if re.search(r'\.zip$', name) is not None:
                # We have a zip within a zip: buffer it so it is seekable
                zfiledata = BytesIO(zfile.read(name))
                with zipfile.ZipFile(zfiledata) as zfile2:
                    for name2 in zfile2.namelist():
                        # Now we can extract
                        logging.info("Found internal internal file: " + name2)
                        print("Processing code goes here")
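To try the BytesIO approach without a parent.zip on disk, a self-contained sketch along these lines may help. It assumes Python 3, builds the parent.zip / child.zip / child.txt layout from the question in memory, and uses a hypothetical helper, read_nested_members, that is not part of the zipfile module.

```python
import io
import zipfile


def read_nested_members(outer_source):
    """Return {"outer entry/inner entry": bytes} for each zip nested in the outer zip.

    Hypothetical helper for illustration; it loads every nested archive fully
    into memory, so it is only suitable for archives that fit in RAM.
    """
    contents = {}
    with zipfile.ZipFile(outer_source, "r") as outer:
        for name in outer.namelist():
            if not name.lower().endswith(".zip"):
                continue
            # Read the whole nested archive into a seekable in-memory buffer.
            inner_data = io.BytesIO(outer.read(name))
            with zipfile.ZipFile(inner_data) as inner:
                for inner_name in inner.namelist():
                    contents[name + "/" + inner_name] = inner.read(inner_name)
    return contents


# Build parent.zip -> child.zip -> child.txt in memory so the sketch is runnable.
child_buf = io.BytesIO()
with zipfile.ZipFile(child_buf, "w") as child:
    child.writestr("child.txt", "hello from child")

parent_buf = io.BytesIO()
with zipfile.ZipFile(parent_buf, "w") as parent:
    parent.writestr("child.zip", child_buf.getvalue())
parent_buf.seek(0)

result = read_nested_members(parent_buf)
print(result)
```

The same helper should also accept a path such as "parent.zip", since ZipFile takes either a filename or a file-like object.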
Martijn Pieters answered Sep 19 '22 15:09
