
What Is The Best Python Zip Module To Handle Large Files?

EDIT: Specifically compression and extraction speeds.

Any Suggestions?

Thanks

Duck asked Nov 18 '09 22:11



2 Answers

So I made a random-ish large zipfile:

$ ls -l *zip
-rw-r--r--  1 aleax  5000  115749854 Nov 18 19:16 large.zip
$ unzip -l large.zip | wc
   23396   93633 2254735

i.e., 116 MB with 23.4K files in it, and timed things:

$ time unzip -d /tmp large.zip >/dev/null

real    0m14.702s
user    0m2.586s
sys     0m5.408s

this is the system-supplied command-line unzip binary -- no doubt as finely tuned and optimized as a pure C executable can be. Then (after cleaning up /tmp;-)...:

$ time py26 -c'from zipfile import ZipFile; z=ZipFile("large.zip"); z.extractall("/tmp")'

real    0m13.274s
user    0m5.059s
sys     0m5.166s

...and this is Python with its standard library: more demanding of CPU time (about 5.1s user versus 2.6s), but roughly 10% faster in real, that is, elapsed time (13.3s versus 14.7s).

You're welcome to repeat such measurements of course (on your specific platform -- if it's CPU-poor, e.g. a slow ARM chip, then Python's extra demands of CPU time may end up making it slower -- and on your specific zipfiles of interest, since each large zipfile will have a very different mix of contents and quite possibly different performance). But what this suggests to me is that there isn't that much room to build a Python extension much faster than good old zipfile -- since Python using it beats the pure-C, system-included unzip!-)
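If you'd rather repeat the measurement from inside Python than with the shell's time, here is a minimal sketch. It assumes Python 3 (the timings above were taken with Python 2.6), and the helper names time_extractall and make_sample_zip are made up for illustration; point time_extractall at your own archive to get numbers that mean something for your data.

```python
import os
import tempfile
import time
import zipfile


def time_extractall(zip_path, dest_dir):
    """Return wall-clock seconds taken to extract every member of zip_path."""
    start = time.perf_counter()
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest_dir)
    return time.perf_counter() - start


def make_sample_zip(zip_path, n_files=200, size=4096):
    """Build a throwaway archive (hypothetical test data, not a real workload)."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for i in range(n_files):
            # Random bytes barely compress, so this mostly exercises I/O.
            zf.writestr("dir%d/file%d.bin" % (i % 10, i), os.urandom(size))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work:
        archive = os.path.join(work, "sample.zip")
        make_sample_zip(archive)
        elapsed = time_extractall(archive, os.path.join(work, "out"))
        print("extractall took %.3f s" % elapsed)
```

This only measures elapsed time, which corresponds to the "real" figure above; it does not split out user and sys time.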

Alex Martelli answered Nov 15 '22 20:11



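For handling large files without loading them into memory, use the new stream-based methods in Python 2.6's version of zipfile, such as ZipFile.open. Don't use extract or extractall unless you have strongly sanitised the filenames in the ZIP: in Python 2.6 a member whose name is an absolute path or contains .. components can end up written outside the target directory.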

(You used to have to read all the bytes into memory, or hack around it like zipstream; this is now obsolete.)
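As a rough illustration of the streaming approach, the sketch below copies each member to disk in fixed-size chunks via ZipFile.open and refuses any member name that would resolve outside the destination. It assumes Python 3; safe_target and stream_extract are hypothetical helper names, not part of zipfile, and the path check is one possible sanitisation rather than a complete one for every platform.

```python
import os
import shutil
import zipfile


def safe_target(dest_dir, member_name):
    """Return the output path for member_name, or raise ValueError if it
    would land outside dest_dir (absolute path or '..' components)."""
    dest_root = os.path.realpath(dest_dir)
    target = os.path.realpath(os.path.join(dest_root, member_name))
    if os.path.commonpath([dest_root, target]) != dest_root:
        raise ValueError("unsafe member name: %r" % member_name)
    return target


def stream_extract(zip_path, dest_dir, chunk_size=64 * 1024):
    """Extract every file member, holding at most chunk_size bytes of
    decompressed data in memory at a time. Returns the paths written."""
    written = []
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if info.filename.endswith("/"):
                continue  # directory entry; parent dirs are created below
            target = safe_target(dest_dir, info.filename)
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # zf.open returns a file-like object that decompresses lazily.
            with zf.open(info) as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst, chunk_size)
            written.append(target)
    return written
```

Called as stream_extract("large.zip", "/tmp/out"), this never reads a whole member into memory. Empty directories in the archive are skipped in this sketch.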

bobince answered Nov 15 '22 20:11
