EDIT: Specifically compression and extraction speeds.
Any suggestions?
Thanks
So I made a random-ish large zipfile:
$ ls -l *zip
-rw-r--r-- 1 aleax 5000 115749854 Nov 18 19:16 large.zip
$ unzip -l large.zip | wc
23396 93633 2254735
i.e., 116 MB with 23.4K files in it, and timed things:
$ time unzip -d /tmp large.zip >/dev/null
real 0m14.702s
user 0m2.586s
sys 0m5.408s
This is the system-supplied command-line unzip binary -- no doubt as finely tuned and optimized as a pure-C executable can be. Then (after cleaning up /tmp;-)...:
$ time py26 -c'from zipfile import ZipFile; z=ZipFile("large.zip"); z.extractall("/tmp")'
real 0m13.274s
user 0m5.059s
sys 0m5.166s
...and this is Python with its standard library -- a bit more demanding of CPU time, but over 10% faster in real, i.e., elapsed, time.
You're welcome to repeat such measurements, of course, on your specific platform (if it's CPU-poor, e.g., a slow ARM chip, Python's extra demand for CPU time may end up making it slower) and with your specific zipfiles of interest, since each large zipfile will have a very different mix of contents and quite possibly different performance. But what this suggests to me is that there isn't much room to build a Python extension that's much faster than good old zipfile -- since Python using it beats the pure-C, system-included unzip!-)
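If you'd rather repeat the measurement as a small script than as a one-liner, here's a minimal sketch (the archive name large.zip and the /tmp target are just the ones from the timings above -- substitute your own):

import time
import zipfile

# Time zipfile's extractall against your own archive; compare the
# elapsed figure with `time unzip -d /tmp large.zip` on the same box.
start = time.time()
z = zipfile.ZipFile("large.zip")
z.extractall("/tmp")
z.close()
print "extractall: %.3f seconds elapsed" % (time.time() - start)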
For handling large files without loading them into memory, use the new stream-based methods in Python 2.6's version of zipfile, such as ZipFile.open. Don't use extract or extractall unless you have strongly sanitised the filenames in the ZIP.

(You used to have to read all the bytes into memory, or hack around it with something like zipstream; this is now obsolete.)
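As a sketch of what that streaming approach looks like (the 64 KB chunk size is arbitrary, and the commented-out process(chunk) call stands in for whatever you actually do with the data):

import zipfile

z = zipfile.ZipFile("large.zip")
for name in z.namelist():
    # Sanitise member names yourself before ever writing to disk:
    # skip absolute paths and any ".." path components.
    if name.startswith("/") or ".." in name.split("/"):
        continue
    member = z.open(name)   # file-like object; nothing is slurped into memory
    while True:
        chunk = member.read(64 * 1024)
        if not chunk:
            break
        # process(chunk)  -- e.g., write it to a path you have sanitised
    member.close()
z.close()

The filename check here is deliberately crude; the point is simply that the archive, not you, chose those names, so vet them before they touch your filesystem.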