Python zipfile module can't extract filenames with Chinese characters

I'm trying to use a Python script to download files from a Chinese service provider (I'm not from China myself). The provider gives me a .zip file containing a file that seems to have Chinese characters in its name. This seems to be causing the zipfile module to barf.

Code:

import zipfile

f = "/path/to/zip_file.zip"

if zipfile.is_zipfile(f):
    fz = zipfile.ZipFile(f, 'r')

The name of the zip file itself doesn't contain any non-ASCII characters, but the name of the file inside it does. When I run the above script I get the following exception:

Traceback (most recent call last):
  File "./temp.py", line 9, in <module>
    fz = zipfile.ZipFile(f, 'r')
  File "/usr/lib/python2.7/zipfile.py", line 770, in __init__
    self._RealGetContents()
  File "/usr/lib/python2.7/zipfile.py", line 859, in _RealGetContents
    x.filename = x._decodeFilename()
  File "/usr/lib/python2.7/zipfile.py", line 379, in _decodeFilename
    return self.filename.decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xbd in position 30: invalid start byte

I've tried looking through the answers to many similar questions:

  • Read file with Chinese Characters
  • Extract zip files with non-unicode filenames
  • Extract files with invalid characters

Please correct me if I'm wrong, but it looks like an open issue with the zipfile module.

How do I get around this? Is there any alternative module for dealing with zipfiles that I should use? Or any other solution?

TIA.

Edit: I can access and unzip the same file without problems using the Linux command-line utility "unzip".

hyperwiser asked Dec 07 '16 14:12



2 Answers

Python 2.x (2.7) and Python 3.x handle non-UTF-8 filenames in the zipfile module slightly differently.

Both first check ZipInfo.flag_bits for the entry: if ZipInfo.flag_bits & 0x800 is set, the filename is decoded as UTF-8.

If that bit is not set, Python 2.x returns the filename as a raw byte string, while Python 3.x decodes it as cp437 and returns the decoded result. In neither version does the module know the true encoding of the filename.

So, suppose you got a filename from a ZipInfo object or from the ZipFile.namelist method, and you already know that it is encoded with the XXX encoding. This is how you get the correct unicode filename:

# in python 2.x
filename = filename.decode('XXX')


# in python 3.x
filename = filename.encode('cp437').decode('XXX')
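
The flag check and the Python 3 re-decoding described above can be combined in a small helper. This is only a sketch for Python 3: decoded_name is a hypothetical name, not part of zipfile, and the "gbk" default is an assumption about the archive that you would replace with whatever encoding you actually know.

```python
import zipfile


def decoded_name(info: zipfile.ZipInfo, encoding: str = "gbk") -> str:
    """Return the entry name, re-decoding it when the UTF-8 flag is not set.

    `encoding` is an assumption supplied by the caller; the archive itself
    does not record which legacy encoding was used.
    """
    if info.flag_bits & 0x800:
        # Bit 11 is set, so zipfile already decoded the name as UTF-8.
        return info.filename
    # Python 3 decoded the raw bytes as cp437; reverse that step and
    # decode the original bytes with the assumed encoding instead.
    return info.filename.encode("cp437").decode(encoding)
```

You would call it on each item of ZipFile.infolist(), for example [decoded_name(i) for i in zf.infolist()]. If the guessed encoding is wrong, the decode step may raise UnicodeDecodeError or silently produce the wrong characters.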
socrates answered Nov 15 '22 09:11


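I recently ran into the same problem. Here is my solution (Python 3); I hope it is useful for you.

import os
import shutil
import zipfile

# Python 3 decodes names that lack the UTF-8 flag as cp437,
# so undo that and decode the raw bytes as GBK instead.
with zipfile.ZipFile('/path/to/zip_file.zip', 'r') as f:
    for fileinfo in f.infolist():
        filename = fileinfo.filename.encode('cp437').decode('gbk')
        if fileinfo.is_dir():  # ZipInfo.is_dir() needs Python 3.6+
            os.makedirs(filename, exist_ok=True)
            continue
        # Create parent directories for entries stored in subfolders.
        os.makedirs(os.path.dirname(filename) or '.', exist_ok=True)
        with f.open(fileinfo) as src, open(filename, 'wb') as dst:
            shutil.copyfileobj(src, dst)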

UPDATE: You can use the following simpler solution with pathlib:

from pathlib import Path
import zipfile

with zipfile.ZipFile('/path/to/zip_file.zip', 'r') as f:
    for fn in f.namelist():
        extracted_path = Path(f.extract(fn))
        extracted_path.rename(fn.encode('cp437').decode('gbk'))
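
Another variant, sketched here for Python 3, fixes each ZipInfo.filename in place before extracting, so that zipfile's own extract() creates the directories and writes the files under the corrected names. extract_with_encoding is a hypothetical helper name and "gbk" is an assumed encoding. As far as I can tell this works because zipfile compares the local header against ZipInfo.orig_filename rather than filename, which is an implementation detail and not a documented guarantee.

```python
import zipfile


def extract_with_encoding(zip_path, dest_dir, encoding="gbk"):
    """Extract zip_path into dest_dir, treating unflagged names as `encoding`.

    Returns the list of corrected entry names. `encoding` is an assumption
    made by the caller about how the names were originally stored.
    """
    fixed_names = []
    with zipfile.ZipFile(zip_path, "r") as zf:
        for info in zf.infolist():
            if not info.flag_bits & 0x800:
                # Name was decoded as cp437; recover the bytes and re-decode.
                info.filename = info.filename.encode("cp437").decode(encoding)
            # extract() uses info.filename to build the output path.
            zf.extract(info, dest_dir)
            fixed_names.append(info.filename)
    return fixed_names
```

For example, extract_with_encoding('/path/to/zip_file.zip', 'out') would unpack the archive into an out directory. On Python 3.11 and later, opening the archive with zipfile.ZipFile(path, metadata_encoding='gbk') should give the same names without any manual re-decoding.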
secsilm answered Nov 15 '22 09:11