
Using Unicode characters for file names inside a zip archive

Tags: java, file, zip

I am zipping a file whose name contains special characters, such as Péréquation LES HOPITAUX NEUFS.xls, into a different folder, say temp.

The file is zipped, but its name changes automatically to P+¬r+¬quation LES HOPITAUX NEUFS.xls.

How can I support unicode characters for file names inside a zip archive?

Maddy asked Apr 02 '12 10:04


2 Answers

It depends a little on what code you're using to create the archive. The old Java compression classes are not as flexible as you need.

You may use Apache Commons Compress. Michael Simons wrote this nice piece of code:

ZipArchiveOutputStream ostream = ...; // Your initialization code here
ostream.setEncoding("Cp437"); // This should handle your "special" characters
ostream.setFallbackToUTF8(true); // For "unknown" characters!
ostream.setUseLanguageEncodingFlag(true);
ostream.setCreateUnicodeExtraFields(
    ZipArchiveOutputStream.UnicodeExtraFieldPolicy.NOT_ENCODEABLE);

If you're using Java 7 or later, you finally have a Charset parameter (which can be UTF-8) on the ZipOutputStream constructor.
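A minimal sketch of that constructor in use, assuming Java 8 or later (for UncheckedIOException). The class and method names (Utf8ZipSketch, zipSingleEntry, firstEntryName) are illustrative and not part of any library. The file name is written with Unicode escapes so the example does not depend on the source file's own encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, for illustration only.
final class Utf8ZipSketch {

    // Builds an in-memory archive with one entry whose name is stored as UTF-8.
    static byte[] zipSingleEntry(String entryName, byte[] content) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buffer, StandardCharsets.UTF_8)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(content);
            zos.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Reads the name of the first entry back, decoding it as UTF-8.
    static String firstEntryName(byte[] zipBytes) {
        try (ZipInputStream zis =
                 new ZipInputStream(new ByteArrayInputStream(zipBytes), StandardCharsets.UTF_8)) {
            ZipEntry entry = zis.getNextEntry();
            return entry == null ? null : entry.getName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = "P\u00e9r\u00e9quation LES HOPITAUX NEUFS.xls";
        byte[] zip = zipSingleEntry(name, "demo".getBytes(StandardCharsets.UTF_8));
        // Prints whether the name survived the write/read round trip.
        System.out.println(name.equals(firstEntryName(zip)));
    }
}
```

This only shows that the name round-trips within Java. Whether another ZIP tool displays the name correctly still depends on that tool's Unicode support.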

The big problem, anyway, is that many implementations don't understand Unicode file names: the original ZIP format assumes IBM Code Page 437, and UTF-8 names (signalled by the language encoding flag) were only added to the specification later, so older tools may ignore them. See this post for further details.

Adriano Repetti answered Oct 08 '22 05:10


Historically, the ZIP specification did not say which character encoding to use for embedded file names and comments; the original IBM PC character set, commonly referred to as IBM Code Page 437, was supposed to be the only encoding supported. The Jar specification, meanwhile, explicitly requires UTF-8 for encoding and decoding all file names and comments in Jar files. Our java.util.jar and java.util.zip implementation therefore strictly followed the Jar specification and used UTF-8 as the sole encoding for file names and comments stored in Jar/Zip files.

The consequence? A ZIP file created by a "traditional" ZIP tool is not accessible to a java.util.jar/zip based tool, and vice versa, if a file name contains characters that are encoded differently in Cp437 (or in the default platform encoding, which some tools use instead) and in UTF-8.
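The mismatch can be reproduced without any archive at all. The sketch below encodes a name as UTF-8 and then decodes the same bytes as Cp437, which is roughly what a traditional tool does with a Java-written entry name. The class and method names are illustrative, and it assumes the JDK in use ships the IBM437 charset (standard builds do).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class NameMojibakeSketch {

    // Decodes the UTF-8 bytes of a name as if they had been written in another charset.
    static String misread(String name, Charset assumed) {
        return new String(name.getBytes(StandardCharsets.UTF_8), assumed);
    }

    public static void main(String[] args) {
        String name = "P\u00e9r\u00e9quation.xls";
        // Each accented letter is two bytes in UTF-8, so it turns into two Cp437 characters.
        System.out.println(misread(name, Charset.forName("IBM437")));
        // Decoding with the charset that was used for encoding restores the name.
        System.out.println(misread(name, StandardCharsets.UTF_8));
    }
}
```

Each é becomes a pair of box-drawing and symbol characters under Cp437. The "+¬" pairs in the question plausibly come from a further lossy conversion of such a pair when the name is displayed.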

For most Europeans, you're "lucky" :-) in that you only need to avoid a "handful" of characters, such as the umlauts (OK, I'm just kidding), but for Japanese and Chinese, most characters are simply out of luck. This is why bug 4244499 was No. 1 on the Top 25 Java Bugs list for so many years. The bug is no longer on the list :-) as it was finally "fixed" in OpenJDK 7, b57. I still keep a snapshot as a record/kudos for myself :-)

The solution (I would call it a "solution" rather than a "fix") in JDK7 b57 is to introduce a new set of ZipInputStream, ZipOutputStream and ZipFile constructors that take a specific "charset" as a parameter, as shown below.

ZipFile(File, Charset)

ZipInputStream(InputStream, Charset)

ZipOutputStream(OutputStream, Charset)

With these new constructors, applications can now access those non-UTF-8 ZIP files via ZipInputStream or ZipFile objects created with the specific encoding, or create ZIP files encoded in a non-UTF-8 charset via the new ZipOutputStream(os, charset) constructor, if necessary.
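As a rough sketch of the non-UTF-8 case, the code below writes an archive whose entry name is stored in Cp437 and lists it again through ZipFile(File, Charset). The class and method names are illustrative, the temporary file stands in for an archive produced by a traditional tool, and the IBM437 charset is assumed to be available in the JDK in use.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, for illustration only.
final class LegacyZipSketch {

    // Writes one entry, storing its name in the given charset.
    static void writeZip(File target, Charset nameCharset, String entryName) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(new FileOutputStream(target), nameCharset)) {
            zos.putNextEntry(new ZipEntry(entryName));
            zos.write(new byte[] {42});
            zos.closeEntry();
        }
    }

    // Lists entry names, decoding them with the given charset.
    static List<String> listNames(File source, Charset nameCharset) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(source, nameCharset)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    // Writes a temporary archive and reads its entry names back with the same charset.
    static List<String> roundTrip(Charset nameCharset, String entryName) {
        try {
            File tmp = File.createTempFile("legacy", ".zip");
            try {
                writeZip(tmp, nameCharset, entryName);
                return listNames(tmp, nameCharset);
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String name = "P\u00e9r\u00e9quation.xls";
        List<String> names = roundTrip(Charset.forName("IBM437"), name);
        // Prints whether the Cp437-encoded name was decoded back to the original.
        System.out.println(name.equals(names.get(0)));
    }
}
```

The charset passed when reading has to match the one the archive was written with; choosing it is up to the application, since a traditional archive does not record it.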

zip is a stripped-down version of the Jar tool with an "-encoding" option to support non-UTF-8 encodings for entry names and comments. It can serve as a demo of how to use the new APIs (I used it as a unit test). I'm still debating with myself whether it is a good idea to officially introduce "-encoding" into the Jar tool...

dharam answered Oct 08 '22 04:10