
Add non-ASCII file names to zip in Java

Tags: java, encoding, zip

What is the best way to add non-ASCII file names to a zip file using Java, in such a way that the files can be properly read in both Windows and Linux?

Here is one attempt, adapted from https://truezip.dev.java.net/tutorial-6.html#Example, which works in Windows Vista but fails in Ubuntu Hardy. In Hardy the file name is shown as abc-ЖДФ.txt in file-roller.

import java.io.IOException;
import java.io.PrintStream;

import de.schlichtherle.io.File;
import de.schlichtherle.io.FileOutputStream;

public class Main {

    public static void main(final String[] args) throws IOException {

        try {
            PrintStream ps = new PrintStream(new FileOutputStream(
                    "outer.zip/abc-åäö.txt"));
            try {
                ps.println("The characters åäö works here though.");
            } finally {
                ps.close();
            }
        } finally {
            File.umount();
        }
    }
}

Unlike java.util.zip, TrueZIP allows specifying the zip file encoding. Here's another sample, this time explicitly specifying the encoding. None of IBM437, UTF-8, or ISO-8859-1 works in Linux. IBM437 works in Windows.

import java.io.IOException;

import de.schlichtherle.io.FileOutputStream;
import de.schlichtherle.util.zip.ZipEntry;
import de.schlichtherle.util.zip.ZipOutputStream;

public class Main {

    public static void main(final String[] args) throws IOException {

        // Write one archive per candidate file-name encoding.
        for (String encoding : new String[] { "IBM437", "UTF-8", "ISO-8859-1" }) {
            ZipOutputStream zipOutput = new ZipOutputStream(
                    new FileOutputStream(encoding + "-example.zip"), encoding);
            try {
                // An empty entry is enough to test how the name is stored.
                ZipEntry entry = new ZipEntry("abc-åäö.txt");
                zipOutput.putNextEntry(entry);
                zipOutput.closeEntry();
            } finally {
                zipOutput.close();
            }
        }
    }
}
Micke asked Sep 19 '08 23:09



5 Answers

The ZIP format originally specified IBM Code Page 437 as the encoding for file entry names. Many characters used in other languages cannot be represented in that code page.
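To see which names fit into Code Page 437, a charset encoder can be asked directly. This is only a sketch: the class name Cp437Check and its helper are made up for illustration, and IBM437 is an optional charset that standard JDK builds usually ship but the platform does not guarantee.

```java
import java.nio.charset.Charset;

final class Cp437Check {

    /** Returns true if every character of text can be encoded in the named charset. */
    static boolean canEncode(String charsetName, String text) {
        return Charset.forName(charsetName).newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file pure ASCII.
        String swedish = "abc-\u00e5\u00e4\u00f6.txt";   // a-ring, a-umlaut, o-umlaut
        String cyrillic = "abc-\u0416\u0414\u0424.txt";  // three Cyrillic capitals

        // IBM437 is optional, so check before asking for it.
        if (Charset.isSupported("IBM437")) {
            System.out.println("swedish fits:  " + canEncode("IBM437", swedish));
            System.out.println("cyrillic fits: " + canEncode("IBM437", cyrillic));
        } else {
            System.out.println("IBM437 is not available in this runtime");
        }
    }
}
```

Where IBM437 is available, the Swedish name should be reported as encodable and the Cyrillic one as not, which matches the point that the original ZIP encoding only covers a small set of non-ASCII characters.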

The PKWARE specification addresses the problem by adding a flag bit. That is a later addition, though (from 2007; thanks to Cheeso for clearing that up, see comments). If the bit is set, the filename entry has to be encoded in UTF-8. This extension is described in 'APPENDIX D - Language Encoding (EFS)', at the end of the linked document.
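The flag in question is bit 11 of the general purpose bit flag, stored as two little-endian bytes at offset 6 of each local file header. The sketch below reads it from the first entry of an in-memory archive. The class and method names are hypothetical, the archive is written with the Java 7+ ZipOutputStream(OutputStream, Charset) constructor, and the fact that OpenJDK sets the bit for UTF-8 is observed implementation behavior that I am assuming here, not something the spec requires of every tool.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class EfsFlagCheck {

    /** Bit 11 of the general purpose bit flag: names and comments are UTF-8. */
    static final int EFS_BIT = 1 << 11;

    /** Builds an in-memory archive holding one empty entry. */
    static byte[] zipWithEntry(String name, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes, charset)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the flag field of the first local file header (offset 6, little-endian). */
    static boolean firstEntryHasEfsBit(byte[] zip) {
        int flags = (zip[6] & 0xFF) | ((zip[7] & 0xFF) << 8);
        return (flags & EFS_BIT) != 0;
    }

    public static void main(String[] args) {
        // Unicode escapes stand for a-ring, a-umlaut, o-umlaut.
        String name = "abc-\u00e5\u00e4\u00f6.txt";
        System.out.println("UTF-8 archive has EFS bit:      "
                + firstEntryHasEfsBit(zipWithEntry(name, StandardCharsets.UTF_8)));
        System.out.println("ISO-8859-1 archive has EFS bit: "
                + firstEntryHasEfsBit(zipWithEntry(name, StandardCharsets.ISO_8859_1)));
    }
}
```

A reader that honors the bit can decode the name as UTF-8 without guessing. A reader that ignores it falls back to its own default code page, which is one plausible reason the same archive displays differently on different systems.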

Trouble with non-ASCII characters in ZIP entry names is a known bug in Java. See bug #4244499 and its many related bugs.

As a workaround, my colleague URL-encoded the filenames before storing them in the ZIP and decoded them after reading. If you control both the writing and the reading side, that may work for you too.
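A minimal sketch of that workaround, using only java.net.URLEncoder and java.net.URLDecoder. The class EntryNameCodec and its method names are invented for this example and are not taken from the colleague's code.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

final class EntryNameCodec {

    /** Turns a name into pure ASCII so the archive's name encoding no longer matters. */
    static String encode(String name) {
        try {
            return URLEncoder.encode(name, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java runtime is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    /** Restores the original name after reading the entry back. */
    static String decode(String stored) {
        try {
            return URLDecoder.decode(stored, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes stand for a-ring, a-umlaut, o-umlaut.
        String original = "abc-\u00e5\u00e4\u00f6.txt";
        String stored = encode(original);
        System.out.println(stored); // abc-%C3%A5%C3%A4%C3%B6.txt
        System.out.println(decode(stored).equals(original)); // true
    }
}
```

Note that URLEncoder also escapes "/" and turns spaces into "+", so for entries inside directories each path segment would need to be encoded separately. Other tools will show the escaped names as-is, which is why this only helps when you control both ends.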

EDIT: In the bug report, someone suggests using the ZipOutputStream from Apache Ant as a workaround. That implementation lets you specify an encoding.

Mnementh answered Oct 18 '22 04:10


In Zip files, according to the spec owned by PKWare, the encoding of file names and file comments is IBM437. In 2007 PKWare extended the spec to also allow UTF-8. This concerns only the encoding of the file names; it says nothing about the encoding of the files contained within the zip.

I think all tools and libraries (Java and non-Java) support IBM437 (which is a superset of ASCII), and fewer tools and libraries support UTF-8. Some tools and libraries support other code pages. For example, if you zip something using WinRar on a computer running in Shanghai, you will get the Big5 code page. This is not "allowed" by the zip spec, but it happens anyway.

The DotNetZip library for .NET does Unicode, but of course that doesn't help you if you are using Java!

Using the Java built-in support for ZIP, you will always get IBM437. If you want an archive with something other than IBM437, then use a third party library, or create a JAR.

Cheeso answered Oct 18 '22 05:10


Miracles do happen, and Sun/Oracle really did fix the long-standing bug/RFE:

It is now possible to specify the filename encoding when creating the zip file/stream (requires Java 7).
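A short round trip with the Java 7 constructors ZipOutputStream(OutputStream, Charset) and ZipInputStream(InputStream, Charset) could look like the sketch below. The class ZipCharsetRoundTrip and its helpers are hypothetical names, the archive is kept in memory to stay self-contained, and UncheckedIOException means this particular version needs Java 8.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipCharsetRoundTrip {

    /** Writes a one-entry archive whose entry name is encoded with the given charset. */
    static byte[] write(String entryName, Charset charset) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes, charset)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the name of the first entry, decoding it with the given charset. */
    static String firstEntryName(byte[] zipBytes, Charset charset) {
        try (ZipInputStream zip =
                new ZipInputStream(new ByteArrayInputStream(zipBytes), charset)) {
            ZipEntry entry = zip.getNextEntry();
            return entry == null ? null : entry.getName();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Unicode escapes stand for a-ring, a-umlaut, o-umlaut.
        String name = "abc-\u00e5\u00e4\u00f6.txt";
        byte[] archive = write(name, StandardCharsets.UTF_8);
        System.out.println(name.equals(firstEntryName(archive, StandardCharsets.UTF_8)));
    }
}
```

This only shows that Java can write and read the name consistently. Whether a given Windows or Linux archive tool displays it correctly still depends on that tool's handling of UTF-8 names.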

Anton Kraievyi answered Oct 18 '22 04:10


You can still use the Apache Commons implementation of the zip stream: http://commons.apache.org/compress/apidocs/org/apache/commons/compress/archivers/zip/ZipArchiveOutputStream.html#setEncoding%28java.lang.String%29

Calling setEncoding("UTF-8") on your stream should be enough.

Fengtan answered Oct 18 '22 05:10


From a quick look at the TrueZIP manual - they recommend the JAR format:

It uses UTF-8 for file name encoding and comments - unlike ZIP, which only uses IBM437.

This probably means that the API is using the java.util.zip package for its implementation; that documentation states that it is still using a ZIP format from 1996. Unicode support wasn't added to the PKWARE .ZIP File Format Specification until 2006.

McDowell answered Oct 18 '22 03:10