Java zip character encoding

Tags: java, encoding, zip

I'm using the following method to compress a file into a zip file:

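import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;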

public static void doZip(final File inputfis, final File outputfis) throws IOException {

    FileInputStream fis = null;
    FileOutputStream fos = null;

    final CRC32 crc = new CRC32();
    crc.reset();

    try {
        fis = new FileInputStream(inputfis);
        fos = new FileOutputStream(outputfis);
        final ZipOutputStream zos = new ZipOutputStream(fos);
        zos.setLevel(6);
        final ZipEntry ze = new ZipEntry(inputfis.getName());
        zos.putNextEntry(ze);
        final int BUFSIZ = 8192;
        final byte inbuf[] = new byte[BUFSIZ];
        int n;
        while ((n = fis.read(inbuf)) != -1) {
            zos.write(inbuf, 0, n);
            crc.update(inbuf, 0, n); // only the n bytes actually read, not the whole buffer
        }
        ze.setCrc(crc.getValue());
        zos.finish();
        zos.close();
    } catch (final IOException e) {
        throw e;
    } finally {
        if (fis != null) {
            fis.close();
        }
        if (fos != null) {
            fos.close();
        }
    }
}

My problem is that I have flat text files with content such as N°TICKET, and the unzipped result contains some weird characters, for example N° TICKET. Characters such as é and à are not preserved either.

I guess it's due to the character encoding, but I don't know how to set it to ISO-8859-1 in my zip method.

(I'm running on Windows 7, Java 6.)

Majid Laissi asked Oct 08 '12 17:10



1 Answer

You are using streams, which write exactly the bytes they are given. Writers interpret character data and convert it to the corresponding bytes, and Readers do the opposite. Java (at least in version 6) doesn't provide an easy way to mix operations on zipped data with writing characters.
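To make the byte-versus-character distinction concrete, here is a minimal sketch of what probably happens to the text in the question, assuming the input file is stored as UTF-8 and the unzipped file is viewed as ISO-8859-1 (the question does not confirm either). The class and method names are made up for illustration. The bytes never change; only the charset used to interpret them does.

```java
import java.nio.charset.Charset;

public class MojibakeDemo {
    private static final Charset UTF_8 = Charset.forName("UTF-8");
    private static final Charset LATIN_1 = Charset.forName("ISO-8859-1");

    /** Encodes text as UTF-8, then decodes those same bytes as ISO-8859-1. */
    public static String utf8ReadAsLatin1(String text) {
        byte[] raw = text.getBytes(UTF_8);  // the bytes a UTF-8 file would hold
        return new String(raw, LATIN_1);    // what a Latin-1 viewer makes of them
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String original = "N\u00B0TICKET";  // N°TICKET
        String garbled = utf8ReadAsLatin1(original);

        // The degree sign is two bytes in UTF-8 (0xC2 0xB0), and each byte
        // turns into its own Latin-1 character, so one extra character appears.
        System.out.println(original.length() + " chars become " + garbled.length());
        // prints "8 chars become 9"
    }
}
```

A byte stream such as ZipOutputStream copies those bytes faithfully, which is why the archive itself is not at fault: the conversion has to happen explicitly through a Reader and a Writer.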

This way will work though. It is, however, a little clunky.

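import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

File inputFile = new File("utf-8-data.txt");
File outputFile = new File("latin-1-data.zip");

ZipEntry entry = new ZipEntry("latin-1-data.txt");

// FileReader would decode with the platform default charset,
// so name the input encoding explicitly
BufferedReader reader = new BufferedReader(
    new InputStreamReader(new FileInputStream(inputFile), Charset.forName("UTF-8"))
);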

ZipOutputStream zipStream = new ZipOutputStream(new FileOutputStream(outputFile));
BufferedWriter writer = new BufferedWriter(
    new OutputStreamWriter(zipStream, Charset.forName("ISO-8859-1"))
);

zipStream.putNextEntry(entry);

// this is the important part:
// all character data is written via the writer and not the zip output stream
String line = null;
while ((line = reader.readLine()) != null) {
    writer.append(line).append('\n');
}
// the writer is buffered, so flush it to the underlying zip output stream
writer.flush();

zipStream.closeEntry();
zipStream.finish();

reader.close(); 
writer.close();
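
For anyone who wants to check the result, below is a rough, self-contained variant of the same idea that works on in-memory streams and reads the entry back. It is only a sketch: the class name, method names and the entry name "ticket.txt" are invented for the example, and it assumes UTF-8 input and ISO-8859-1 output. It copies characters in blocks instead of using readLine(), so the original line endings are kept. Characters that ISO-8859-1 cannot represent are replaced by the encoder (with '?'), so this is lossy for text outside Latin-1.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipTranscodeSketch {

    /**
     * Decodes the input with sourceCharset and stores it, re-encoded with
     * targetCharset, as a single zip entry. The caller owns and closes both streams.
     */
    public static void transcodeToZip(InputStream in, Charset sourceCharset,
                                      OutputStream out, String entryName,
                                      Charset targetCharset) throws IOException {
        Reader reader = new InputStreamReader(in, sourceCharset);
        ZipOutputStream zip = new ZipOutputStream(out);
        Writer writer = new OutputStreamWriter(zip, targetCharset);

        zip.putNextEntry(new ZipEntry(entryName));
        char[] buf = new char[8192];
        int n;
        while ((n = reader.read(buf)) != -1) {
            writer.write(buf, 0, n);  // characters go through the writer, not the zip stream
        }
        writer.flush();               // push encoded bytes into the entry before closing it
        zip.closeEntry();
        zip.finish();                 // completes the archive without closing 'out'
    }

    /** Returns the raw (uncompressed) bytes of the first entry in the archive. */
    public static byte[] readFirstEntry(byte[] zipBytes) throws IOException {
        ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes));
        try {
            if (zin.getNextEntry() == null) {
                throw new IOException("archive has no entries");
            }
            ByteArrayOutputStream content = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            int n;
            while ((n = zin.read(buf)) != -1) {
                content.write(buf, 0, n);
            }
            return content.toByteArray();
        } finally {
            zin.close();
        }
    }

    public static void main(String[] args) throws IOException {
        Charset utf8 = Charset.forName("UTF-8");
        Charset latin1 = Charset.forName("ISO-8859-1");
        String text = "N\u00B0TICKET \u00E9\u00E0";  // N°TICKET éà

        ByteArrayOutputStream zipped = new ByteArrayOutputStream();
        transcodeToZip(new ByteArrayInputStream(text.getBytes(utf8)), utf8,
                       zipped, "ticket.txt", latin1);

        byte[] stored = readFirstEntry(zipped.toByteArray());
        System.out.println("round trip ok: " + new String(stored, latin1).equals(text));
    }
}
```

With files, the same method can be called with a FileInputStream and a FileOutputStream, closing both in a finally block as in the question's code.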
Dunes answered Sep 24 '22 20:09