
GZIPInputStream and Characterset

I have a text containing Latin, Cyrillic and Chinese characters. I am trying to compress a String (via byte[]) with GZIPOutputStream and decompress it with GZIPInputStream, but I cannot convert all of the characters back to the originals: some appear as ?.

I thought that UTF-16 would do the job.

Any help?

Regards

Here's my code:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressUncompressStrings {

    public static void main(String[] args) throws UnsupportedEncodingException {

        String sTestString="äöüäöü 长安";
        System.out.println(sTestString);
        byte bcompressed[]=compress(sTestString.getBytes("UTF-16"));
        //byte bcompressed[]=compress(sTestString.getBytes());
        String sDecompressed=decompress(bcompressed);
        System.out.println(sDecompressed);
    }
    public static byte[] compress(byte[] content){
        ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
        try{
            GZIPOutputStream gzipOutputStream = new GZIPOutputStream(byteArrayOutputStream);
            gzipOutputStream.write(content);
            gzipOutputStream.close();
        } catch(IOException e){
            throw new RuntimeException(e);
        }
        return byteArrayOutputStream.toByteArray();
    }
    public static String decompress(byte[] contentBytes){
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try{
            GZIPInputStream gzipInputStream = new GZIPInputStream(new ByteArrayInputStream(contentBytes));
            // Copy the decompressed bytes one at a time until end of stream (-1).
            for (int value = gzipInputStream.read(); value != -1; value = gzipInputStream.read()) {
                baos.write(value);
            }
            gzipInputStream.close();
            // Decode with the same charset that main() used for encoding.
            return new String(baos.toByteArray(), "UTF-16");
        } catch(IOException e){
            throw new RuntimeException(e);
        }
    }
}
asked Nov 04 '22 16:11 by mcflysoft


2 Answers

I suspect it's just the console that's having a problem. I tried the above code, and although it didn't print out any of the characters properly, when I tested the round-tripping of the string, it was fine:

System.out.println(sDecompressed.equals(sTestString)); // Prints true

What does that do on your machine?
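A console-independent way to run the same check is to print only ASCII: compare the strings and dump their UTF-16 code units as escapes. The sketch below is not the asker's code; the class RoundTripCheck and its helpers toEscapes and simulateLimitedConsole are hypothetical names introduced for illustration. The last helper uses ISO-8859-1 purely as a stand-in for a console charset that lacks Chinese characters, which is an assumption about where the ? comes from, not a statement about any particular console.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

class RoundTripCheck {

    // Encode the text as UTF-16 and gzip the resulting bytes.
    static byte[] compress(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(text.getBytes(StandardCharsets.UTF_16));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // Gunzip the bytes and decode them with the same charset used in compress().
    static String decompress(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = gzip.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_16);
    }

    // Render every UTF-16 code unit as an ASCII-only escape, so any console can show it.
    static String toEscapes(String text) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            sb.append(String.format("\\u%04X", (int) c));
        }
        return sb.toString();
    }

    // Stand-in for a limited console: characters the charset cannot encode become '?'.
    static String simulateLimitedConsole(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Same test string as in the question, written with escapes so the
        // source file's own encoding cannot interfere.
        String original = "\u00E4\u00F6\u00FC \u957F\u5B89";
        String restored = decompress(compress(original));

        System.out.println("equal:    " + restored.equals(original));
        System.out.println("original: " + toEscapes(original));
        System.out.println("restored: " + toEscapes(restored));
        System.out.println("lossy:    " + toEscapes(simulateLimitedConsole(original)));
    }
}
```

If the first line reports that the strings are equal and the two escape dumps match, the compression round trip is intact and the ? is introduced only when the text is encoded for display.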

answered Nov 12 '22 12:11 by Jon Skeet


Displaying a non-ASCII character on console output is not easy. Assuming you're using Windows (whose command line doesn't support Unicode by default), you can change the active code page with the chcp command. I don't know how that is done through code, but I suggest running the program from the command line.

Setting the code page to 65001 (chcp 65001) tells Windows to use UTF-8 in its console (you can view a discussion here).
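Changing the code page only affects how the console interprets bytes; the Java program still has to emit UTF-8. System.out encodes text with a platform-dependent charset, so one option is to wrap it in a PrintStream with an explicit encoding. This is a minimal sketch, assuming the console has already been switched to UTF-8 (for example with chcp 65001) and uses a font that contains the characters; the class Utf8Console and the helper utf8Stream are hypothetical names, not part of the original code.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class Utf8Console {

    // Wrap any output stream in a PrintStream that always encodes text as UTF-8.
    static PrintStream utf8Stream(OutputStream target) {
        try {
            return new PrintStream(target, true, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        PrintStream out = utf8Stream(System.out);
        // Same test string as in the question, written with escapes so the
        // source file's own encoding cannot interfere.
        out.println("\u00E4\u00F6\u00FC \u957F\u5B89");
    }
}
```

Whether the characters then render correctly still depends on the console and its font, so comparing the strings in code remains the more reliable test.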

I hope this helps.

answered Nov 12 '22 12:11 by Buhake Sindi