
Compressing unicode characters

I am using GZIPOutputStream in my Java program to compress big strings, which are then stored in a database.

I can see that when compressing English text I achieve a compression ratio of 1/4 to 1/10 (depending on the string value). So, for example, if my original English text is 100kb, the compressed text will on average be somewhere around 30kb.

But when I compress Unicode characters, the compressed string actually occupies more bytes than the original string. For example, if my original Unicode string is 100kb, the compressed version comes out to around 200kb.

Unicode string example: "嗨,这是,短信计数测试持续for.Hi这是短"

Can anyone suggest how I can achieve compression for Unicode text as well, and explain why the compressed version is actually bigger than the original?

My compression code in Java:

            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try (GZIPOutputStream zos = new GZIPOutputStream(baos)) {
                // write the UTF-8 bytes; close() finishes the GZIP stream
                zos.write(text.getBytes(StandardCharsets.UTF_8));
            }

            byte[] udpBuffer = baos.toByteArray();
Asked Nov 02 '22 by Arry


1 Answer

Java's GZIPOutputStream uses the Deflate compression algorithm to compress data. Deflate is a combination of LZ77 and Huffman coding. According to Unicode's Compression FAQ:

Q: What's wrong with using standard compression algorithms such as Huffman coding or patent-free variants of LZW?

A: SCSU bridges the gap between an 8-bit based LZW and a 16-bit encoded Unicode text, by removing the extra redundancy that is part of the encoding (sequences of every other byte being the same) and not a redundancy in the content. The output of SCSU should be sent to LZW for block compression where that is desired.

To get the same effect with one of the popular general purpose algorithms, like Huffman or any of the variants of Lempel-Ziv compression, it would have to be retargeted to 16-bit, losing effectiveness due to the larger alphabet size. It's relatively easy to work out the math for the Huffman case to show how many extra bits the compressed text would need just because the alphabet was larger. Similar effects exist for LZW. For a detailed discussion of general text compression issues see the book Text Compression by Bell, Cleary and Witten (Prentice Hall 1990).
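There is also a second, more mundane effect for short strings like your example: every GZIP stream carries a fixed 10-byte header and an 8-byte CRC/length trailer, and DEFLATE finds little to remove in a short, mostly non-repeating CJK string, so the output can easily exceed the input. Here is a minimal sketch (using only standard java.util.zip and the example string from the question) that just prints the sizes:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.zip.GZIPOutputStream;

    public class GzipSizeDemo {
        public static void main(String[] args) throws IOException {
            String text = "嗨,这是,短信计数测试持续for.Hi这是短";
            byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try (GZIPOutputStream zos = new GZIPOutputStream(baos)) {
                zos.write(utf8);
            }

            // Short inputs "expand": the 18 bytes of GZIP header/trailer plus
            // DEFLATE block overhead outweigh anything the compressor removes.
            System.out.println("characters   : " + text.length());
            System.out.println("UTF-8 bytes  : " + utf8.length);
            System.out.println("gzipped bytes: " + baos.size());
        }
    }

On a 100kb input that fixed overhead is negligible; there the poor ratio comes from the larger-alphabet effect described in the FAQ answer above, on top of each CJK character already costing three bytes in UTF-8 before compression even starts.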

I was able to find this set of Java classes for SCSU compression on the Unicode website, which may be useful to you. However, I couldn't find a .jar library that you could easily import into your project, though you can probably package the classes into one yourself.
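As a rough illustration of how the two steps fit together, here is a minimal sketch that encodes the text with SCSU first and only then applies GZIP, as the FAQ suggests. It assumes an SCSU charset is available on the classpath — for example, ICU4J's icu4j-charset module registers one under the name "SCSU", or you could wrap the sample classes mentioned above — so treat the Charset.forName("SCSU") lookup as an assumption rather than something the JDK provides out of the box:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.Charset;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    public class ScsuThenGzip {

        // Assumes an SCSU charset provider (e.g. ICU4J's icu4j-charset) is on
        // the classpath; the JDK itself does not ship an SCSU charset.
        private static final Charset SCSU = Charset.forName("SCSU");

        // SCSU removes the per-character encoding redundancy, then GZIP
        // (DEFLATE) does the block compression, as the FAQ recommends.
        static byte[] compress(String text) throws IOException {
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            try (GZIPOutputStream zos = new GZIPOutputStream(baos)) {
                zos.write(text.getBytes(SCSU));
            }
            return baos.toByteArray();
        }

        static String decompress(byte[] compressed) throws IOException {
            try (GZIPInputStream zis =
                     new GZIPInputStream(new ByteArrayInputStream(compressed))) {
                return new String(zis.readAllBytes(), SCSU);  // Java 9+ readAllBytes
            }
        }
    }

Whatever you store is binary either way, so keep the database column a BLOB/VARBINARY rather than a text type, and for very short strings it may be worth comparing sizes and skipping the GZIP step when it only adds overhead.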

Answered Nov 13 '22 by JonK