Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I easily compress and decompress Strings to/from byte arrays?

I have some strings that are roughly 10K characters each. There is plenty of repetition in them. They are serialized JSON objects. I'd like to easily compress them into a byte array, and uncompress them from a byte array.

How can I most easily do this? I'm looking for methods so I can do the following:

String original = "....long string here with 10K characters...";
byte[] compressed = StringCompressor.compress(original);
String decompressed = StringCompressor.decompress(compressed);
assert(original.equals(decompressed);
like image 579
Steve McLeod Avatar asked May 13 '12 14:05

Steve McLeod


2 Answers

You can try

enum StringCompressor {
    ;
    public static byte[] compress(String text) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try {
            OutputStream out = new DeflaterOutputStream(baos);
            out.write(text.getBytes("UTF-8"));
            out.close();
        } catch (IOException e) {
            throw new AssertionError(e);
        }
        return baos.toByteArray();
    }

    public static String decompress(byte[] bytes) {
        InputStream in = new InflaterInputStream(new ByteArrayInputStream(bytes));
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try {
            byte[] buffer = new byte[8192];
            int len;
            while((len = in.read(buffer))>0)
                baos.write(buffer, 0, len);
            return new String(baos.toByteArray(), "UTF-8");
        } catch (IOException e) {
            throw new AssertionError(e);
        }
    }
}
like image 73
Peter Lawrey Avatar answered Nov 15 '22 23:11

Peter Lawrey


I made a library to solve the problem of compressing generic Strings (expecially short ones). It tries to compress the String using various algorithms (plain utf-8, 5bit encoding for latin letters, huffman encoding, gzip for long Strings) and chooses the one with the shortest result (in the worst case, it will choose the utf-8 encoding, so that you never risk to lose space).

I hope it may be useful, here's the link https://github.com/lithedream/lithestring

EDIT: I realized that your Strings are always "long", my library defaults on gzip for those sizes, I fear I cannot do better for you.

like image 22
lithedream Avatar answered Nov 15 '22 22:11

lithedream