Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can you reversibly compress a bit of text into fewer ASCII characters?

Tags:

algorithm

ruby

I want to take an arbitrary string of ASCII text, like "Hello world", and compress it into a version with fewer characters (as few as possible), but in a way that it can be decompressed. The compressed version should be composed only of ascii characters. Is there a way to accomplish this, especially in Ruby?

like image 348
dan Avatar asked Jan 27 '11 03:01

dan


3 Answers

If you know that only ASCII characters will be used, that is the 7 low order bits of every byte. With bit manipulation, you can mash every 8 bytes into 7 (12.5% savings). If you can get it into a smaller range (64 valid chars only), you can drop another byte.

However, because you want the compressed form to ALSO contain only ASCII characters, that loses you one byte - which goes back to square one unless your input can be restricted to 64-chars (e.g. lossy compression substituting some chars with others, storing only in lower case etc).

If your strings are not large (>1k), then there is minimal savings to be had using gzip/bzip2 etc because of the size of the headers. If you had a predefined dictionary to use as a Huffman table, you may get some compression but in other cases, you can get bloat against the original text.

Prior discussion on SO An efficient compression algorithm for short text strings

like image 95
RichardTheKiwi Avatar answered Nov 15 '22 05:11

RichardTheKiwi


There are many good text compression algorithms like Huffman encoding or LZW that are good at compressing text strings into bitstrings with many fewer bits than the standard ASCII encoding. Once you have such an encoding, you can always split the bitstring up into groups of seven bits to pack them into standard ASCII characters. I'm sure that there are libraries out there that do this, but I'm not much of a Ruby coder and don't know of any off the top of my head.

like image 31
templatetypedef Avatar answered Nov 15 '22 04:11

templatetypedef


The simplest way to do this would be to compress it using a standard algorithm, then base64 encode the result. This isn't likely to help on a string as short as 'Hello world', though - at that size, there's very little you could do to decrease the size, unless all your strings have a similar restricted character set, or patterns that something like huffman encoding can take advantage of.

like image 38
Nick Johnson Avatar answered Nov 15 '22 04:11

Nick Johnson