I want to take an arbitrary string of ASCII text, like "Hello world", and compress it into a version with fewer characters (as few as possible), but in a way that it can be decompressed. The compressed version should be composed only of ascii characters. Is there a way to accomplish this, especially in Ruby?
If you know that only ASCII characters will be used, that is the 7 low order bits of every byte. With bit manipulation, you can mash every 8 bytes into 7 (12.5% savings). If you can get it into a smaller range (64 valid chars only), you can drop another byte.
However, because you want the compressed form to ALSO contain only ASCII characters, that loses you one byte - which goes back to square one unless your input can be restricted to 64-chars (e.g. lossy compression substituting some chars with others, storing only in lower case etc).
If your strings are not large (>1k), then there is minimal savings to be had using gzip/bzip2 etc because of the size of the headers. If you had a predefined dictionary to use as a Huffman table, you may get some compression but in other cases, you can get bloat against the original text.
Prior discussion on SO An efficient compression algorithm for short text strings
There are many good text compression algorithms like Huffman encoding or LZW that are good at compressing text strings into bitstrings with many fewer bits than the standard ASCII encoding. Once you have such an encoding, you can always split the bitstring up into groups of seven bits to pack them into standard ASCII characters. I'm sure that there are libraries out there that do this, but I'm not much of a Ruby coder and don't know of any off the top of my head.
The simplest way to do this would be to compress it using a standard algorithm, then base64 encode the result. This isn't likely to help on a string as short as 'Hello world', though - at that size, there's very little you could do to decrease the size, unless all your strings have a similar restricted character set, or patterns that something like huffman encoding can take advantage of.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With