Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Compress data into smallest amount of text?

Tags:

python

I have data (mostly a series of numpy arrays) that I want to convert into text that can be copied/pasted/emailed etc.. I created the following formula which does this.

def convert_to_ascii85(x):
    p = pickle.dumps(x)
    p = zlib.compress(p)
    return b64.b85encode(p)

My issue is that the string it produces is longer than it needs to be because it only uses a subset of letters, numbers, and symbols. If I was able to encode using unicode, I feel like it could produce a shorter string because it would have access to more characters. Is there a way to do this?

Edit to clarify: My goal is NOT the smallest amount of data/information/bytes. My goal is the smallest number of characters. The reason is that the channel I'm sending the data through is capped by characters (100k to be precise) instead of bytes (strange, I know). I've already tested that I can send 100k unicode characters, I just don't know how to convert my bytes into unicode.

like image 678
Allen Avatar asked Jun 23 '19 22:06

Allen


1 Answers

UPDATE: I just saw that you changed your question to clarify that you care about character length rather than byte length. This is a really strange constraint. I've never heard of it before. I don't know quite what to make of it. But if that's your need, and you want predicable blocking behavior, then I'm thinking that your problem is pretty simple. Just pick the compatible character encoding that can represent the most possible unique characters, and then map blocks of your binary across that character set such that each block is the longest it can be and yet consists of fewer bits than the number of representable characters in your character encoding. Each such block then becomes a single character. Since this constraint is kinda strange, I don't know if there are libraries out there that do this.

UPDATE2: Being curious about the above myself, I just Google'd and found this: https://qntm.org/unicodings. If your tools and communication channels can deal with UFT-16 or UTF-32, then you might be onto something in seeking to use that. If so, I hope this article opens up to the solution you're looking for. I think this article is still optimizing for byte length vs character length, so maybe this won't provide the optimal solution, but it can only help (32 potential bits per char rather than 7 or 8). I couldn't find anything seeking to optimize on character count alone, but maybe a UTF-32 scheme like Base65536 is your answer. Check out https://github.com/qntm/base65536 .

If it is byte length that you cared about, and you want to stick to using what is usually meant by "printable characters" or "plain printable text", then here's my original answer...

There are options for getting better "readable text" encoding space efficiency from an encoding other than Base85. There's also a case to be made for giving up more space efficiency and going with Base64. Here I'll make the case for using both Base85 and Base64. If you can use Base85, you only take a 25% hit on the inflation of your binary, and you save a whole lot of headaches in doing so.

Base85 is pretty close to the best you're going to do if you seek to encode arbitrary binary to "plain text", and it is the BEST you can do if you want a "plain text" encoding that you can logically break into meaningful, predictable chunks. You can in theory use a character set that uses printable characters in the high-ASCII range, but experience has shown that many tools and communication channels don't deal well with high-ASCII if they can't handle straight binary. You don't get much in additional space savings for trying to use the extra 5 bits per 4 binary bytes or so that could potentially be used by using 256-bit high-ASCII vs 128-bit ASCII.

For any BaseXX encoding, the algorithm takes incoming binary bits and encodes them as tightly as it can using the XX printable characters it has at its disposal. Base85 will be more compact than Base64 because it uses more of the printable characters (85) than Base64 does (64 characters).

There are 95 printable characters in standard ASCII. So there is a Base95 that is the most compact encoding possible using all the printable characters. But to try to use all 95 bits is messy, because it leads to uneven blockings of the incoming bits. Each 4 binary bytes is mapped to some fractional number of characters less than 5.

It turns out that 85 characters is what you need to encode 4 bytes as exactly 5 printable characters. Many will choose to add about 10% of extra length to attain the fact that every 4 encoded bytes leads to exactly 5 ASCII characters. This is only a 25% inflation in size of the binary. That's not bad at all for all of the headaches it saves. Hence, the motivation behind Base85.

Base64 is used to produce longer, but even less problematic encodings. Characters that cause trouble for various text documents, like HTML, XML, JSON, etc, are not used. In this way, Base64 is useful in almost any context without any escaping. You have to be more careful with Base85, as it doesn't throw out any of these problematic characters. For encoding/decoding efficiency, it uses the range 33 (“!”) through 117 (‘u’), starting at 33 rather than 32 just to avoid the often problematic space character. The characters above 'u' that it doesn't use are nothing special.

So that's the story pretty much on binary -> ASCII encoding side. The other question is what you can do to reduce the size of what you're representing prior to the stage of encoding its binary representation to ASCII. You're choosing to use pickle.dumps() and zlib.compress(). If those are your best choices are left for another discussion...

like image 102
CryptoFool Avatar answered Oct 02 '22 10:10

CryptoFool