Store arbitrary binary data on a system accepting only valid UTF8

Question

I have arbitrary binary data. I need to store it in a system that expects valid UTF-8. It will never be interpreted as text; I just need to put it in there and be able to retrieve it and reconstitute my binary data.

Obviously base64 would work, but I can't have that much inflation.

How can I easily achieve this in Python 2.7?

Martijn Pieters · Accepted Answer

You'll have to express your data using just ASCII characters. Using Base64 is the most efficient method (available in the Python standard library) to do this, in terms of making binary data fit in printable text that is also UTF-8 safe. Sure, it requires 33% more space to express the same data, but other methods take more additional space.
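To make the overhead figures concrete, here is a small sketch comparing Base64 with hex encoding, another ASCII-only option from the standard library. The sample data and variable names are only illustrative.

```python
import base64
import binascii
import os

# Arbitrary binary sample; the length is a multiple of 3 so Base64 needs no padding.
data = os.urandom(3000)

hex_encoded = binascii.hexlify(data)   # 2 characters per input byte
b64_encoded = base64.b64encode(data)   # 4 characters per 3 input bytes

# Relative size increase over the raw data.
hex_overhead = len(hex_encoded) / float(len(data)) - 1
b64_overhead = len(b64_encoded) / float(len(data)) - 1
```

For this input, hex doubles the size while Base64 adds one third.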

You can combine this with compression to limit how much space this is going to take, but make the compression optional (mark the data) and only actually use it if the data is going to be smaller.

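import zlib
import base64

def pack_utf8_safe(data):
    """Return an ASCII-only (and so UTF-8 safe) encoding of the bytes in data."""
    is_compressed = False
    compressed = zlib.compress(data)
    # Only keep the compressed form if it still wins after adding the 1-byte marker.
    if len(compressed) < (len(data) - 1):
        data = compressed
        is_compressed = True
    base64_encoded = base64.b64encode(data)
    if is_compressed:
        # b'.' is the same as '.' on Python 2.7 and also works on Python 3.
        base64_encoded = b'.' + base64_encoded
    return base64_encoded

def unpack_utf8_safe(base64_encoded):
    """Reverse pack_utf8_safe(), decompressing if the marker is present."""
    decompress = False
    if base64_encoded.startswith(b'.'):
        base64_encoded = base64_encoded[1:]
        decompress = True
    data = base64.b64decode(base64_encoded)
    if decompress:
        data = zlib.decompress(data)
    return data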

The '.' character is not part of the Base64 alphabet, so I used it here to mark compressed data.

You could further shave off the 1 or 2 = padding characters from the end of the Base64 encoded data; these can then be re-added when decoding (add '=' * (-len(encoded) % 4) to the end), but I'm not sure that's worth the bother.
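A minimal sketch of that padding trick, in case it is worth the bother for your data. The helper names b64encode_nopad and b64decode_nopad are made up for this example. The number of = characters to restore is the amount needed to bring the length back up to a multiple of 4.

```python
import base64

def b64encode_nopad(data):
    # Base64-encode, then drop the trailing '=' padding (0, 1 or 2 characters).
    return base64.b64encode(data).rstrip(b'=')

def b64decode_nopad(encoded):
    # Restore the padding so the length is a multiple of 4 again, then decode.
    padding = b'=' * (-len(encoded) % 4)
    return base64.b64decode(encoded + padding)
```

This saves at most 2 characters per stored value, so it only matters when you store many small values.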

You can achieve further savings by switching to the Base85 encoding, an ASCII-safe encoding that turns every 4 bytes of binary data into 5 characters, so a 25% overhead. For Python 2.7 this is only available in an external library (Python 3.4 added it to the base64 library). You can use the python-mom project in 2.7:

from mom.codec import base85

and replace all base64.b64encode() and base64.b64decode() calls with base85.b85encode() and base85.b85decode() calls instead.
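For comparison, here is a sketch of the same idea using the standard library on Python 3.4 or newer, where base64.b85encode() and base64.b85decode() are available. This is not the python-mom API, and I have not checked whether the two libraries produce identical output, so don't mix them for encoding and decoding.

```python
import base64  # b85encode / b85decode need Python 3.4 or newer

# 1024 bytes covering every possible byte value.
data = bytes(bytearray(range(256))) * 4

encoded = base64.b85encode(data)   # 5 ASCII characters per 4 input bytes
decoded = base64.b85decode(encoded)

# The result is plain ASCII, so it is also valid UTF-8.
as_text = encoded.decode('ascii')
```

Here 1024 bytes become 1280 characters, against 1368 for Base64.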

If you are 100% certain nothing along the path is going to treat your data as text (possibly altering line separators, or interpreting and altering other control codes), you could also use the Base128 encoding, reducing the overhead to a 14.3% increase (8 characters for every 7 bytes). I cannot, however, recommend a pip-installable Python module for you; there is a GitHub hosted module but I have not tested it.
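To show what such an encoding involves, here is a rough, hand-rolled sketch of 7-bit packing. It is not the GitHub module mentioned above and does not follow any standard; b128encode and b128decode are names invented for this example. Every output byte is in the range 0-127, so the result is valid UTF-8, but it includes NUL and other control characters, which is why the caveat about text processing applies.

```python
def b128encode(data):
    """Repack 8-bit bytes into 7-bit values: 8 output bytes per 7 input bytes."""
    out = bytearray()
    acc = 0     # bit accumulator
    nbits = 0   # number of pending bits held in acc
    for byte in bytearray(data):
        acc = (acc << 8) | byte
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            out.append((acc >> nbits) & 0x7F)
        acc &= (1 << nbits) - 1   # keep only the bits not yet emitted
    if nbits:
        # Flush the leftover bits, padded with zero bits on the right.
        out.append((acc << (7 - nbits)) & 0x7F)
    return bytes(out)

def b128decode(encoded):
    """Reverse b128encode(); trailing zero-padding bits are discarded."""
    out = bytearray()
    acc = 0
    nbits = 0
    for value in bytearray(encoded):
        acc = (acc << 7) | value
        nbits += 7
        if nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
        acc &= (1 << nbits) - 1
    return bytes(out)
```

The same '.' marker approach would not work here, since every ASCII character can appear in the output; a compression flag would need to go in a leading byte instead.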

Tags:

python

unicode

utf-8

python-2.7

N. McA.

1 Answers

Martijn Pieters
