I have arbitrary binary data. I need to store it in a system that expects valid UTF8. It will never be interpreted as text, I just need to put it in there and be able to retrieve it and reconstitute my binary data.
Obviously base64 would work, but I can't have that much inflation.
How can I easily achieve this in python 2.7?
You'll have to express your data using just ASCII characters. Using Base64 is the most efficient method (available in the Python standard library) to do this, in terms of making binary data fit in printable text that is also UTF-8 safe. Sure, it requires 33% more space to express the same data, but other methods take more additional space.
You can combine this with compression to limit how much space this is going to take, but make the compression optional (mark the data) and only actually use it if the data is going to be smaller.
import zlib
import base64
def pack_utf8_safe(data):
is_compressed = False
compressed = zlib.compress(data)
if len(compressed) < (len(data) - 1):
data = compressed
is_compressed = True
base64_encoded = base64.b64encode(data)
if is_compressed:
base64_encoded = '.' + base64_encoded
return base64_encoded
def unpack_utf8_safe(base64_encoded):
decompress = False
if base64_encoded.startswith('.'):
base64_encoded = base64_encoded[1:]
decompress = True
data = base64.b64decode(base64_encoded)
if decompress:
data = zlib.decompress(data)
return data
The '.'
character is not part of the Base64 alphabet, so I used it here to mark compressed data.
You could further shave of the 1 or 2 =
padding characters from the end of the Base64 encoded data; these can then be re-added when decoding (add '=' * (-len(encoded) * 4)
to the end), but I'm not sure that's worth the bother.
You can achieve further savings by switching to the Base85 encoding, a 4-to-5 ratio ASCII-safe encoding for binary data, so a 20% overhead. For Python 2.7 this is only available in an external library (Python 3.4 added it to the base64
library). You can use python-mom
project in 2.7:
from mom.codec import base85
and replace all base64.b64encode()
and base64.b64decode()
calls with base85.b85encode()
and base85.b85decode()
calls instead.
If you are 100% certain nothing along the path is going to treat your data as text (possibly altering line separators, or interpret and alter other control codes), you could also use the Base128 encoding, reducing the overhead to a 14.3% increase (8 characters for every 7 bytes). I cannot, however, recommend a pip-installable Python module for you; there is a GitHub hosted module but I have not tested it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With