 

Python zlib output, how to recover out of mysql utf-8 table?

In Python, I compressed a string using zlib and then inserted it into a MySQL column of type blob, using the utf-8 encoding. The string comes back as utf-8, but it's not clear how to get it back into a format that I can decompress. Here is some pseudo-output:

valueInserted = zlib.compress('a') = 'x\x9cK\x04\x00\x00b\x00b'

valueFromSqlColumn = u'x\x9cK\x04\x00\x00b\x00b'

zlib.decompress(valueFromSqlColumn) raises UnicodeEncodeError: 'ascii' codec can't encode character u'\x9c' in position 1: ordinal not in range(128)

If I do this, it inserts some extra characters:

valueFromSqlColumn.encode('utf-8') = 'x\xc2\x9cK\x04\x00\x00b\x00b'

Any suggestions?

Heinrich Schmetterling asked Oct 24 '09 20:10



2 Answers

Unicode is designed to be compatible with latin-1: its first 256 code points are the latin-1 characters, in the same order. So try:

>>> import zlib
>>> u = zlib.compress("test").decode('latin1')
>>> u
u'x\x9c+I-.\x01\x00\x04]\x01\xc1'

And then

>>> zlib.decompress(u.encode('latin1'))
'test'
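To check that this round trip is lossless for every possible byte value, and not only for the bytes in this example, here is a minimal sketch. It is written to run on both Python 2 and Python 3; the helper names `to_text` and `to_bytes` are made up for illustration and do not come from any library.

```python
import zlib

def to_text(raw):
    """Map each byte 0..255 to the code point with the same number."""
    return raw.decode("latin-1")

def to_bytes(text):
    """Inverse of to_text; raises UnicodeEncodeError for code points above 255."""
    return text.encode("latin-1")

# Every possible byte value exactly once.
all_bytes = bytes(bytearray(range(256)))
assert to_bytes(to_text(all_bytes)) == all_bytes

# The same round trip applied to real zlib output.
compressed = zlib.compress(b"test")
restored = zlib.decompress(to_bytes(to_text(compressed)))
```

If the text contains any character above U+00FF, `to_bytes` fails loudly instead of silently corrupting the data, which is usually what you want here.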

EDIT: Fixed typo, latin-1 isn't designed to be compatible with unicode, it's the other way around.

csl answered Oct 06 '22 00:10



You have a unicode object that is really holding bytes. That's unfortunate, since unicode strings should really only be holding text, right?

Anyway, what we want to do is construct a byte string, which is a str in Python 2.x. The printed string you gave, u'x\x9cK\x04\x00\x00b\x00b', shows that the byte values are stored as unicode codepoints. We can get the numerical value of a codepoint with the function ord(..), and then get the byte string representation of that number with the function chr(..). Let's try this:

>>> ord(u"A")
65
>>> chr(_)
'A'

So we can decode the string ourselves:

>>> udata = u'x\x9cK\x04\x00\x00b\x00b'
>>> bdata = "".join(chr(ord(uc)) for uc in udata)
>>> bdata
'x\x9cK\x04\x00\x00b\x00b'

(Wait, what does the above code do? The join stuff? What we first do is create a list of the code points in the string:

>>> [ord(uc) for uc in udata]
[120, 156, 75, 4, 0, 0, 98, 0, 98]

Then we interpret the numbers as bytes, converting them individually:

>>> [chr(ord(uc)) for uc in udata]
['x', '\x9c', 'K', '\x04', '\x00', '\x00', 'b', '\x00', 'b']

Finally, we join them with "" as the separator, using "".join(list-of-strings).

End of Wait..)
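The chr(ord(..)) trick relies on Python 2, where chr() returns a one-byte str. As a hedged alternative, the same per-character conversion can be sketched with bytearray, which accepts an iterable of integers and should behave the same on Python 2 and Python 3:

```python
import zlib

udata = u'x\x9cK\x04\x00\x00b\x00b'

# ord() gives each code point; bytearray() stores each one as a single byte
# and raises ValueError if any value falls outside 0..255.
bdata = bytes(bytearray(ord(uc) for uc in udata))

# The rebuilt byte string is a valid zlib stream again.
original = zlib.decompress(bdata)
```

Here `original` is the single letter that was compressed in the question.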

However, csl cleverly notes that the Latin-1 encoding has the property that a character's byte value in Latin-1 is equal to the character's codepoint in Unicode, provided of course that the character is inside the range 0 to 255 where Latin-1 is defined. This means we can do the byte conversion directly with Latin-1:

>>> udata = u'x\x9cK\x04\x00\x00b\x00b'
>>> udata.encode("latin-1")
'x\x9cK\x04\x00\x00b\x00b'

Which, as you can see, gives the same result.
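This also accounts for the extra characters seen in the question: UTF-8 encodes every codepoint from 128 upwards as two or more bytes, so U+009C turns into \xc2\x9c, while Latin-1 emits a single byte. A short sketch comparing the two encodings on the value from the question (it should run on both Python 2 and Python 3):

```python
import zlib

udata = u'x\x9cK\x04\x00\x00b\x00b'

as_utf8 = udata.encode("utf-8")      # U+009C becomes the two bytes \xc2\x9c
as_latin1 = udata.encode("latin-1")  # U+009C becomes the single byte \x9c

# Exactly one character in udata is >= 128, so UTF-8 adds one byte.
extra = len(as_utf8) - len(as_latin1)

# Only the Latin-1 bytes form a valid zlib stream.
original = zlib.decompress(as_latin1)
try:
    zlib.decompress(as_utf8)
    utf8_ok = True
except zlib.error:
    utf8_ok = False
```

After this runs, `utf8_ok` is False because the extra \xc2 byte corrupts the zlib header.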

u0b34a0f6ae answered Oct 06 '22 00:10
