Let me introduce the problem first.
I received some data via POST/GET requests. The data were UTF-8 encoded byte strings. I didn't realize that at the time and converted them simply by calling str().
Now I have a database full of "nonsense data" and can't find a way back.
unicode_str - this is the string I should obtain
encoded_str - this is the string I got with POST/GET requests - initial data
bad_str - the data I have in the Database at the moment and I need to get unicode from.
So I know how the conversion went in the forward direction:
unicode_str =(encode)=> encoded_str =(str)=> bad_str
But I couldn't come up with a way to go back:
bad_str =(???)=> encoded_str =(decode)=> unicode_str
In [1]: unicode_str = 'Příliš žluťoučký kůň úpěl ďábelské ódy'
In [2]: unicode_str
Out[2]: 'Příliš žluťoučký kůň úpěl ďábelské ódy'
In [3]: encoded_str = unicode_str.encode("UTF-8")
In [4]: encoded_str
Out[4]: b'P\xc5\x99\xc3\xadli\xc5\xa1 \xc5\xbelu\xc5\xa5ou\xc4\x8dk\xc3\xbd k\xc5\xaf\xc5\x88 \xc3\xbap\xc4\x9bl \xc4\x8f\xc3\xa1belsk\xc3\xa9 \xc3\xb3dy'
In [5]: bad_str = str(encoded_str)
In [6]: bad_str
Out[6]: "b'P\\xc5\\x99\\xc3\\xadli\\xc5\\xa1 \\xc5\\xbelu\\xc5\\xa5ou\\xc4\\x8dk\\xc3\\xbd k\\xc5\\xaf\\xc5\\x88 \\xc3\\xbap\\xc4\\x9bl \\xc4\\x8f\\xc3\\xa1belsk\\xc3\\xa9 \\xc3\\xb3dy'"
In [7]: new_encoded_str = some_magical_function_here(bad_str) ???
Method #2: Using join() + format() + ord(). Here, format() substitutes each character into a Unicode escape string, and ord() converts the character to its code point.
The String Type
Since Python 3.0, the language's str type contains Unicode characters, meaning any string created using "unicode rocks!", 'unicode rocks!', or the triple-quoted string syntax is stored as Unicode.
To convert byte strings to Unicode, use the bytes.decode() method, and use str.encode() to convert Unicode to a byte string. Both methods allow the character set encoding to be specified as an optional parameter if something other than UTF-8 is required.
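A minimal sketch of that round trip, reusing a word from the question. The variable names are only illustrative. It also contrasts str() with and without an encoding argument, since the call without one is what produces a string like bad_str.

```python
text = "žluťoučký"

# str -> bytes: encode with an explicit character set
data = text.encode("utf-8")

# bytes -> str: decode with the same character set
restored = data.decode("utf-8")

# str() with an encoding argument also decodes the bytes
also_restored = str(data, "utf-8")

# str() WITHOUT an encoding returns the repr of the bytes object,
# i.e. text that starts with b' and contains literal \x escapes
repr_text = str(data)

print(restored == text)
print(repr_text)
```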
In Python, the built-in functions chr() and ord() are used to convert between Unicode code points and characters. A character can also be represented by writing a hexadecimal Unicode code point with \x, \u, or \U in a string literal.
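As a short illustration of chr(), ord(), and the three escape forms (the variable names are made up for this example):

```python
ch = "ž"

# Character -> code point and back
code_point = ord(ch)          # U+017E
same_char = chr(code_point)

# The same character written as an escape in a string literal
from_escape = "\u017e"        # \u takes exactly 4 hex digits

# \x takes 2 hex digits, \U takes 8
e_acute = "\xe9"              # U+00E9
emoji = "\U0001F600"          # U+1F600

print(hex(code_point), same_char == ch, from_escape == ch)
```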
You converted a bytes object to a string with str(), which gives you just the textual representation (repr) of the bytes object. You can obtain the original bytes object by using ast.literal_eval() (credits to Mark Tolonen for the suggestion), then a simple decode() will do the job.
>>> import ast
>>> ast.literal_eval(bad_str).decode('utf-8')
'Příliš žluťoučký kůň úpěl ďábelské ódy'
Since you were the one who generated the strings, using eval() would be safe, but why not be safer?
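For repairing a whole database, the same idea could be wrapped in a small helper that leaves unexpected values alone. This is only a sketch: recover_text is a hypothetical name, and returning None for anything that is not a bytes repr is an assumption about how you would want to handle rows that were stored correctly.

```python
import ast


def recover_text(bad_str, encoding="utf-8"):
    """Return the original text from a str(bytes) value, or None.

    Hypothetical helper: assumes bad_str looks like "b'...'" and that
    the bytes inside were encoded with `encoding`.
    """
    try:
        value = ast.literal_eval(bad_str)
    except (ValueError, SyntaxError):
        return None  # not a Python literal at all
    if not isinstance(value, bytes):
        return None  # a literal, but not a bytes literal
    try:
        return value.decode(encoding)
    except UnicodeDecodeError:
        return None  # bytes were not valid in this encoding


original = "Příliš žluťoučký kůň"
bad = str(original.encode("utf-8"))
print(recover_text(bad) == original)
print(recover_text("already fine text"))
```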
Please do not use eval, instead:
import codecs
s = 'žluťoučký'
x = str(s.encode('utf-8'))
# strip quotes
x = x[2:-1]
# unescape
x = codecs.escape_decode(x)[0].decode('utf-8')
# profit
x == s
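Note that codecs.escape_decode() is not listed in the codecs module documentation. If you prefer to stay with documented codecs, one possible alternative is the unicode_escape codec followed by a latin-1 round trip. In this sketch, unescape_bytes_repr is a hypothetical name, and the quote check assumes the input is exactly what str() produced from a bytes object.

```python
def unescape_bytes_repr(bad_str, encoding="utf-8"):
    """Undo str(bytes_obj) without eval; hypothetical helper."""
    # Expect b'...' or b"..." with matching quotes
    if (len(bad_str) < 3 or bad_str[0] != "b"
            or bad_str[1] not in "'\"" or bad_str[-1] != bad_str[1]):
        raise ValueError("not a bytes repr")
    inner = bad_str[2:-1]
    # The repr of bytes is pure ASCII. unicode_escape turns each \xNN
    # into the character U+00NN, and latin-1 maps U+00NN back to byte NN.
    raw = inner.encode("ascii").decode("unicode_escape").encode("latin-1")
    return raw.decode(encoding)


s = "žluťoučký"
print(unescape_bytes_repr(str(s.encode("utf-8"))) == s)
```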