
Converting Python 3 String of Bytes of Unicode - `str(utf8_encoded_str)` back to unicode

Let me introduce the problem first.

I received some data via POST/GET requests. The data was a UTF-8 encoded byte string, but I didn't realize that and converted it with str(). Now I have a database full of "nonsense data" and can't find a way to convert it back.

Example code:

unicode_str - this is the string I should obtain

encoded_str - this is the string I got with POST/GET requests - initial data

bad_str - the data I have in the Database at the moment and I need to get unicode from.

So I know how to convert in the forward direction: unicode_str =(encode)=> encoded_str =(str)=> bad_str

But I couldn't come up with a way back: bad_str =(???)=> encoded_str =(decode)=> unicode_str

In [1]: unicode_str = 'Příliš žluťoučký kůň úpěl ďábelské ódy'

In [2]: unicode_str
Out[2]: 'Příliš žluťoučký kůň úpěl ďábelské ódy'

In [3]: encoded_str = unicode_str.encode("UTF-8")

In [4]: encoded_str
Out[4]: b'P\xc5\x99\xc3\xadli\xc5\xa1 \xc5\xbelu\xc5\xa5ou\xc4\x8dk\xc3\xbd k\xc5\xaf\xc5\x88 \xc3\xbap\xc4\x9bl \xc4\x8f\xc3\xa1belsk\xc3\xa9 \xc3\xb3dy'

In [5]: bad_str = str(encoded_str)

In [6]: bad_str
Out[6]: "b'P\\xc5\\x99\\xc3\\xadli\\xc5\\xa1 \\xc5\\xbelu\\xc5\\xa5ou\\xc4\\x8dk\\xc3\\xbd k\\xc5\\xaf\\xc5\\x88 \\xc3\\xbap\\xc4\\x9bl \\xc4\\x8f\\xc3\\xa1belsk\\xc3\\xa9 \\xc3\\xb3dy'"

In [7]: new_encoded_str = some_magical_function_here(bad_str) ???
asked Nov 16 '17 12:11 by darkless



2 Answers

You converted a bytes object to a string with str(), which only gives you the textual representation (the repr) of the bytes object. You can recover the original bytes object by using ast.literal_eval() (credits to Mark Tolonen for the suggestion), and then a simple decode() will do the job.

>>> import ast
>>> ast.literal_eval(bad_str).decode('utf-8')
'Příliš žluťoučký kůň úpěl ďábelské ódy'

Since you were the one who generated the strings, eval() would also work, but ast.literal_eval() only evaluates Python literals and never runs arbitrary code, so why not be safer?
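
If you need to apply this to many stored values, the two steps can be wrapped in a small helper. This is only a sketch: the name recover_text and the type check are my own additions, and it assumes every stored value is exactly what str() produced from UTF-8 bytes.

```python
import ast


def recover_text(bad_str, encoding="utf-8"):
    """Undo str(bytes_obj): parse the repr back into bytes, then decode.

    recover_text is a hypothetical helper name, not a library function.
    """
    value = ast.literal_eval(bad_str)
    # literal_eval accepts any literal (numbers, lists, ...), so make sure
    # we really got a bytes object before decoding it.
    if not isinstance(value, bytes):
        raise ValueError("input is not the repr of a bytes object")
    return value.decode(encoding)


original = 'Příliš žluťoučký kůň úpěl ďábelské ódy'
bad_str = str(original.encode("utf-8"))
print(recover_text(bad_str) == original)
```

Values that are not a bytes literal at all make ast.literal_eval() raise ValueError or SyntaxError, so you may want to catch those and skip such rows.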

answered Oct 20 '22 13:10 by Reti43


Please do not use eval, instead:

import codecs
s = 'žluťoučký'
x = str(s.encode('utf-8'))

# strip quotes
x = x[2:-1]

# unescape
x = codecs.escape_decode(x)[0].decode('utf-8')

# profit
x == s
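
The same idea can be made a little more defensive for cleaning up a whole table. The sketch below is an assumption-laden variant, not part of the original answer: unescape_bytes_repr is a made-up name, it leaves any value that does not look like b'...' or b"..." untouched, and it relies on codecs.escape_decode, which is an undocumented CPython function.

```python
import codecs


def unescape_bytes_repr(bad_str, encoding="utf-8"):
    """Turn "b'...'" (the output of str(bytes)) back into text.

    Hypothetical helper; strings that do not look like a bytes repr
    are returned unchanged.
    """
    looks_like_repr = (
        len(bad_str) >= 3
        and bad_str[0] == "b"
        and bad_str[1] in "'\""
        and bad_str[-1] == bad_str[1]
    )
    if not looks_like_repr:
        return bad_str
    # Drop the leading b and the surrounding quotes.
    inner = bad_str[2:-1]
    # A bytes repr is pure ASCII, so this encode is lossless.
    raw, _ = codecs.escape_decode(inner.encode("ascii"))
    return raw.decode(encoding)


s = 'žluťoučký'
print(unescape_bytes_repr(str(s.encode("utf-8"))) == s)
print(unescape_bytes_repr("already fine"))
```

Checking the prefix first means rows that were stored correctly pass through as they are, which matters if only part of the database is affected.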
answered Oct 20 '22 13:10 by Honza Král