Converting Python 3 String of Bytes of Unicode - `str(utf8_encoded_str)` back to unicode

Tags:

python

string

unicode

Well, let me introduce the problem first.

I received some data via POST/GET requests. The data were UTF-8 encoded byte strings. I didn't realize that and converted them with plain str(). Now I have a database full of "nonsense data" and can't find a way back.

Example code:

unicode_str - the string I want to obtain

encoded_str - the string I received from the POST/GET requests (the initial data)

bad_str - the data currently in the database, from which I need to recover the Unicode text

I know how to convert in this direction: unicode_str =(encode)=> encoded_str =(str)=> bad_str

But I can't come up with a way back: bad_str =(???)=> encoded_str =(decode)=> unicode_str

In [1]: unicode_str = 'Příliš žluťoučký kůň úpěl ďábelské ódy'

In [2]: unicode_str
Out[2]: 'Příliš žluťoučký kůň úpěl ďábelské ódy'

In [3]: encoded_str = unicode_str.encode("UTF-8")

In [4]: encoded_str
Out[4]: b'P\xc5\x99\xc3\xadli\xc5\xa1 \xc5\xbelu\xc5\xa5ou\xc4\x8dk\xc3\xbd k\xc5\xaf\xc5\x88 \xc3\xbap\xc4\x9bl \xc4\x8f\xc3\xa1belsk\xc3\xa9 \xc3\xb3dy'

In [5]: bad_str = str(encoded_str)

In [6]: bad_str
Out[6]: "b'P\\xc5\\x99\\xc3\\xadli\\xc5\\xa1 \\xc5\\xbelu\\xc5\\xa5ou\\xc4\\x8dk\\xc3\\xbd k\\xc5\\xaf\\xc5\\x88 \\xc3\\xbap\\xc4\\x9bl \\xc4\\x8f\\xc3\\xa1belsk\\xc3\\xa9 \\xc3\\xb3dy'"

In [7]: new_encoded_str = some_magical_function_here(bad_str) ???

257

asked Nov 16 '17 12:11

darkless

2 Answers

You converted a bytes object to a string, which gave you the textual representation (repr) of the bytes object rather than its decoded content. You can recover the original bytes object with ast.literal_eval() (credit to Mark Tolonen for the suggestion); a simple decode() then does the job.

>>> import ast
>>> ast.literal_eval(bad_str).decode('utf-8')
'Příliš žluťoučký kůň úpěl ďábelské ódy'

Since you generated the strings yourself, eval() would be safe here, but ast.literal_eval() only evaluates literals, so why not be safer?
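
If you need to repair many stored values, the one-liner above could be wrapped in a small helper. This is only a sketch: the name recover_text and the choice to raise ValueError for anything that is not a bytes literal are my own assumptions, not part of the original answer.

```python
import ast


def recover_text(bad_str, encoding="utf-8"):
    """Turn the str() of a bytes object back into the text it encoded.

    recover_text is a hypothetical helper name used for illustration.
    """
    # literal_eval only parses Python literals; it never executes code.
    value = ast.literal_eval(bad_str)
    if not isinstance(value, bytes):
        # Assumption: anything other than a bytes literal is rejected.
        raise ValueError("input is not the repr of a bytes object")
    return value.decode(encoding)


original = "Příliš žluťoučký kůň úpěl ďábelské ódy"
bad_str = str(original.encode("utf-8"))
print(recover_text(bad_str) == original)
```

Note that ast.literal_eval() itself raises ValueError or SyntaxError when the input is not a valid literal at all, so rows that were never mangled will need separate handling.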

105

answered Oct 20 '22 13:10

Reti43

Please do not use eval. Instead:

import codecs
s = 'žluťoučký'
x = str(s.encode('utf-8'))

# strip the leading b and the surrounding quotes
x = x[2:-1]

# unescape
x = codecs.escape_decode(x)[0].decode('utf-8')

# profit
x == s
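
The slicing approach above assumes every value really has the b'...' shape. A slightly more defensive variant might check that first. This is a sketch under assumptions: unrepr_bytes is a made-up name, and codecs.escape_decode is an undocumented CPython function, so treat it as an implementation detail that may not exist everywhere.

```python
import codecs


def unrepr_bytes(bad_str, encoding="utf-8"):
    """Undo str(some_bytes) without evaluating anything.

    unrepr_bytes is a hypothetical helper name used for illustration.
    """
    # str(bytes) yields b'...' or b"..." depending on the quotes inside.
    looks_like_repr = (
        len(bad_str) >= 3
        and bad_str[0] == "b"
        and bad_str[1] in "'\""
        and bad_str[-1] == bad_str[1]
    )
    if not looks_like_repr:
        raise ValueError("input does not look like the repr of a bytes object")
    # The repr of bytes is pure ASCII, so this encode cannot lose data.
    inner = bad_str[2:-1].encode("ascii")
    # escape_decode returns (decoded_bytes, consumed_length).
    raw, _ = codecs.escape_decode(inner)
    return raw.decode(encoding)


s = "žluťoučký"
print(unrepr_bytes(str(s.encode("utf-8"))) == s)
```

Checking the closing quote against the opening one matters because Python switches to double quotes when the bytes contain a single quote.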

answered Oct 20 '22 13:10

Honza Král
