How do I convert a unicode to a string at the Python level?

Tags:

python

unicode

python-2.x

The following unicode and string can exist on their own if defined explicitly:

>>> value_str='Andr\xc3\xa9'
>>> value_uni=u'Andr\xc3\xa9'

If I only have u'Andr\xc3\xa9' assigned to a variable like above, how do I convert it to 'Andr\xc3\xa9' in Python 2.5 or 2.6?

EDIT:

I did the following:

>>> value_uni.encode('latin-1')
'Andr\xc3\xa9'

which fixes my issue. Can someone explain to me what exactly is happening?

asked May 06 '10 17:05

Thierry Lam

2 Answers

You seem to have gotten your encodings muddled up. It seems likely that what you really want is u'Andr\xe9' which is equivalent to 'André'.

But what you have seems to be a UTF-8 encoding that has been incorrectly decoded. You can fix it by converting the unicode string to an ordinary string. I'm not sure what the best way is, but this seems to work:

>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9')
'Andr\xc3\xa9'

Then decode it correctly:

>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9').decode('utf8')
u'Andr\xe9'

Now it is in the correct format.
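The character-by-character join works because latin-1 maps code points 0-255 to the byte with the same value, so encoding with latin-1 should give the same bytes in one call. Here is a minimal sketch of the two-step repair, assuming Python 2.6+ or Python 3 (the variable names are illustrative, not from the question):

```python
# Sketch: repair text that was UTF-8 encoded but then decoded as latin-1.
# Intended to run on Python 2.6+ and Python 3; names are illustrative only.

mangled = u'Andr\xc3\xa9'              # the value the question starts with

# latin-1 maps code points 0-255 straight to the same byte values,
# so this step recovers the original UTF-8 bytes.
raw_bytes = mangled.encode('latin-1')

# Decode those bytes with the encoding they were actually written in.
repaired = raw_bytes.decode('utf-8')

print(repr(raw_bytes))
print(repr(repaired))
```

The first step is what the question's EDIT did with value_uni.encode('latin-1'); the second step is the decode shown above.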

However instead of doing this, if possible you should try to work out why the data has been incorrectly encoded in the first place, and fix that problem there.

answered Sep 21 '22 18:09

Mark Byers

You asked (in a comment) """That is what's puzzling me. How did it go from it original accented to what it is now? When you say double encoding with utf8 and latin1, is that a total of 3 encodings(2 utf8 + 1 latin1)? What's the order of the encode from the original state to the current one?"""

In the answer by Mark Byers, he says """what you have seems to be a UTF-8 encoding that has been incorrectly decoded""". You have accepted his answer. But you are still puzzled? OK, here's the blow-by-blow description:

Note: All strings will be displayed using (implicitly) repr(). unicodedata.name() will be used to verify the contents. That way, variations in console encoding cannot confuse interpretation of the strings.

Initial state: you have a unicode object that you have named u1. It contains e-acute:

>>> u1 = u'\xe9'
>>> import unicodedata as ucd
>>> ucd.name(u1)
'LATIN SMALL LETTER E WITH ACUTE'

You encode u1 as UTF-8 and name the result s:

>>> s = u1.encode('utf8')
>>> s
'\xc3\xa9'

You decode s using latin1 -- INCORRECTLY; s was encoded using utf8, NOT latin1. The result is meaningless rubbish.

>>> u2 = s.decode('latin1')
>>> u2
u'\xc3\xa9'
>>> ucd.name(u2[0]); ucd.name(u2[1])
'LATIN CAPITAL LETTER A WITH TILDE'
'COPYRIGHT SIGN'

Please understand: unicode_object.encode('x').decode('y') when x != y is normally [see note below] a nonsense; it will raise an exception if you are lucky; if you are unlucky it will silently create gibberish. Also please understand that silently creating gibberish is not a bug -- there is no general way that Python (or any other language) can detect that a nonsense has been committed. This applies particularly when latin1 is involved, because all 256 possible byte values map 1 to 1 to the first 256 Unicode codepoints, so it is impossible to get a UnicodeDecodeError from str_object.decode('latin1').
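To make the lucky and unlucky cases concrete, here is a small sketch, assuming Python 2.6+ or Python 3 (the names original, silent and raised are illustrative):

```python
# Sketch: a mismatched encode/decode pair either fails loudly
# or silently produces gibberish, depending on the codecs involved.

original = u'\xe9'                      # e-acute

# Unlucky case: UTF-8 bytes decoded as latin-1 never raise, because
# every byte value 0-255 is a valid latin-1 character.
silent = original.encode('utf-8').decode('latin-1')

# Lucky case: latin-1 bytes decoded as UTF-8 raise here, because the
# lone byte 0xE9 is not a complete UTF-8 sequence.
try:
    original.encode('latin-1').decode('utf-8')
    raised = False
except UnicodeDecodeError:
    raised = True

print(repr(silent))
print(raised)
```

The first case yields two characters where there was one; the second is the exception you get if you are lucky.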

Of course, abnormally (one hopes that it's abnormal) you may need to reverse out such a nonsense by doing gibberish_unicode_object.encode('y').decode('x') as suggested in various answers to your question.
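If you do have to reverse out such a nonsense in more than one place, it could be wrapped in a guarded helper. This is only a sketch: undo_wrong_decode and its parameter names are hypothetical, and it assumes Python 2.6+ or Python 3.

```python
def undo_wrong_decode(text, wrong_codec='latin-1', right_codec='utf-8'):
    """Hypothetical helper: reverse a decode done with the wrong codec.

    Returns text unchanged when the round trip is not possible, which
    usually means the text was not mangled in this particular way.
    """
    try:
        return text.encode(wrong_codec).decode(right_codec)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


print(repr(undo_wrong_decode(u'Andr\xc3\xa9')))   # mangled input
print(repr(undo_wrong_decode(u'Andr\xe9')))       # already correct input
```

The guard is not a reliable detector: a correctly decoded string that happens to look like mis-decoded UTF-8 would still be rewritten, so fixing the source of the bad decode remains the better option.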

answered Sep 23 '22 18:09

John Machin
