
How do I convert a unicode to a string at the Python level?

The following unicode and string can exist on their own if defined explicitly:

>>> value_str='Andr\xc3\xa9'
>>> value_uni=u'Andr\xc3\xa9'

If I only have u'Andr\xc3\xa9' assigned to a variable like above, how do I convert it to 'Andr\xc3\xa9' in Python 2.5 or 2.6?

EDIT:

I did the following:

>>> value_uni.encode('latin-1')
'Andr\xc3\xa9'

which fixes my issue. Can someone explain to me what exactly is happening?

Thierry Lam asked May 06 '10 17:05



2 Answers

You seem to have gotten your encodings muddled up. It seems likely that what you really want is u'Andr\xe9' which is equivalent to 'André'.

But what you have seems to be a UTF-8 encoding that has been incorrectly decoded. You can fix it by converting the unicode string to an ordinary string. I'm not sure what the best way is, but this seems to work:

>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9')
'Andr\xc3\xa9'

Then decode it correctly:

>>> ''.join(chr(ord(c)) for c in u'Andr\xc3\xa9').decode('utf8')
u'Andr\xe9'

Now it is in the correct format.

However, instead of doing this, if possible you should work out why the data was incorrectly decoded in the first place, and fix the problem there.
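For readers on Python 3, here is a minimal sketch of the same two-step repair. It assumes Python 3 semantics, where `str` plays the role of the Python 2 `unicode` type and `bytes` plays the role of the 8-bit string; the variable names are illustrative only. `ascii()` is used for display so the output does not depend on the console encoding.

```python
# Python 3 sketch (assumption: the thread itself uses Python 2).
# This str holds the same code points as u'Andr\xc3\xa9' in Python 2.
mis_decoded = 'Andr\xc3\xa9'

# Step 1: every code point is below 256, so latin-1 maps each one
# back to the byte with the same numeric value.
raw_bytes = mis_decoded.encode('latin-1')

# The generator approach from the answer produces the same bytes.
same_bytes = bytes(ord(c) for c in mis_decoded)

# Step 2: those bytes are valid UTF-8, so decode them with the right codec.
repaired = raw_bytes.decode('utf-8')

print(raw_bytes)        # b'Andr\xc3\xa9'
print(ascii(repaired))  # 'Andr\xe9'
```

Using `encode('latin-1')` rather than the character-by-character join gives the same bytes here and fails loudly with a `UnicodeEncodeError` if any code point is 256 or above.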

Mark Byers answered Sep 21 '22 18:09


You asked (in a comment) """That is what's puzzling me. How did it go from it original accented to what it is now? When you say double encoding with utf8 and latin1, is that a total of 3 encodings(2 utf8 + 1 latin1)? What's the order of the encode from the original state to the current one?"""

In the answer by Mark Byers, he says """what you have seems to be a UTF-8 encoding that has been incorrectly decoded""". You have accepted his answer. But you are still puzzled? OK, here's the blow-by-blow description:

Note: All strings will be displayed using (implicitly) repr(). unicodedata.name() will be used to verify the contents. That way, variations in console encoding cannot confuse interpretation of the strings.

Initial state: you have a unicode object that you have named u1. It contains e-acute:

>>> u1 = u'\xe9'
>>> import unicodedata as ucd
>>> ucd.name(u1)
'LATIN SMALL LETTER E WITH ACUTE'

You encode u1 as UTF-8 and name the result s:

>>> s = u1.encode('utf8')
>>> s
'\xc3\xa9'

You decode s using latin1 -- INCORRECTLY; s was encoded using utf8, NOT latin1. The result is meaningless rubbish.

>>> u2 = s.decode('latin1')
>>> u2
u'\xc3\xa9'
>>> ucd.name(u2[0]); ucd.name(u2[1])
'LATIN CAPITAL LETTER A WITH TILDE'
'COPYRIGHT SIGN'
>>>

Please understand: unicode_object.encode('x').decode('y') when x != y is normally [see note below] a nonsense; it will raise an exception if you are lucky; if you are unlucky it will silently create gibberish. Also please understand that silently creating gibberish is not a bug -- there is no general way that Python (or any other language) can detect that a nonsense has been committed. This applies particularly when latin1 is involved, because all 256 byte values map 1 to 1 onto the first 256 Unicode codepoints, so it is impossible to get a UnicodeDecodeError from str_object.decode('latin1').
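The lucky and unlucky outcomes can be checked directly. The following is a small sketch, again assuming Python 3 (`bytes` and `str` instead of Python 2's `str` and `unicode`); the variable names are made up for illustration.

```python
# Python 3 sketch (assumption: the thread itself uses Python 2).

# latin-1 accepts every possible byte value and maps byte N to code point N,
# so decoding with it can never raise UnicodeDecodeError.
all_bytes = bytes(range(256))
decoded = all_bytes.decode('latin-1')
latin1_is_identity = [ord(ch) for ch in decoded] == list(range(256))

# UTF-8 encoding of e-acute: b'\xc3\xa9'
s = '\xe9'.encode('utf-8')

# Lucky case: a stricter wrong codec notices the problem and raises.
try:
    s.decode('ascii')
    raised = False
except UnicodeDecodeError:
    raised = True

# Unlucky case: latin-1 silently yields two unrelated characters.
gibberish = s.decode('latin-1')

print(latin1_is_identity)  # True
print(raised)              # True
print(ascii(gibberish))    # '\xc3\xa9'
```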

Of course, abnormally (one hopes that it's abnormal) you may need to reverse out such a nonsense by doing gibberish_unicode_object.encode('y').decode('x') as suggested in various answers to your question.
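As a rough illustration of that reversal, the sketch below wraps it in a helper. `reverse_mismatch` and its parameter names are hypothetical, not part of any library, and the code assumes Python 3. If the reversal is not possible, the helper returns the text unchanged. It cannot tell whether text that happens to survive the round trip was actually damaged, so it should only be applied to data known to have been mis-decoded.

```python
import unicodedata

def reverse_mismatch(text, wrong_codec='latin-1', right_codec='utf-8'):
    """Hypothetical helper: undo a decode done with wrong_codec
    on bytes that were really encoded with right_codec."""
    try:
        return text.encode(wrong_codec).decode(right_codec)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text cannot have come from that particular mix-up.
        return text

u1 = '\xe9'                                # e-acute
u2 = u1.encode('utf-8').decode('latin-1')  # the mistake: two junk characters
restored = reverse_mismatch(u2)

print(unicodedata.name(restored))          # LATIN SMALL LETTER E WITH ACUTE

# Text that was never damaged is left alone in these two cases:
untouched_euro = reverse_mismatch('\u20ac')   # not encodable as latin-1
untouched_e = reverse_mismatch('\xe9')        # b'\xe9' is not valid UTF-8
```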

John Machin answered Sep 23 '22 18:09