
Handle wrongly encoded character in Python unicode string

I am dealing with unicode strings returned by the python-lastfm library.

I assume somewhere on the way, the library gets the encoding wrong and returns a unicode string that may contain invalid characters.

For example, the string I expect in the variable a is "Glück":

>>> a
u'Gl\xfcck'
>>> print a
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 2: ordinal not in range(128)

\xfc is the escaped value 252, which corresponds to the Latin-1 encoding of "ü". Somehow this gets embedded in the unicode string in a way Python can't handle on its own.
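As a quick check of that value, the character can be inspected directly. This is only a sketch using the string from the session above; the names code_point, char_name and latin1_bytes are illustrative and not part of the original session.

```python
# -*- coding: utf-8 -*-
import unicodedata

# The string exactly as shown in the interactive session above.
a = u'Gl\xfcck'

# Numeric value of the third character (index 2).
code_point = ord(a[2])

# Official Unicode name of that character.
char_name = unicodedata.name(a[2])

# Latin-1 maps code points 0-255 to single bytes with the same value.
latin1_bytes = a.encode('latin-1')
```

Here code_point is 252 and char_name is LATIN SMALL LETTER U WITH DIAERESIS, so the string does hold the expected "ü"; the error in the traceback comes from the 'ascii' codec used when printing.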

How do I convert this back to a normal or unicode string that contains the original "Glück"? I tried playing around with the decode/encode methods, but either got a UnicodeEncodeError or a string containing the sequence \xfc.

strfry asked Apr 22 '11 23:04




1 Answer

You have to encode your unicode string into a standard (byte) string using an explicit encoding, e.g. UTF-8:

some_unicode_string.encode('utf-8')
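A minimal sketch of this applied to the string from the question. It assumes the output destination expects UTF-8, which the question does not state; the names utf8_bytes, ascii_failed and round_trip are illustrative only.

```python
# -*- coding: utf-8 -*-

# The string from the question.
a = u'Gl\xfcck'

# Explicit encoding: "ü" (U+00FC) becomes the two bytes 0xC3 0xBC in UTF-8.
utf8_bytes = a.encode('utf-8')

# The ASCII codec cannot represent U+00FC, which is what the traceback
# in the question reports.
try:
    a.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Decoding with the same codec restores the original unicode string.
round_trip = utf8_bytes.decode('utf-8')
```

On Python 2, as used in the question, print a.encode('utf-8') then writes those bytes to the terminal instead of falling back to the 'ascii' codec.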

Apart from that: this is a dupe of

BeautifulSoup findall with class attribute- unicode encode error

and at least ten other related questions on SO. Research first.

Andreas Jung answered Oct 03 '22 16:10
