
How to convert a string from CP-1251 to UTF-8?

I'm using mutagen to convert ID3 tag data from CP-1251/CP-1252 to UTF-8. On Linux there is no problem, but on Windows, calling SetValue() on a wx.TextCtrl produces this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

The original string (assumed to be CP-1251 encoded) that I'm pulling from mutagen is:

u'\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3'

I've tried converting this to UTF-8:

dd = d.decode('utf-8')

...and even changing the default encoding from ASCII to UTF-8:

sys.setdefaultencoding('utf-8')

...But I get the same error.

jsnjack asked Sep 26 '11 12:09



4 Answers

If you know for sure that you have cp1251 in your input, you can do

d.decode('cp1251').encode('utf8')
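
For readers on Python 3, where byte strings and text are separate types, the same one-liner can be sketched roughly as follows. The function name `cp1251_to_utf8` and the sample bytes are illustrative assumptions, not part of the original answer, and the sketch only makes sense if the input really is CP-1251.

```python
def cp1251_to_utf8(raw: bytes) -> bytes:
    """Transcode CP-1251 bytes to UTF-8 bytes (assumes the input is CP-1251)."""
    text = raw.decode("cp1251")   # bytes -> str
    return text.encode("utf-8")   # str -> UTF-8 bytes


# Sample input: the first word of the question's tag, as raw CP-1251 bytes.
sample = b"\xc1\xe5\xeb\xe0\xff"
converted = cp1251_to_utf8(sample)
print(converted)  # prints the bytes repr, which is safe on any console
```

Each Cyrillic letter takes one byte in CP-1251 and two bytes in UTF-8, so the result is twice as long as the input here.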
Johannes Charra answered Oct 13 '22 23:10


Your string d is a Unicode string, not a UTF-8-encoded string! So you can't decode() it; you must encode() it to UTF-8 or whatever encoding you need.

>>> d = u'\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3'
>>> d
u'\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3'
>>> print d
Áåëàÿ ÿáëûíÿ ãðîìó
>>> d.encode("utf-8")
'\xc3\x81\xc3\xa5\xc3\xab\xc3\xa0\xc3\xbf \xc3\xbf\xc3\xa1\xc3\xab\xc3\xbb\xc3\xad\xc3\xbf \xc3\xa3\xc3\xb0\xc3\xae\xc3\xac\xc3\xb3'

(which is something you'd do at the very end of all processing when you need to save it as a UTF-8 encoded file, for example).

If your input is in a different encoding, it's the other way around:

>>> d = "Schoßhündchen"                 # native encoding: cp850
>>> d = "Schoßhündchen".decode("cp850") # decode from Windows codepage
>>> d                                   # into a Unicode string (now work with this!)
u'Scho\xdfh\xfcndchen'
>>> print d                             # it displays correctly if your shell knows the glyphs
Schoßhündchen
>>> d.encode("utf-8")                   # before output, convert to UTF-8
'Scho\xc3\x9fh\xc3\xbcndchen'
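
The "decode on input, work with Unicode, encode on output" pattern above might look like this in Python 3 when the data lives in files. This is only a sketch: `transcode_file` and the file names are hypothetical, and the source encoding (cp850 here, as in the transcript) has to be known in advance.

```python
import os
import tempfile


def transcode_file(src_path, dst_path, src_encoding, dst_encoding="utf-8"):
    """Read text in src_encoding and write it back out in dst_encoding."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(src_path, "r", encoding=src_encoding, newline="") as src:
        text = src.read()            # decoded once, at the input boundary
    with open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
        dst.write(text)              # encoded once, at the output boundary
    return text


# Demo with a throwaway cp850 file (hypothetical file names).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "input_cp850.txt")
dst_path = os.path.join(workdir, "output_utf8.txt")
with open(src_path, "wb") as f:
    f.write("Schoßhündchen".encode("cp850"))

text = transcode_file(src_path, dst_path, "cp850")
with open(dst_path, "rb") as f:
    utf8_bytes = f.read()
print(utf8_bytes)
```

Passing `encoding=` to `open()` lets the file object do the decoding and encoding, so the rest of the program only ever sees text.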
Tim Pietzcker answered Oct 13 '22 23:10


If d is a correct Unicode string, then d.encode('utf-8') yields a UTF-8-encoded bytestring. Don't test it by printing, though: the output might simply fail to display properly because of the console's codepage.
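
One way to check the result without relying on the console is to inspect code points and bytes as plain ASCII. The sketch below is written for Python 3, and `describe` is a hypothetical helper introduced only for illustration.

```python
def describe(text: str) -> str:
    """Return an ASCII-only description of text: code points and UTF-8 bytes."""
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    utf8_hex = text.encode("utf-8").hex()
    return f"{code_points} | utf-8: {utf8_hex}"


value = "Scho\xdfh\xfcndchen"
encoded = value.encode("utf-8")

# A lossless round trip confirms the bytes are valid UTF-8 for this text.
round_trip_ok = encoded.decode("utf-8") == value
print(ascii(value), round_trip_ok)
print(describe("\xdf"))
```

Because `ascii()` and `.hex()` emit only ASCII characters, the output looks the same whatever codepage the terminal uses.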

Cat Plus Plus answered Oct 14 '22 00:10


I'd rather add a comment to Александр Степаненко's answer, but my reputation doesn't yet allow it. I had a similar problem converting MP3 tags from CP-1251 to UTF-8, and the encode/decode/encode solution worked for me, except that I had to replace the first encoding with "latin-1", which essentially converts the Unicode string into a byte sequence without any real encoding (each code point below 256 maps to the byte with the same value):

print text.encode("latin-1").decode('cp1251').encode('utf8')

When saving the value back, for example with mutagen, it doesn't need to be encoded again:

audio["title"] = title.encode("latin-1").decode('cp1251')
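
Applied to the string from the question, the Latin-1 round trip can be sketched in Python 3 as below. The helper name `repair_cp1251_mojibake` is an assumption for illustration. The sketch also assumes the text was mis-decoded as Latin-1 rather than CP-1252, which maps some bytes in the 0x80-0x9F range differently.

```python
def repair_cp1251_mojibake(text: str) -> str:
    """Recover Cyrillic text whose CP-1251 bytes were misread as Latin-1."""
    raw = text.encode("latin-1")   # code points 0-255 -> identical byte values
    return raw.decode("cp1251")    # reinterpret those bytes as CP-1251


# The mis-decoded tag from the question.
garbled = "\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3"
title = repair_cp1251_mojibake(garbled)
print(ascii(title))
```

If the input already holds correctly decoded Cyrillic, `encode("latin-1")` raises `UnicodeEncodeError`, so the helper fails loudly instead of corrupting good data.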
Andrey answered Oct 14 '22 00:10