I'm using mutagen to convert ID3 tag data from CP-1251/CP-1252 to UTF-8. On Linux there is no problem, but on Windows, calling SetValue() on a wx.TextCtrl produces the error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
The original string (assumed to be CP-1251 encoded) that I'm pulling from mutagen is:
u'\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3'
I've tried converting this to UTF-8:
dd = d.decode('utf-8')
...and even changing the default encoding from ASCII to UTF-8:
sys.setdefaultencoding('utf-8')
...But I get the same error.
UTF-8 is a byte oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes.
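To make the "one or more bytes" point concrete, here is a minimal sketch that counts the UTF-8 bytes for a few characters. It assumes Python 3 (where str is already Unicode, unlike the Python 2 snippets in this thread), and the helper name utf8_length is made up for illustration.

```python
# Assumes Python 3. utf8_length is a hypothetical helper, not a library call.
def utf8_length(ch):
    """Return how many bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))

# ASCII letter, Latin-1 letter, Cyrillic letter, euro sign, emoji
samples = ["A", "\u00df", "\u0411", "\u20ac", "\U0001F600"]

for ch in samples:
    print("U+%04X takes %d byte(s)" % (ord(ch), utf8_length(ch)))
```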
While a JavaScript source file can have any kind of encoding, JavaScript will then convert it internally to UTF-16 before executing it. JavaScript strings are all UTF-16 sequences, as the ECMAScript standard says: When a String contains actual textual data, each element is considered to be a single UTF-16 code unit.
If you know for sure that your input is a CP-1251 byte string, you can do
d.decode('cp1251').encode('utf8')
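The same decode-then-encode step might look like this in Python 3, where byte strings and text are separate types. This is only a sketch: the function name cp1251_to_utf8 is hypothetical, and it assumes the input really is CP-1251 bytes.

```python
# Assumes Python 3 and that the input really is CP-1251 encoded bytes.
# cp1251_to_utf8 is a hypothetical helper name used for illustration.
def cp1251_to_utf8(raw):
    """Decode CP-1251 bytes to text, then re-encode that text as UTF-8 bytes."""
    text = raw.decode("cp1251")   # bytes -> Unicode text
    return text.encode("utf-8")   # Unicode text -> UTF-8 bytes

raw = b"\xc1\xe5\xeb\xe0\xff"     # first word of the tag from the question
print(cp1251_to_utf8(raw).decode("utf-8"))
```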
Your string d is a Unicode string, not a UTF-8-encoded string! So you can't decode() it; you must encode() it to UTF-8 or whatever encoding you need.
>>> d = u'\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3'
>>> d
u'\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3'
>>> print d
Áåëàÿ ÿáëûíÿ ãðîìó
>>> d.encode("utf-8")
'\xc3\x81\xc3\xa5\xc3\xab\xc3\xa0\xc3\xbf \xc3\xbf\xc3\xa1\xc3\xab\xc3\xbb\xc3\xad\xc3\xbf \xc3\xa3\xc3\xb0\xc3\xae\xc3\xac\xc3\xb3'
(which is something you'd do at the very end of all processing when you need to save it as a UTF-8 encoded file, for example).
If your input is in a different encoding, it's the other way around:
>>> d = "Schoßhündchen" # native encoding: cp850
>>> d = "Schoßhündchen".decode("cp850") # decode from Windows codepage
>>> d # into a Unicode string (now work with this!)
u'Scho\xdfh\xfcndchen'
>>> print d # it displays correctly if your shell knows the glyphs
Schoßhündchen
>>> d.encode("utf-8") # before output, convert to UTF-8
'Scho\xc3\x9fh\xc3\xbcndchen'
If d is a correct Unicode string, then d.encode('utf-8') yields a UTF-8 encoded bytestring. Don't test it by printing, though: the output may simply display incorrectly because of the console codepage, even when the data itself is fine.
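One way to check a string without relying on the console is to look at its code points and the escaped bytes instead of the rendered glyphs. A small sketch, assuming Python 3; the helper name describe is invented for this example.

```python
# Assumes Python 3. describe is a hypothetical helper for inspection only.
def describe(text):
    """List the Unicode code points of a string, independent of the console."""
    return ["U+%04X" % ord(ch) for ch in text]

d = "Scho\u00dfh\u00fcndchen"
encoded = d.encode("utf-8")

print(describe(d))     # code points, printed as plain ASCII
print(ascii(d))        # escaped form of the text
print(repr(encoded))   # escaped form of the UTF-8 bytes
```

If the escaped bytes are what you expect, any garbling you see from a plain print comes from the terminal, not from the data.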
I'd rather add a comment to Александр Степаненко's answer, but my reputation doesn't yet allow it. I had a similar problem converting MP3 tags from CP-1251 to UTF-8, and the encode/decode/encode solution worked for me, except that I had to use "latin-1" for the first encoding. Latin-1 maps each code point in the range U+0000 to U+00FF to the byte with the same value, so it turns the Unicode string back into the original byte sequence without any real re-encoding:
print text.encode("latin-1").decode('cp1251').encode('utf8')
When saving the tag back (for example with mutagen), the result doesn't need to be encoded again; the decoded Unicode string is assigned directly:
audio["title"] = title.encode("latin-1").decode('cp1251')
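The latin-1 round trip can be wrapped in a small helper that leaves already-correct text alone. This is a sketch under assumptions: it targets Python 3, the name fix_cp1251_mojibake is hypothetical, and it assumes the mis-decoded tag was originally CP-1251.

```python
# Assumes Python 3 and that the tag text was CP-1251 bytes wrongly read as
# Latin-1. fix_cp1251_mojibake is a hypothetical name, not a mutagen function.
def fix_cp1251_mojibake(text):
    """Recover the original bytes via Latin-1, then decode them as CP-1251."""
    try:
        return text.encode("latin-1").decode("cp1251")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Text has characters outside Latin-1 (e.g. it is already Cyrillic)
        # or bytes that CP-1251 does not define: return it unchanged.
        return text

garbled = "\xc1\xe5\xeb\xe0\xff \xff\xe1\xeb\xfb\xed\xff \xe3\xf0\xee\xec\xf3"
print(fix_cp1251_mojibake(garbled))
```

Returning the input unchanged on failure means the helper can be run over a whole library without damaging tags that were stored correctly. It cannot detect text that is legitimately Latin-1 (for example Western European titles), so apply it only where CP-1251 is expected.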