I am trying to decode a Brazilian Portuguese text:
'Demais Subfun\xc3\xa7\xc3\xb5es 12'
It should be
'Demais Subfunções 12'
>> a.decode('unicode_escape')
>> a.encode('unicode_escape')
>> a.decode('ascii')
>> a.encode('ascii')
all give:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13: ordinal not in range(128)
On the other hand, these give:
>> print a.encode('utf-8')
Demais Subfun├â┬º├â┬Áes 12
>> print a
Demais Subfunções 12
You have binary data that is not ASCII encoded. The \xhh escape sequences indicate your data is encoded with a different codec, and you are seeing Python produce a representation of the data using the repr() function, which can be re-used as a Python literal to re-create the exact same value. This representation is very useful when debugging a program.
In other words, the \xhh escape sequences represent individual bytes, and the hh is the hex value of that byte. You have 4 bytes with hex values C3, A7, C3 and B5 that do not map to printable ASCII characters, so Python uses the \xhh notation instead.
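To see those four bytes for yourself, here is a minimal sketch that lists every byte above the ASCII range together with its position. The names data and non_ascii are just illustrative, and the b prefix is an assumption added so the snippet behaves the same on Python 2 and Python 3.

```python
# The b prefix makes this a byte string on both Python 2 and Python 3.
data = b'Demais Subfun\xc3\xa7\xc3\xb5es 12'

# Iterating over a bytearray yields integers on either Python version.
non_ascii = [(i, '%02X' % b) for i, b in enumerate(bytearray(data)) if b > 0x7F]

print(non_ascii)
print(len(data))
```

The first offending byte sits at index 13, which is the position reported in the UnicodeDecodeError from the question.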
Your data is actually UTF-8, so decode it as such:
>>> 'Demais Subfun\xc3\xa7\xc3\xb5es 12'.decode('utf8')
u'Demais Subfun\xe7\xf5es 12'
>>> print 'Demais Subfun\xc3\xa7\xc3\xb5es 12'.decode('utf8')
Demais Subfunções 12
The C3 A7 bytes together encode U+00E7 LATIN SMALL LETTER C WITH CEDILLA, while the C3 B5 bytes encode U+00F5 LATIN SMALL LETTER O WITH TILDE.
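If you want to check that mapping rather than take it on faith, the following sketch decodes the bytes and then reports, for each non-ASCII character, its code point, its Unicode name, and the UTF-8 bytes it came from. The helper name describe_non_ascii is hypothetical; unicodedata.name is part of the standard library.

```python
import unicodedata

data = b'Demais Subfun\xc3\xa7\xc3\xb5es 12'
text = data.decode('utf-8')


def describe_non_ascii(s):
    """Return (code point, Unicode name, UTF-8 bytes in hex) per non-ASCII character."""
    rows = []
    for ch in s:
        if ord(ch) > 0x7F:
            encoded = bytearray(ch.encode('utf-8'))
            hex_bytes = ' '.join('%02X' % b for b in encoded)
            rows.append(('U+%04X' % ord(ch), unicodedata.name(ch), hex_bytes))
    return rows


for row in describe_non_ascii(text):
    print('%s %s <- %s' % row)
```

Each of the two characters takes two bytes in UTF-8, which is why the 22-byte string decodes to 20 characters.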
ASCII happens to be a subset of UTF-8, which is why all the other letters can be shown as plain characters in the Python repr() output.
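A small sketch of that subset relationship: the pure-ASCII prefix decodes identically under both codecs, while the full string only fails under ASCII once it reaches the first byte above 0x7F. The variable names are illustrative only.

```python
ascii_part = b'Demais Subfun'
data = b'Demais Subfun\xc3\xa7\xc3\xb5es 12'

# Bytes in the range 0x00-0x7F mean the same thing in ASCII and UTF-8.
same = ascii_part.decode('ascii') == ascii_part.decode('utf-8')
print(same)

# The ASCII codec gives up at the first byte outside that range.
try:
    data.decode('ascii')
    failure = None
except UnicodeDecodeError as exc:
    failure = (exc.start, exc.reason)

print(failure)
```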