I'm struggling with this:
b'"\xc2\xb7\xed\xa0\x81\xed\xb1\x96\xed\xa0\x81\xed\xb1\xb1\xed\xa0\x81\xed\xb1\x9d\xed\xa0\x81\xed\xb1\xbe\xed\xa0\x81\xed\xb1\xaf \xed\xa0\x81\xed\xb1\xa9\xed\xa0\x81\xed\xb1\xa4\xed\xa0\x81\xed\xb1\x93\xed\xa0\x81\xed\xb1\xa9\xed\xa0\x81\xed\xb1\x9a\xed\xa0\x81\xed\xb1\xa7\xed\xa0\x81\xed\xb1\x91"@en'
which comes from the binary HDT compressed version (https://github.com/rdfhdt/hdt-cpp) of DBpedia 3.5.1 (http://dbpedia.org/page/Shavian_alphabet), and which is decoded correctly as UTF-8 by this website (https://mothereff.in/utf-8)
And the meaning is: "·𐑖𐑱𐑝𐑾𐑯 𐑩𐑤𐑓𐑩𐑚𐑧𐑑"@en
But in Python 3.7.3, when trying mystring.decode('utf8'), I encountered the well-known error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 3: invalid continuation byte
If I try to do the opposite: '"·𐑖𐑱𐑝𐑾𐑯 𐑩𐑤𐑓𐑩𐑚𐑧𐑑"@en'.encode('utf8')
I get the following representation: b'"\xf0\x90\x91\x96\xf0\x90\x91\xb1\xf0\x90\x91\x9d\xf0\x90\x91\xbe\xf0\x90\x91\xaf \xf0\x90\x91\xa8\xf0\x90\x91\xa4\xf0\x90\x91\x93\xf0\x90\x91\xa9\xf0\x90\x91\x9a\xf0\x90\x91\xa7\xf0\x90\x91\x91"@en'
which is not exactly the same byte string, but it is then correctly decoded by repr.decode('utf8') back into the same text.
Can someone help me understand why decoding the first byte string does not work? I know the first byte string is not valid UTF-8, given the error. But then, why is it decoded correctly by the website I linked, yet can't be decoded by Python? Thank you in advance!
FINAL EDIT: After having accepted the answer, I did some extra research on this and found that this string was encoded using the CESU-8 codec, which is clearly deprecated today, but which some software still produces. So I found a package that provides a variant of the UTF-8 codec which can decode this string. I think it will help a lot of people with the same problem as me. Python library: https://github.com/LuminosoInsight/python-ftfy The added codec is 'utf-8-variants'. I hope this will help people with the same needs as me.
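For reference, here is a minimal sketch of how that codec can be used, assuming ftfy is installed (pip install ftfy) and that importing its ftfy.bad_codecs module registers the 'utf-8-variants' codec, as its documentation describes:

# A minimal sketch: importing ftfy.bad_codecs registers extra codecs,
# including 'utf-8-variants', which accepts CESU-8 style surrogate sequences.
import ftfy.bad_codecs  # the import itself registers the codecs

data = (b'"\xc2\xb7\xed\xa0\x81\xed\xb1\x96\xed\xa0\x81\xed\xb1\xb1'
        b'\xed\xa0\x81\xed\xb1\x9d\xed\xa0\x81\xed\xb1\xbe\xed\xa0\x81\xed\xb1\xaf'
        b' \xed\xa0\x81\xed\xb1\xa9\xed\xa0\x81\xed\xb1\xa4\xed\xa0\x81\xed\xb1\x93'
        b'\xed\xa0\x81\xed\xb1\xa9\xed\xa0\x81\xed\xb1\x9a\xed\xa0\x81\xed\xb1\xa7'
        b'\xed\xa0\x81\xed\xb1\x91"@en')

print(data.decode('utf-8-variants'))  # '"·𐑖𐑱𐑝𐑾𐑯 𐑩𐑤𐑓𐑩𐑚𐑧𐑑"@en'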
decode() is used to decode a bytes object into a string (str) object. How the bytes are decoded depends on the arguments you pass: the name of the encoding the bytes are in, plus an optional error-handling scheme for dealing with decoding errors. Note: bytes is a built-in binary sequence type in Python.
In Python 2, decode() was a method on str objects and converted the string from the encoding it was in to the desired one; it works as the opposite of encode(). In Python 3, decode() is a method on bytes: it decodes the bytes using the codec registered for the given encoding and returns a str object. If no encoding is given, 'utf-8' is used by default.
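As a small illustration of that behaviour (plain standard library, nothing specific to the question's data):

data = b'caf\xc3\xa9'

# decode() turns bytes into str; UTF-8 is the default encoding.
print(data.decode())                 # café

# With the default errors='strict', invalid bytes raise UnicodeDecodeError.
bad = b'caf\xe9'                     # 0xE9 alone is not valid UTF-8
try:
    bad.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc)

# Other error handlers, e.g. 'replace', substitute a replacement character instead of raising.
print(bad.decode('utf-8', errors='replace'))   # caf�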
It seems that Python does not want to accept some sequence of bytes as valid UTF-8, whereas some website (https://mothereff.in/utf-8) accepts it. One of them must be wrong, right? Let's see.
The first two bytes (b'\xc2\xb7') are accepted by Python. The first thing which Python does not like is this: \xed\xa0\x81\xed\xb1\x96, which is interpreted on that website as 𐑖.
Let's look at \xed\xa0\x81\xed\xb1\x96 in binary format:
11101101
10100000
10000001
11101101
10110001
10010110
RFC3629 says that UTF-8 is interpreted as:
Char. number range   |        UTF-8 octet sequence
   (hexadecimal)     |              (binary)
---------------------+---------------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Therefore, there are two three-byte characters:
11101101 10100000 10000001 → 1101100000000001, or D801
11101101 10110001 10010110 → 1101110001010110, or DC56
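You can reproduce this in Python with the 'surrogatepass' error handler, which lets those surrogate code points through (a small verification sketch, not part of the original reasoning):

# Decode each 3-byte sequence while allowing surrogate code points.
high = b'\xed\xa0\x81'.decode('utf-8', 'surrogatepass')
low = b'\xed\xb1\x96'.decode('utf-8', 'surrogatepass')

print(hex(ord(high)))  # 0xd801  (a high surrogate)
print(hex(ord(low)))   # 0xdc56  (a low surrogate)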
Character D801 is one of the high surrogates and DC56 is one of the low surrogates.
You can see here how to combine the surrogates:
A surrogate pair denotes the code point 0x10000 + (H − 0xD800) × 0x400 + (L − 0xDC00), where H and L are the numeric values of the high and low surrogates respectively.
If you combine them, you'll get:
0x10000 + (0xD801 - 0xD800) * 0x400 + (0xDC56 - 0xDC00) = 0x10456, which is 𐑖
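The same arithmetic, checked in Python (just a verification sketch):

H, L = 0xD801, 0xDC56
code_point = 0x10000 + (H - 0xD800) * 0x400 + (L - 0xDC00)

print(hex(code_point))  # 0x10456
print(chr(code_point))  # 𐑖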
However, the high and low surrogates were designed for UTF-16 representation of characters which do not fit into 16 bits, and this is what unicode.org says about using such surrogate pairs in UTF-8:
Q: How do I convert a UTF-16 surrogate pair such as <D800 DC00> to UTF-8? As one 4-byte sequence or as two separate 3-byte sequences?
A: The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single 4-byte sequence. However, there is a widespread practice of generating pairs of 3-byte sequences in older software, especially software which pre-dates the introduction of UTF-16 or that is interoperating with UTF-16 environments under particular constraints. Such an encoding is not conformant to UTF-8 as defined. See UTR #26: Compatibility Encoding Scheme for UTF-16: 8-bit (CESU-8) for a formal description of such a non-UTF-8 data format. When using CESU-8, great care must be taken that data is not accidentally treated as if it was UTF-8, due to the similarity of the formats. [AF]
The key point here is "Such an encoding is not conformant to UTF-8 as defined". So, your input is in fact an invalid UTF-8 sequence, and Python rejected it as such.
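If you need to read such CESU-8-style data with the standard library only, one common workaround (a sketch under that assumption, not something the explanation above mandates) is to pass the surrogates through and recombine them via a UTF-16 round trip:

def decode_cesu8_like(data):
    """Decode UTF-8 data that may contain CESU-8 style surrogate pairs."""
    # 'surrogatepass' lets the surrogate code points through as-is...
    with_surrogates = data.decode('utf-8', 'surrogatepass')
    # ...and a UTF-16 round trip merges each high+low pair into one character.
    return with_surrogates.encode('utf-16', 'surrogatepass').decode('utf-16')

sample = b'"\xc2\xb7\xed\xa0\x81\xed\xb1\x96"@en'  # shortened sample from the question
print(decode_cesu8_like(sample))  # "·𐑖"@en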