
Unable to correctly decode/encode 𐑖𐑱𐑝𐑾𐑯 𐑨𐑤𐑓𐑩𐑚𐑧𐑑 from bytes in Python 3.7.3

I'm struggling with this:

b'"\xc2\xb7\xed\xa0\x81\xed\xb1\x96\xed\xa0\x81\xed\xb1\xb1\xed\xa0\x81\xed\xb1\x9d\xed\xa0\x81\xed\xb1\xbe\xed\xa0\x81\xed\xb1\xaf \xed\xa0\x81\xed\xb1\xa9\xed\xa0\x81\xed\xb1\xa4\xed\xa0\x81\xed\xb1\x93\xed\xa0\x81\xed\xb1\xa9\xed\xa0\x81\xed\xb1\x9a\xed\xa0\x81\xed\xb1\xa7\xed\xa0\x81\xed\xb1\x91"@en'

It comes from the binary HDT-compressed version (https://github.com/rdfhdt/hdt-cpp) of DBpedia 3.5.1 (http://dbpedia.org/page/Shavian_alphabet), and this website (https://mothereff.in/utf-8) decodes it as UTF-8 without complaint.

And the meaning is: "·𐑖𐑱𐑝𐑾𐑯 𐑩𐑤𐑓𐑩𐑚𐑧𐑑"@en

But in Python 3.7.3, calling mystring.decode('utf8') raises the well-known error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 3: invalid continuation byte

If I go the other way, '"·𐑖𐑱𐑝𐑾𐑯 𐑩𐑤𐑓𐑩𐑚𐑧𐑑"@en'.encode('utf8'), I get the following representation: b'"\xf0\x90\x91\x96\xf0\x90\x91\xb1\xf0\x90\x91\x9d\xf0\x90\x91\xbe\xf0\x90\x91\xaf \xf0\x90\x91\xa8\xf0\x90\x91\xa4\xf0\x90\x91\x93\xf0\x90\x91\xa9\xf0\x90\x91\x9a\xf0\x90\x91\xa7\xf0\x90\x91\x91"@en' which is not the same byte string, yet repr.decode('utf8') correctly decodes it back into the same text.

Can someone help me understand why decoding the first byte string does not work? I know from the error that it is not valid UTF-8. But then why does the website I linked decode it correctly when Python can't? Thank you in advance!


FINAL EDIT After accepting the answer I did some further research and found that this string was encoded with the CESU-8 codec, which is clearly deprecated today, although some software still uses it. I also found a package that adds a variant of the utf-8 codec which can decode this string. Python library: https://github.com/LuminosoInsight/python-ftfy The added codec is 'utf-8-variants'. I hope this helps other people with the same problem.

Folkvir asked Oct 19 '19 14:10



1 Answer

It seems that Python does not want to accept some sequence of bytes as valid UTF-8, whereas some website (https://mothereff.in/utf-8) accepts it. One of them must be wrong, right? Let's see.

The first two bytes (b'\xc2\xb7') are accepted by Python. The first thing which Python does not like is this: \xed\xa0\x81\xed\xb1\x96, which is interpreted on that website as 𐑖.

Let's look at \xed\xa0\x81\xed\xb1\x96 in binary format:

11101101
10100000
10000001
11101101
10110001
10010110
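As a small aside, the bit patterns above can be reproduced in Python. This is only an illustrative sketch; the names `raw` and `bits` are made up for this example.

```python
# The six bytes that Python rejects, copied from the question's input.
raw = b"\xed\xa0\x81\xed\xb1\x96"

# Iterating over a bytes object yields integers; format each as 8 binary digits.
bits = [format(byte, "08b") for byte in raw]
print("\n".join(bits))
```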

RFC3629 says that UTF-8 is interpreted as:

   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Therefore, there are two three-byte characters:

11101101 10100000 10000001 ⇒ 1101100000000001, or D801

11101101 10110001 10010110 ⇒ 1101110001010110, or DC56
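The same bit extraction can be sketched in Python by masking off the marker bits of the 1110xxxx 10xxxxxx 10xxxxxx pattern from the table and shifting the payload bits together. The helper name `decode_three_byte` is hypothetical, chosen only for this illustration.

```python
def decode_three_byte(seq: bytes) -> int:
    """Return the 16-bit value carried by a 1110xxxx 10xxxxxx 10xxxxxx sequence."""
    if len(seq) != 3:
        raise ValueError("expected exactly three bytes")
    b0, b1, b2 = seq
    # Check the marker bits: 1110 on the lead byte, 10 on both continuation bytes.
    if (b0 & 0xF0) != 0xE0 or (b1 & 0xC0) != 0x80 or (b2 & 0xC0) != 0x80:
        raise ValueError("not a three-byte sequence")
    # Keep 4 + 6 + 6 payload bits and concatenate them.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


high = decode_three_byte(b"\xed\xa0\x81")
low = decode_three_byte(b"\xed\xb1\x96")
print(hex(high), hex(low))
```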

Character D801 is one of the high surrogates and DC56 is one of the low surrogates.

You can see here how to combine the surrogates:

A surrogate pair denotes the code point 0x10000 + (H − 0xD800) × 0x400 + (L − 0xDC00) where H and L are the numeric values of the high and low surrogates respectively.

If you combine them, you'll get:

0x10000 + (0xD801 - 0xD800) * 0x400 + (0xDC56 - 0xDC00) = 0x10456, which is 𐑖
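The arithmetic above can be checked with a few lines of Python. This is a minimal sketch of the surrogate formula; `combine_surrogates` is a hypothetical helper name, not a standard library function.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into a single code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("not a low surrogate")
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)


code_point = combine_surrogates(0xD801, 0xDC56)
print(hex(code_point), chr(code_point))
```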

However, the high and low surrogates were designed for UTF-16 representation of characters which do not fit into 16 bits, and this is what unicode.org says about using such surrogate pairs in UTF-8:

Q: How do I convert a UTF-16 surrogate pair such as <D800 DC00> to UTF-8? As one 4-byte sequence or as two separate 3-byte sequences?

A: The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single 4-byte sequence. However, there is a widespread practice of generating pairs of 3-byte sequences in older software, especially software which pre-dates the introduction of UTF-16 or that is interoperating with UTF-16 environments under particular constraints. Such an encoding is not conformant to UTF-8 as defined. See UTR #26: Compatability Encoding Scheme for UTF-16: 8-bit (CESU) for a formal description of such a non-UTF-8 data format. When using CESU-8, great care must be taken that data is not accidentally treated as if it was UTF-8, due to the similarity of the formats. [AF]

The key point here is "Such an encoding is not conformant to UTF-8 as defined". So, your input is in fact an invalid UTF-8 sequence, and Python rejected it as such.
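To make the difference concrete, the sketch below compares the six-byte form from the question with the four-byte form that conformant UTF-8 uses for the same character (U+10456). The variable names are illustrative only.

```python
# Two three-byte sequences (one per surrogate), as found in the question's input.
cesu8 = b"\xed\xa0\x81\xed\xb1\x96"

# Conformant UTF-8 encodes the same character as one four-byte sequence.
utf8 = "\U00010456".encode("utf-8")
print(utf8)

# Python's strict UTF-8 decoder refuses the surrogate-pair form.
try:
    cesu8.decode("utf-8")
    rejected = False
except UnicodeDecodeError as exc:
    rejected = True
    print(exc)
```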

To answer the question:

  • https://mothereff.in/utf-8 is ignoring unicode.org's instruction to treat this as invalid.
  • Python is treating this as invalid.
  • If you want to decode it, even though it is invalid, you can write a function which does what I did manually.
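One possible way to do the last point with only the standard library, instead of hand-decoding each byte, is sketched below. It relies on the `surrogatepass` error handler to let the three-byte surrogate sequences through, then round-trips via UTF-16 so the surrogates get paired up. The function name `decode_cesu8` is hypothetical, and this is not a strict CESU-8 validator: it also accepts ordinary four-byte UTF-8 sequences.

```python
def decode_cesu8(data: bytes) -> str:
    """Decode UTF-8-like bytes in which supplementary characters are
    stored as two three-byte surrogate sequences (CESU-8 style)."""
    # Step 1: decode, keeping each encoded surrogate as a lone surrogate code point.
    with_surrogates = data.decode("utf-8", errors="surrogatepass")
    # Step 2: write the surrogates out as raw UTF-16 code units, then let the
    # UTF-16 decoder combine each high/low pair into one character.
    as_utf16 = with_surrogates.encode("utf-16-le", errors="surrogatepass")
    return as_utf16.decode("utf-16-le")


# The byte string from the question.
raw = (
    b'"\xc2\xb7\xed\xa0\x81\xed\xb1\x96\xed\xa0\x81\xed\xb1\xb1'
    b'\xed\xa0\x81\xed\xb1\x9d\xed\xa0\x81\xed\xb1\xbe\xed\xa0\x81\xed\xb1\xaf '
    b'\xed\xa0\x81\xed\xb1\xa9\xed\xa0\x81\xed\xb1\xa4\xed\xa0\x81\xed\xb1\x93'
    b'\xed\xa0\x81\xed\xb1\xa9\xed\xa0\x81\xed\xb1\x9a\xed\xa0\x81\xed\xb1\xa7'
    b'\xed\xa0\x81\xed\xb1\x91"@en'
)
text = decode_cesu8(raw)
print(text)
```

An unpaired surrogate in the input would make the final UTF-16 decoding step raise UnicodeDecodeError, which is arguably the behaviour you want for malformed data.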
zvone answered Sep 18 '22 00:09