Unable to correctly decode/encode 𐑖𐑱𐑝𐑾𐑯 𐑨𐑤𐑓𐑩𐑚𐑧𐑑 from bytes in Python 3.7.3

I'm struggling with this:

b'"\xc2\xb7\xed\xa0\x81\xed\xb1\x96\xed\xa0\x81\xed\xb1\xb1\xed\xa0\x81\xed\xb1\x9d\xed\xa0\x81\xed\xb1\xbe\xed\xa0\x81\xed\xb1\xaf \xed\xa0\x81\xed\xb1\xa9\xed\xa0\x81\xed\xb1\xa4\xed\xa0\x81\xed\xb1\x93\xed\xa0\x81\xed\xb1\xa9\xed\xa0\x81\xed\xb1\x9a\xed\xa0\x81\xed\xb1\xa7\xed\xa0\x81\xed\xb1\x91"@en'

It comes from a binary format: the HDT-compressed version (https://github.com/rdfhdt/hdt-cpp) of DBpedia 3.5.1 (http://dbpedia.org/page/Shavian_alphabet). This website (https://mothereff.in/utf-8) decodes it as UTF-8 without complaint.

And the meaning is: "·𐑖𐑱𐑝𐑾𐑯 𐑩𐑤𐑓𐑩𐑚𐑧𐑑"@en

But in Python 3.7.3, calling mystring.decode('utf8') raises the well-known error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xed in position 3: invalid continuation byte

If I go the other way, '"·𐑖𐑱𐑝𐑾𐑯 𐑩𐑤𐑓𐑩𐑚𐑧𐑑"@en'.encode('utf8') gives the following representation:

b'"\xf0\x90\x91\x96\xf0\x90\x91\xb1\xf0\x90\x91\x9d\xf0\x90\x91\xbe\xf0\x90\x91\xaf \xf0\x90\x91\xa8\xf0\x90\x91\xa4\xf0\x90\x91\x93\xf0\x90\x91\xa9\xf0\x90\x91\x9a\xf0\x90\x91\xa7\xf0\x90\x91\x91"@en'

This is not the same byte string, but repr.decode('utf8') then decodes it correctly into the same text.

Can someone help me understand why decoding the first byte string does not work? I know from the error that it is not valid UTF-8. But then why does the website I linked decode it correctly when Python cannot? Thank you in advance!

FINAL EDIT: After accepting the answer I did some more research and found that this string was encoded with the CESU-8 codec, which is clearly deprecated today, although some software still uses it. I found a package that provides a variant of the UTF-8 codec which can decode this string. Python library: https://github.com/LuminosoInsight/python-ftfy. The added codec is 'utf-8-variants'. I hope this helps people with the same problem.


asked Oct 19 '19 14:10

Folkvir

1 Answer

It seems that Python does not want to accept some sequence of bytes as valid UTF-8, whereas some website (https://mothereff.in/utf-8) accepts it. One of them must be wrong, right? Let's see.

After the opening quote, the next two bytes (b'\xc2\xb7', the middle dot ·) are accepted by Python. The first thing Python does not like is this: \xed\xa0\x81\xed\xb1\x96, which that website interprets as 𐑖.

Let's look at \xed\xa0\x81\xed\xb1\x96 in binary format:

11101101 10100000 10000001 11101101 10110001 10010110

RFC 3629 defines the UTF-8 encoding as:

   Char. number range  |        UTF-8 octet sequence
      (hexadecimal)    |              (binary)
   --------------------+---------------------------------------------
   0000 0000-0000 007F | 0xxxxxxx
   0000 0080-0000 07FF | 110xxxxx 10xxxxxx
   0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
   0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Both groups start with 1110, so there are two three-byte sequences:

11101101 10100000 10000001 ⇒ 1101100000000001, or D801

11101101 10110001 10010110 ⇒ 1101110001010110, or DC56

Character D801 is one of the high surrogates and DC56 is one of the low surrogates.
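The bit extraction above can be reproduced in a few lines of Python. This is only a minimal sketch: the helper names decode_three_byte and surrogate_kind are made up for illustration, and the surrogate ranges used (D800 to DBFF for high, DC00 to DFFF for low) are the standard Unicode ones.

```python
def decode_three_byte(seq: bytes) -> int:
    """Return the 16-bit payload of a 1110xxxx 10xxxxxx 10xxxxxx sequence."""
    b0, b1, b2 = seq  # raises ValueError unless exactly three bytes are given
    if b0 >> 4 != 0b1110 or b1 >> 6 != 0b10 or b2 >> 6 != 0b10:
        raise ValueError("not a three-byte sequence of the expected shape")
    # Keep the 4 + 6 + 6 payload bits marked 'x' in the RFC 3629 table.
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


def surrogate_kind(unit: int) -> str:
    """Classify a 16-bit value as high surrogate, low surrogate, or neither."""
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"
    return "not a surrogate"


high = decode_three_byte(b"\xed\xa0\x81")
low = decode_three_byte(b"\xed\xb1\x96")
print(f"{high:04X}", surrogate_kind(high))  # D801 high surrogate
print(f"{low:04X}", surrogate_kind(low))    # DC56 low surrogate
```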

You can see here how to combine the surrogates:

A surrogate pair denotes the code point 0x10000 + (H − 0xD800) × 0x400 + (L − 0xDC00), where H and L are the numeric values of the high and low surrogates respectively.

If you combine them, you'll get:

0x10000 + (0xD801 - 0xD800) * 0x400 + (0xDC56 - 0xDC00) = 0x10456, which is 𐑖
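The same arithmetic can be checked in Python. The function name combine_surrogates is an illustrative assumption, not something from a library; it simply applies the surrogate-pair formula to the two values derived above.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first value is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second value is not a low surrogate")
    return 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00)


code_point = combine_surrogates(0xD801, 0xDC56)
print(hex(code_point), chr(code_point))  # 0x10456 𐑖
```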

However, the high and low surrogates were designed for UTF-16 representation of characters which do not fit into 16 bits, and this is what unicode.org says about using such surrogate pairs in UTF-8:

Q: How do I convert a UTF-16 surrogate pair such as <D800 DC00> to UTF-8? As one 4-byte sequence or as two separate 3-byte sequences?

A: The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single 4-byte sequence. However, there is a widespread practice of generating pairs of 3-byte sequences in older software, especially software which pre-dates the introduction of UTF-16 or that is interoperating with UTF-16 environments under particular constraints. Such an encoding is not conformant to UTF-8 as defined. See UTR #26: Compatability Encoding Scheme for UTF-16: 8-bit (CESU) for a formal description of such a non-UTF-8 data format. When using CESU-8, great care must be taken that data is not accidentally treated as if it was UTF-8, due to the similarity of the formats. [AF]

The key point here is "Such an encoding is not conformant to UTF-8 as defined". So, your input is in fact an invalid UTF-8 sequence, and Python rejected it as such.
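To see the difference concretely, here is a small sketch comparing the six-byte sequence from the question with what Python itself produces for U+10456. The variable names are made up for this example; the byte values come from the question.

```python
cesu_bytes = b"\xed\xa0\x81\xed\xb1\x96"  # pair of 3-byte sequences from the input

# Strict UTF-8 decoding refuses the encoded surrogates.
try:
    cesu_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError as exc:
    strict_ok = False
    print("rejected:", exc.reason)

# Conformant UTF-8 uses a single 4-byte sequence for the same character.
utf8_bytes = "\U00010456".encode("utf-8")
print(utf8_bytes)  # b'\xf0\x90\x91\x96'
```

The four-byte result matches the start of the byte string the question obtained from .encode('utf8').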

To answer the question:

https://mothereff.in/utf-8 ignores unicode.org's instruction to treat this as invalid.
Python is treating this as invalid.
If you want to decode it, even though it is invalid, you can write a function which does what I did manually.
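One possible shape for such a function is sketched below. Instead of redoing the bit arithmetic by hand, it leans on the standard library's 'surrogatepass' error handler, which is a different technique from the manual steps above: the bytes are first decoded with surrogates allowed through, then a UTF-16 round trip joins each high/low pair. The name decode_cesu8 is hypothetical, and the sample input is a shortened version of the string from the question. Note that this is lenient rather than a strict CESU-8 decoder, since it also accepts ordinary four-byte UTF-8.

```python
def decode_cesu8(data: bytes) -> str:
    """Leniently decode CESU-8 style bytes into a normal Python str."""
    # Step 1: decode as UTF-8, but let encoded surrogates (U+D800..U+DFFF)
    # through as lone surrogate code points instead of raising.
    with_surrogates = data.decode("utf-8", errors="surrogatepass")
    # Step 2: write the surrogates out as raw UTF-16 code units ...
    utf16 = with_surrogates.encode("utf-16-le", errors="surrogatepass")
    # ... and decode strictly, which merges each high/low pair into one
    # supplementary character (an unpaired surrogate raises here).
    return utf16.decode("utf-16-le")


# Shortened sample taken from the beginning of the byte string in the question.
sample = b'"\xc2\xb7\xed\xa0\x81\xed\xb1\x96\xed\xa0\x81\xed\xb1\xb1"@en'
print(decode_cesu8(sample))
```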


answered Sep 18 '22 00:09

zvone


Tags: python, utf-8, decode, cesu-8
