What string of characters should a source send to disambiguate the byte-encoding they are using?

I'm decoding bytestreams into unicode characters without knowing the encoding that's been used by each of a hundred or so senders.

Many of the senders are not technically astute, and will not be able to tell me what encoding they are using. It will be determined by the happenstance of the toolchains they are using to generate the data.

The senders are, for the moment, all UK/English based, using a variety of operating systems.

Can I ask all the senders to send me a particular string of characters that will unambiguously demonstrate what encoding each sender is using?

I understand that there are libraries that use heuristics to guess at the encoding - I'm going to chase that up too, as a runtime fallback, but first I'd like to try and determine what encodings are being used, if I can.

(Don't think it's relevant, but I'm working in Python)

Jonathan Hartley asked Oct 06 '22 21:10


1 Answer

A full answer to this question depends on several factors: the range of encodings used by the various upstream systems, how well your users will comply with instructions to type magic character sequences into text fields, and how adept they are at the obscure keyboard combinations needed to type those sequences.

There are some very easy character sequences which only some users will be able to type. Only users with a Cyrillic keyboard and encoding will find it easy to type "Ильи́ч" (Ilyich), and so you only have to distinguish between the Cyrillic-capable encodings like UTF-8, UTF-16, iso8859_5, and koi8_r. Similarly, you could come up with Japanese, Chinese, and Korean character sequences which distinguish between users of Japanese, simplified Chinese, traditional Chinese, and Korean systems.

But let's concentrate on users of western European computer systems, and the common encodings like ISO-8859-15, Mac_Roman, UTF-8, UTF-16LE, and UTF-16BE. A very simple test is to have users enter the Euro character '€', U+20AC, and see what byte sequence gets generated:

  • byte ['\xa4'] means iso-8859-15 encoding
  • bytes ['\xe2', '\x82', '\xac'] mean utf-8 encoding
  • bytes ['\x20', '\xac'] mean utf-16be encoding
  • bytes ['\xac', '\x20'] mean utf-16le encoding
  • byte ['\x80'] means cp1252 ("Windows ANSI") encoding
  • byte ['\xdb'] means macroman encoding
  • iso-8859-1 won't be able to represent the Euro character at all. iso-8859-15 is the Euro-supporting successor to iso-8859-1.
  • U.S. users probably won't know how to type a Euro character. (OK, that's too snarky. 3% of them will know.)
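
The list above can be turned into a lookup table. Here is a minimal sketch, assuming Python 3; the names EURO, CANDIDATES, SIGNATURES and guess_encoding are made up for illustration, and the table is built by asking Python's own codecs how they encode the Euro sign rather than by typing the bytes in by hand:

```python
# Illustrative names only; none of these come from a library.
EURO = "\u20ac"

# Candidate encodings from the list above (iso-8859-1 is left out
# because it cannot encode the Euro sign at all).
CANDIDATES = ["utf-8", "utf-16be", "utf-16le",
              "iso-8859-15", "cp1252", "macroman"]

# Map each encoding's byte sequence for the Euro sign back to its name.
SIGNATURES = {EURO.encode(name): name for name in CANDIDATES}


def guess_encoding(data: bytes):
    """Return the candidate whose Euro byte sequence starts `data`, else None."""
    # Longest signatures first, so a multi-byte form is tried before
    # any single-byte one.
    for signature in sorted(SIGNATURES, key=len, reverse=True):
        if data.startswith(signature):
            return SIGNATURES[signature]
    return None
```

For example, guess_encoding(b'\xa4hello') returns 'iso-8859-15', and input that starts with none of the signatures returns None. This only tells you which candidate matches; it does not prove the sender really typed a Euro sign.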

You should check that each of these byte sequences, interpreted as any of the other possible encodings, is not a character sequence that users would likely type themselves. For instance, the '\xa4' of the iso-8859-15 Euro symbol could also be the iso-8859-1 or cp1252 encoding of '¤', the first byte of the UTF-16le encoding of '¤', the macroman encoding of '§', or the first byte of any of thousands of UTF-16 characters, such as the U+A4xx Yi Syllables (UTF-16be) or U+01A4 LATIN CAPITAL LETTER P WITH HOOK (UTF-16le). It would not be a valid first byte of a UTF-8 sequence. If some of your users submit text in Yi, you might have a problem.
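
One way to run that check is to decode each Euro byte sequence under every other candidate encoding and look at what comes out. This is a rough sketch assuming Python 3; ENCODINGS and cross_decode are illustrative names, and the encoding list is just the one used in this answer:

```python
EURO = "\u20ac"
ENCODINGS = ["iso-8859-1", "iso-8859-15", "utf-8", "utf-16be",
             "utf-16le", "cp1252", "macroman"]


def cross_decode(signature: bytes):
    """Decode one byte sequence under every candidate encoding.

    Returns a dict of encoding name -> decoded text, with None where the
    bytes are not valid in that encoding.
    """
    results = {}
    for name in ENCODINGS:
        try:
            results[name] = signature.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


for source in ENCODINGS:
    try:
        signature = EURO.encode(source)
    except UnicodeEncodeError:
        continue  # iso-8859-1 cannot encode the Euro sign
    print(source, signature, cross_decode(signature))
```

Any non-None entry other than the source encoding's own is a possible false positive, so it is worth asking whether a sender could plausibly start a field with that text.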

The Python 3.x documentation, section "7.2.3. Standard Encodings", lists the character encodings which the Python standard library can easily handle. The following program lets you see how a test character sequence is encoded into bytes by various encodings:

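>>> euro = '\u20ac'  # the Euro sign; the test character sequence
>>> for e in ['iso-8859-1', 'iso-8859-15', 'utf-8', 'utf-16be', 'utf-16le',
...           'cp1252', 'macroman']:
...     # 'backslashreplace' avoids an exception where the codec has no Euro sign
...     print(e, euro.encode(e, 'backslashreplace'))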

So, as an expedient, satisficing hack, consider telling your users to type a '€' as the first character of a text field, if there are any problems with encoding. Then your system should interpret any of the above byte sequences at the start of a field as an encoding clue, and discard it. If users want to start their text content with a Euro character, they start the field with '€€'; the first gets swallowed, the second remains part of the text.
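
A minimal sketch of that hack, assuming Python 3. The function name decode_field, the candidate list, and the choice of UTF-8 as the fallback for unmarked fields are all assumptions made for illustration, not something the senders' systems dictate:

```python
EURO = "\u20ac"

# Assumed candidate set; extend it to match the senders you actually see.
CANDIDATES = ["utf-8", "utf-16be", "utf-16le",
              "iso-8859-15", "cp1252", "macroman"]


def decode_field(data: bytes, default: str = "utf-8") -> str:
    """Decode `data`, using a leading Euro marker as the encoding clue.

    The marker is swallowed. Fields without a marker are decoded with
    `default` (an assumed fallback), replacing undecodable bytes.
    """
    for name in CANDIDATES:
        marker = EURO.encode(name)
        if data.startswith(marker):
            # Drop the marker bytes and decode the rest with the same codec.
            return data[len(marker):].decode(name, errors="replace")
    return data.decode(default, errors="replace")
```

With this, a field sent as '€€5' in iso-8859-15 decodes to '€5', and '€hello' in cp1252 decodes to 'hello'. In practice the unmarked branch is where a heuristic detection library would slot in as the runtime fallback.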

Jim DeLaHunt answered Oct 10 '22 04:10