
What's the character encoding used? [closed]

Odd character codes:

ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้

Question: What's the encoding of these characters?


(Tip: Try editing this question and you'll see why they're odd, LIVE)

Yeah, that's right. You see the same thing I do.
Apparently, this came from a Mac. So, with the little knowledge of the subject I have, I fired up Notepad++ and tried to view it in hex.

The result? Try it yourself: http://notepad-plus-plus.org/

The fairly obvious reaction: what the hell? I could understand it if this were just a bunch of bits in some weird proprietary binary encoding (carrying things like color, font, etc.). But why do they show up so strangely?


Also, why does Notepad++ not show the original characters from the beginning? If you turn the hex editor on and then off again, the text seems to expand.


(Also (again), try copy-pasting the above characters twice into Notepad++. See the difference? Nothing but 0x3f and the occasional 0x20. This is also true for each individual character. As far as I know, neither a space nor a question mark looks like the above characters. But oh, I may be wrong, of course.)

Here's a snippet from Outlook:

Do you see that?!?!

EDIT: Editing these characters using UTF-8 instead of stupid ANSI actually lets you see the correct bytes.
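
For what it's worth, bytes that are all 0x3f and 0x20 fit with the text having been converted to a single-byte "ANSI" code page that has no Thai letters, so every unmappable character becomes a question mark. A minimal Python sketch, assuming the ANSI code page was Windows-1252 (an assumption; the question doesn't say which one) and using a short made-up sample rather than the full string above:

```python
# Hypothetical short sample: ko kai + two maitaikhu, a space, ko kai + two sara i.
sample = "\u0e01\u0e47\u0e47 \u0e01\u0e34\u0e34"

# Assumed "ANSI" code page: Windows-1252. It has no Thai letters, so with
# errors="replace" every Thai code point is turned into "?" (0x3f).
ansi_bytes = sample.encode("cp1252", errors="replace")

# UTF-8 can represent every code point, so nothing is lost.
utf8_bytes = sample.encode("utf-8")

print(ansi_bytes.hex(" "))  # only 3f and 20 remain
print(utf8_bytes.hex(" "))  # three bytes per Thai code point
```

Once the text has been through the lossy conversion, the original characters cannot be recovered from the 0x3f bytes.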

EDIT 2: I probably should have been more clear in what I wanted to know when I wrote the question (in my defence, I was so grossed out I just wanted to scream BRAINOVERFLOW when I saw it [the screenshot]).

EDIT 3: (copied from a Yahoo Answers reply) It appears to be a thing called "stacking diacritics", using Thai characters.

Essentially the Thai character ก "ko kai" can have any of several superscripted diacritic marks such as ็ "maitaikhu". If you follow "ko kai" with "maitaikhu", the latter appears as a superscript thus: ก็

If you put further diacritics after such a combination, they'll stack thus: ก็็็็็
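
The stacking described above can be reproduced in a few lines. This is only a sketch: the helper name `stack` is made up for illustration, and how tall the stack renders depends entirely on the font and text renderer.

```python
import unicodedata

KO_KAI = "\u0e01"     # base letter
MAITAIKHU = "\u0e47"  # non-spacing mark drawn above the base letter

def stack(base, mark, count):
    """Return the base letter followed by `count` copies of the mark."""
    return base + mark * count

stacked = stack(KO_KAI, MAITAIKHU, 5)

# Collect the code points that Unicode classifies as non-spacing marks ("Mn").
marks = [ch for ch in stacked if unicodedata.category(ch) == "Mn"]

print(stacked)       # one visible column, if the renderer stacks the marks
print(len(stacked))  # total number of code points
print(len(marks))    # how many of them are non-spacing marks
```

So what looks like a single tall glyph is one base letter plus a run of separate mark code points.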

Here are the characters that will do it: http://graphemica.com/search?q=%E0%B8%81…

Marcus Hansson asked Feb 16 '12 11:02



2 Answers

An easy lookup in the GNOME character map (charmap):

U+0E01 THAI CHARACTER KO KAI

General Character Properties

In Unicode since: 1.1
Unicode category: Letter, Other

Various Useful Representations

UTF-8: 0xE0 0xB8 0x81
UTF-16: 0x0E01

C octal escaped UTF-8: \340\270\201
XML decimal entity: &#3585;

followed by (one or more of / a variation of):

U+0E47 THAI CHARACTER MAITAIKHU

General Character Properties

In Unicode since: 1.1
Unicode category: Mark, Non-Spacing

Various Useful Representations

UTF-8: 0xE0 0xB9 0x87
UTF-16: 0x0E47

C octal escaped UTF-8: \340\271\207
XML decimal entity: &#3655;

Annotations and Cross References

Alias names:
 • mai taikhu

The second is a non-spacing mark that decorates the first character.
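
If you don't have a character map handy, the same representations can be derived with Python's standard library. A rough sketch; the function name `describe` and the dictionary keys are my own choices, not anything the character map exposes:

```python
import unicodedata

def describe(ch):
    """Return several representations of a single character."""
    utf8 = ch.encode("utf-8")
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),  # "Lo" = Letter, Other; "Mn" = Mark, Non-Spacing
        "utf8_hex": " ".join(f"0x{b:02X}" for b in utf8),
        "utf8_octal": "".join(f"\\{b:03o}" for b in utf8),
        "xml_decimal": f"&#{ord(ch)};",
    }

for ch in ("\u0e01", "\u0e47"):
    for key, value in describe(ch).items():
        print(f"{key}: {value}")
    print()
```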

guido answered Nov 24 '22 07:11


Entering those characters in the search box on Graphemica takes you to a page showing the different characters being used:

  • ก thai character ko kai (Unicode code point: U+0E01)
  • ิ thai character sara i (Unicode code point: U+0E34)
  • ็ thai character maitaikhu (Unicode code point: U+0E47)
  • ้ thai character mai tho (Unicode code point: U+0E49)
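
The same inventory can be produced offline. A small sketch, assuming a shortened sample of the text from the question; the function name `inventory` is hypothetical:

```python
import unicodedata

def inventory(text):
    """Return (code point, name) pairs for the distinct non-space characters, sorted by code point."""
    chars = sorted({ch for ch in text if not ch.isspace()})
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in chars]

# Shortened sample: ko kai followed by maitaikhu, sara i, and mai tho marks.
sample = "\u0e01\u0e47\u0e47 \u0e01\u0e34\u0e34 \u0e01\u0e49\u0e49"

for code_point, name in inventory(sample):
    print(code_point, name)
```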
Mathias Bynens answered Nov 24 '22 08:11