What's the character encoding used? [closed]

Tags: character-encoding, byte

Odd character codes:

ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้

Question: What's the encoding of these characters?

(Tip: Try editing this question and you'll see why they're odd, LIVE)

Yeah, that's right. You see the same thing I do.
Apparently, this came from a Mac. So, with the little knowledge of the subject I have, I fired up Notepad++ and tried to view it in hex.

The result? Try it yourself: http://notepad-plus-plus.org/

Fairly obvious: what the hell? I could understand it if it were Just a Bunch of Bits in some weird proprietary binary encoding (containing stuff like color, font, etc.). But why do they show up so strangely?

Also, why does Notepad++ not show the original characters from the beginning? If you turn on the hex editor and then turn it off, it's like the text expands.

(Also, again: try copy-pasting the above characters twice into Notepad++. See the difference? Nothing but 0x3f and the occasional 0x20. This is also true for each individual character. As far as I know, neither a space nor a question mark looks like the above characters. But oh, I may be wrong of course.)

Here's a snippet from Outlook:

[Image: https://i.stack.imgur.com/l0Osy.jpg, "Do you see that?!?!"]

EDIT: Editing these characters using UTF-8 instead of stupid ANSI actually lets you see the correct bytes.
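A rough sketch of what probably happens with the two encodings, in Python. The question does not say which "ANSI" code page was active, so Windows-1252 (`cp1252`) is an assumption here, and `hex_dump` is just a helper made up for the illustration. A single-byte Western code page has no slot for Thai, so every Thai code point collapses to `?` (0x3f) and only the spaces (0x20) survive, while UTF-8 keeps three bytes per Thai code point.

```python
# KO KAI + MAITAIKHU, a space, then KO KAI + SARA I (written as escapes
# so the example does not depend on the source file's own encoding).
text = "\u0e01\u0e47 \u0e01\u0e34"


def hex_dump(data: bytes) -> str:
    """Hypothetical helper: format bytes as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in data)


# Assumption: "ANSI" means Windows-1252. Characters that the code page
# cannot represent are replaced with "?" instead of raising an error.
ansi = text.encode("cp1252", errors="replace")
utf8 = text.encode("utf-8")

print(hex_dump(ansi))  # 3f 3f 20 3f 3f
print(hex_dump(utf8))  # e0 b8 81 e0 b9 87 20 e0 b8 81 e0 b8 b4

# The ANSI bytes cannot be turned back into the original text; UTF-8 can.
print(ansi.decode("cp1252") == text)  # False
print(utf8.decode("utf-8") == text)   # True
```

That would explain seeing nothing but 0x3f and 0x20 in the hex view: by then the original characters are already gone.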

EDIT 2: I probably should have been more clear in what I wanted to know when I wrote the question (in my defence, I was so grossed out I just wanted to scream BRAINOVERFLOW when I saw it [the screenshot]).

EDIT 3: (copied from a Yahoo answer) It appears to be a thing called "stacking diacritics" using Thai characters.

Essentially the Thai character ก "ko kai" can have any of several superscripted diacritic marks such as ็ "maitaikhu". If you follow "ko kai" with "maitaikhu", the latter appears as a superscript thus: ก็

If you put further diacritics after such a combination, they'll stack thus: ก็็็็็

Here are the characters that will do it: http://graphemica.com/search?q=%E0%B8%81…
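The stacking can be reproduced with a few lines of Python. This is only a sketch: the variable names are made up for the example, and how tall the stack actually renders depends on the font and text renderer. It builds one base letter followed by five marks and shows that the result is six code points, of which only the first is a letter.

```python
import unicodedata

KO_KAI = "\u0e01"     # base consonant
MAITAIKHU = "\u0e47"  # non-spacing mark drawn above the base

# One base letter followed by five copies of the same mark.
stacked = KO_KAI + MAITAIKHU * 5

print(len(stacked))  # 6 code points in total

# General category "Lo" is a letter, "Mn" is a non-spacing mark.
for ch in stacked:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))

# Count how many marks are piled onto the single base letter.
marks = [ch for ch in stacked if unicodedata.category(ch) == "Mn"]
print(len(marks))  # 5
```

Because the marks take up no horizontal space of their own, a renderer that honors them has nowhere to put them but on top of each other.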

asked Feb 16 '12 11:02

Marcus Hansson

2 Answers

An easy search in GNOME charmap:

U+0E01 THAI CHARACTER KO KAI

General Character Properties

In Unicode since: 1.1
Unicode category: Letter, Other

Various Useful Representations

UTF-8: 0xE0 0xB8 0x81
UTF-16: 0x0E01

C octal escaped UTF-8: \340\270\201
XML decimal entity: &#3585;

followed by (one or more of / a variation of):

U+0E47 THAI CHARACTER MAITAIKHU

General Character Properties

In Unicode since: 1.1
Unicode category: Mark, Non-Spacing

Various Useful Representations

UTF-8: 0xE0 0xB9 0x87
UTF-16: 0x0E47

C octal escaped UTF-8: \340\271\207
XML decimal entity: &#3655;

Annotations and Cross References

Alias names:
 • mai taikhu

The second is a non-spacing mark that decorates the first character.
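If charmap is not at hand, the same fields can be derived from the code point with Python's standard library. This is a minimal sketch: `representations` is a hypothetical helper, not part of any tool, and the output format only imitates the listing above.

```python
import unicodedata


def representations(ch: str) -> dict:
    """Hypothetical helper: derive charmap-style fields for one character."""
    utf8 = ch.encode("utf-8")
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),  # "Lo" letter, "Mn" mark
        "utf8": " ".join(f"0x{b:02X}" for b in utf8),
        "utf16": "0x" + ch.encode("utf-16-be").hex().upper(),
        "c_octal": "".join(f"\\{b:03o}" for b in utf8),
        "xml_decimal": f"&#{ord(ch)};",
    }


for ch in ("\u0e01", "\u0e47"):
    for field, value in representations(ch).items():
        print(f"{field}: {value}")
    print()
```

The category field is the interesting one: `Lo` for the consonant and `Mn` for the mark, which matches "Letter, Other" and "Mark, Non-Spacing" in the charmap output.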

answered Nov 24 '22 07:11

guido

Entering those characters in the search box on Graphemica takes you to a page showing the different characters being used:

ก Thai character ko kai (Unicode code point: U+0E01)
ิ Thai character sara i (Unicode code point: U+0E34)
็ Thai character maitaikhu (Unicode code point: U+0E47)
้ Thai character mai tho (Unicode code point: U+0E49)

answered Nov 24 '22 08:11

Mathias Bynens
