ด้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้дด็็็็็้้้้้็็็็้้้้้็็็็็้้้้้็็็็็้้้้้็็็็็้้้้้
I found some interesting characters, pasted above, which take up only about 3 spaces of width. However, the actual length of the string is 380.
I inspected the string in Python, and its encoding is as follows:
'\xe0\xb8\x94\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xd0\xb4\xe0\xb8\x94\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x87\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89\xe0\xb9\x89'
It seems that the string is a combination of three Thai characters:
ด \xe0\xb8\x94 THAI CHARACTER DO DEK
้ \xe0\xb9\x89 THAI CHARACTER MAI THO
็ \xe0\xb9\x87 THAI CHARACTER MAITAIKHU
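A minimal Python 3 sketch of the inspection described above. It uses a shortened sample rather than the full pasted string; the sample and the variable names are assumptions for illustration only. It decodes the UTF-8 bytes, compares the byte count with the code point count, and looks up the Unicode name of each distinct character:

```python
import unicodedata

# Shortened sample (an assumption for illustration): DO DEK followed by
# five MAI THO and five MAITAIKHU, given as UTF-8 bytes.
raw = b"\xe0\xb8\x94" + b"\xe0\xb9\x89" * 5 + b"\xe0\xb9\x87" * 5
text = raw.decode("utf-8")

byte_length = len(raw)         # number of UTF-8 bytes
code_point_length = len(text)  # number of Unicode code points

# Distinct characters in order of first appearance, with their Unicode names.
names = {}
for ch in text:
    names.setdefault(ch, unicodedata.name(ch))

print(byte_length, code_point_length)
for ch, name in names.items():
    print("U+%04X" % ord(ch), name)
```

The byte count is three times the code point count here because each of these Thai characters takes 3 bytes in UTF-8, which is one reason a string that looks tiny on screen can report a large length.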
And my questions are:
UPDATE
I've tested the characters in more browsers, and the long tail only appears in Chrome and Firefox on Windows.
The following are the screenshots I've taken:
win 7 ie8
ubuntu firefox
win 7 chrome
win 7 firefox
Therefore, I guess it is a browser related bug.
The Thai script (Thai: อักษรไทย, RTGS: akson thai) is the abugida used to write Thai, Southern Thai and many other languages spoken in Thailand.
Thai is a Unicode block containing characters for the Thai, Lanna Tai, and Pali languages. It is based on the Thai Industrial Standard 620-2533.
Thai is written across the page from left to right, with no spaces between words.
In total, there are 72 characters in the Thai alphabet. This can be split up into 44 consonants and 28 vowels.
There are two problems: one in the output system (the font renderer), which is not Thai-aware, and one in the input system that generated this text in the first place.
If you had done your homework, you would know that MAI THO and MAITAIKHU (their Unicode names) are what Unicode calls nonspacing marks (NSM). This means that the font renderer should not move to the next character cell when displaying such a glyph.
In order to avoid the mess you see above, the Thai API Consortium (TAPIC) made the WTT 2.0 standard that describes both how the font rendering algorithm should handle Thai letter order when it receives it as input and also how the input method should allow such characters to be input if you attempt to type them.
Standardization and Implementations of Thai Language Overview
libthai includes both input and output methods.
thaicheck is a small program that can detect letter sequence problems and fix them.
By the way, you cannot have a sequence (word) of do dek, mai tho and maitaikhu; the input sequence is noise.
Bear in mind that some editors have broken input methods that allow typing multiple NSM that cannot be combined but the output method will render only legal sequences; the result being an illegal input string that looks OK to the user on his system.
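As a rough illustration of the kind of cleanup discussed above, here is a minimal Python sketch that caps the number of consecutive nonspacing marks. It is not the WTT 2.0 algorithm and not how libthai or thaicheck work; the function name `limit_marks` and the cap `max_marks` are hypothetical, and a real implementation would check which marks may legally follow which letters.

```python
import unicodedata

def limit_marks(text, max_marks=2):
    """Drop nonspacing marks (category Mn) beyond max_marks in a row."""
    out = []
    run = 0  # nonspacing marks seen since the last non-mark character
    for ch in text:
        if unicodedata.category(ch) == "Mn":
            run += 1
            if run > max_marks:
                continue  # skip excess marks that would otherwise stack up
        else:
            run = 0
        out.append(ch)
    return "".join(out)

# Noisy sample (an assumption): DO DEK with ten stacked marks.
noisy = "\u0e14" + "\u0e49" * 5 + "\u0e47" * 5
cleaned = limit_marks(noisy)
print(len(noisy), len(cleaned))
```

A cap of two is only a guess at a reasonable default, since a Thai consonant can legitimately carry more than one mark; the point is that a filter of this sort removes the long "tail" regardless of how the renderer behaves.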
The codes you mention are all in UTF-8, which is why each character needs 3 bytes. The respective Unicode code points are:
DO DEK 0x0e14
MAI THO 0x0e49
MAITAIKHU 0x0e47
The latter two are in the category Mark, Nonspacing (Mn), meaning that they are combined with the preceding code point in rendering. MAI THO additionally has the Canonical_Combining_Class property set to 107.
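These properties can be checked with Python's standard `unicodedata` module. The sketch below simply reports what the interpreter's bundled Unicode database says for the three code points; the `rows` list is a name introduced here for illustration.

```python
import unicodedata

rows = []
for cp in (0x0E14, 0x0E49, 0x0E47):
    ch = chr(cp)
    rows.append((
        "U+%04X" % cp,
        unicodedata.name(ch),
        unicodedata.category(ch),    # general category, e.g. Lo or Mn
        unicodedata.combining(ch),   # Canonical_Combining_Class
        len(ch.encode("utf-8")),     # bytes needed in UTF-8
    ))

for row in rows:
    print(row)
```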
Your example starts with a single character and adds lots of nonspacing marks on top of it.
Compare with this C# code:
char DODEK = (char)0x0e14;
char MAITHO = (char)0x0e49;
char MAITAIKHU = (char)0x0e47;
string thai = new string(new char[] { DODEK, MAITHO, MAITAIKHU });
Console.WriteLine("number of code points: " + thai.Length);
var si = new System.Globalization.StringInfo(thai);
Console.WriteLine("number of text elements: " + si.LengthInTextElements);
Output:
number of code points: 3
number of text elements: 1
See also .Net StringInfo class.
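For readers following along in Python, a rough counterpart can be sketched with the standard library. Python has no built-in equivalent of StringInfo, so the hypothetical helper `count_text_elements` below only approximates text elements by counting characters that are not combining marks (general category starting with "M"); it does not implement full Unicode grapheme cluster segmentation.

```python
import unicodedata

def count_text_elements(text):
    """Approximate count: each non-mark character starts a new element."""
    return sum(
        1 for ch in text
        if not unicodedata.category(ch).startswith("M")
    )

# Same three code points as the C# example: DO DEK, MAI THO, MAITAIKHU.
thai = "\u0e14\u0e49\u0e47"
print("number of code points:", len(thai))
print("number of text elements (approx.):", count_text_elements(thai))
```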