
What's up with these Unicode combining characters and how can we filter them?

กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้

These recently showed up in Facebook comment sections.

How can we sanitize this?

XCS asked May 02 '12 13:05



1 Answer

What's up with these Unicode characters?

That's a base character followed by a series of combining characters. Because the combining characters in question are drawn above the base character, they stack up (literally). For instance, in the case of

ก้้้้้้้้้้้้้้้้้้้้

...it's a ก (Thai character ko kai, U+0E01) followed by 20 copies of the Thai combining character mai tho (U+0E49).
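To check a breakdown like this yourself, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `describe` is made up for illustration; the sample string is rebuilt from the two code points named above.

```python
import unicodedata

def describe(text):
    """Return (code point, Unicode name, general category) for each code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in text
    ]

# Hypothetical sample: ko kai followed by 20 copies of mai tho.
sample = "\u0e01" + "\u0e49" * 20

# Print only the first two entries; the remaining 19 repeat the second one.
for code, name, category in describe(sample)[:2]:
    print(code, name, category)
# U+0E01 THAI CHARACTER KO KAI Lo
# U+0E49 THAI CHARACTER MAI THO Mn
```

The category `Mn` (nonspacing mark) is what marks mai tho as a combining character, while `Lo` is an ordinary letter.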

How can we sanitize this?

You could pre-process the text and limit the number of combining characters that can be applied to a single base character, but the effort may not be worth the reward. You'd need the Unicode character data for all current characters to know which ones are combining marks, and you'd need to allow at least a few per base character, because some languages are written with several diacritics on a single base. If you're willing to limit comments to the Latin character set, a simple range check is easier, but that's only an option if you want to support just a few languages. More information, code charts, etc. are at unicode.org.
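As a rough sketch of both ideas, assuming Python and its standard `unicodedata` module (which ships the character data mentioned above): the function names, the default of 3 marks, and the U+024F cutoff are all arbitrary choices for illustration, not a recommendation. The first function treats the general categories `Mn` and `Me` as "combining", which is one possible definition.

```python
import unicodedata

def limit_combining(text, max_marks=3):
    """Keep at most max_marks consecutive combining marks; drop any beyond that.

    Assumption: "combining" means general category Mn (nonspacing mark)
    or Me (enclosing mark). max_marks=3 is an arbitrary threshold.
    """
    out = []
    run = 0  # combining marks seen since the last non-combining character
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            run += 1
            if run > max_marks:
                continue  # skip the excess mark
        else:
            run = 0
        out.append(ch)
    return "".join(out)

def is_within_latin_range(text, upper=0x024F):
    """Crude range check: every code point is at or below `upper`.

    Assumption: U+024F (end of the Latin Extended-B block) is the cutoff.
    This also rejects things like curly quotes and emoji.
    """
    return all(ord(ch) <= upper for ch in text)

stacked = "\u0e01" + "\u0e49" * 20
print(len(limit_combining(stacked)))      # 4: the base plus 3 marks
print(is_within_latin_range("caf\u00e9")) # True
print(is_within_latin_range(stacked))     # False
```

Text with a normal number of diacritics passes through `limit_combining` unchanged; only the excess marks in a long run are dropped.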

BTW, if you ever want to know how some text was composed: for another question recently I coded up a quick-and-dirty "Unicode Show Me" page on JSBin. You copy and paste the text into the text area, and it shows you all of the code points (~characters) the text is made up of, with links such as those above to the page describing each one. It only works for code points U+FFFF and below, because it's written in JavaScript, where a "character" is always a 16-bit UTF-16 code unit. A code point above U+FFFF is split across two JavaScript "characters" (a surrogate pair), and handling that was more work than I wanted to do for that question. It's handy for most texts, though.
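To see the 16-bit issue concretely, here is a small sketch in Python (not the JSBin page's code). It compares a string's code points with its UTF-16 code units, which is what JavaScript strings expose; the helper names `code_points` and `utf16_units` are made up for this example.

```python
def code_points(text):
    """List the code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

def utf16_units(text):
    """List the 16-bit UTF-16 code units of text as hex strings."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [f"{int.from_bytes(data[i:i + 2], 'big'):04X}" for i in range(0, len(data), 2)]

s = "a\U0001F600"  # 'a' followed by one code point above U+FFFF

print(code_points(s))  # ['U+0061', 'U+1F600']
print(utf16_units(s))  # ['0061', 'D83D', 'DE00']
```

The second code point becomes two code units (a surrogate pair), so a tool that walks the string one 16-bit unit at a time reports two "characters" for it.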

T.J. Crowder answered Oct 14 '22 15:10