What's up with these Unicode combining characters and how can we filter them?

กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้ ก็็็็็็็็็็็็็็็็็็็็ กิิิิิิิิิิิิิิิิิิิิ ก้้้้้้้้้้้้้้้้้้้้

These recently showed up in Facebook comment sections.

How can we sanitize this?

541

asked May 02 '12 13:05

XCS

1 Answer

What's up with these Unicode characters?

That's a base character followed by a series of combining characters. Because the combining characters in question want to go above the base character, they stack up (literally). For instance, in the case of

ก้้้้้้้้้้้้้้้้้้้้

...it's ก (Thai character ko kai, U+0E01) followed by 20 copies of the Thai combining character mai tho (U+0E49).
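To check a breakdown like this yourself, here is a minimal sketch in Python using the standard `unicodedata` module. The helper name `describe` is made up for illustration, and the sample string is built explicitly from the two code points named above rather than pasted in.

```python
import unicodedata

def describe(text):
    """Return one (code point, name, general category) tuple per code point."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in text
    ]

# ko kai followed by 20 copies of mai tho, built explicitly
sample = "\u0e01" + "\u0e49" * 20

# Print only the first few rows; the remaining ones repeat mai tho.
for code_point, name, category in describe(sample)[:3]:
    print(code_point, name, category)
```

A general category starting with `M` (here `Mn`, nonspacing mark) is what marks a code point as a combining mark; the base letter reports `Lo`.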

How can we sanitize this?

You could pre-process the text and limit the number of combining characters that can be applied to a single base character, but the effort may not be worth the reward. You'd need the Unicode character data for all current characters so you'd know which ones are combining marks, and you'd need to be sure to allow at least a few, because some languages are written with several diacritics on a single base. Now, if you want to limit comments to the Latin character set, that would be an easier range check, but of course that's only an option if you're willing to limit comments to just a few languages. More information, code charts, etc. are available at unicode.org.
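The pre-processing idea could look roughly like the following sketch, which is one possible approach rather than a recommended filter. It assumes Python, whose `unicodedata` module ships the character data, and the names `limit_combining_marks` and `MAX_MARKS` as well as the cap of 4 are arbitrary choices for illustration.

```python
import unicodedata

MAX_MARKS = 4  # hypothetical cap; pick a value that suits the languages you support

def limit_combining_marks(text, max_marks=MAX_MARKS):
    """Drop combining marks beyond max_marks in any consecutive run of marks."""
    out = []
    run = 0  # number of consecutive combining marks seen so far
    for ch in text:
        # General categories Mn, Mc and Me are the combining marks.
        if unicodedata.category(ch).startswith("M"):
            run += 1
            if run > max_marks:
                continue  # skip the excess mark
        else:
            run = 0  # a non-mark character starts a new run
        out.append(ch)
    return "".join(out)

stacked = "\u0e01" + "\u0e49" * 20  # ko kai with 20 copies of mai tho
cleaned = limit_combining_marks(stacked)
print(len(stacked), len(cleaned))
```

The check uses the general category rather than `unicodedata.combining()`, because the canonical combining class is 0 for some marks (several Thai vowel signs among them, as far as I know), so a test on the combining class alone would let those through. Text with ordinary diacritics, such as `"e\u0301"`, passes through unchanged as long as it stays under the cap.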

BTW, if you ever want to know how some character was composed: for another question just recently, I coded up a quick-and-dirty "Unicode Show Me" page on JSBin. You copy and paste the text into the text area, and it shows you all of the code points (~characters) that the text is made up of, with links to the page describing each character. It only works for code points U+FFFF and under, because it's written in JavaScript, and handling characters above U+FFFF in JavaScript takes more work than I wanted to do for that question. (In JavaScript, a "character" is always a 16-bit code unit, which means that for some languages a single character is split across two separate JavaScript "characters", and I didn't account for that.) But it's handy for most texts...
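The 16-bit limitation can be seen without JavaScript. The sketch below, in Python, encodes a single code point as UTF-16 and lists the resulting 16-bit code units, which is the same unit JavaScript strings use. The helper name `utf16_code_units` is made up for illustration.

```python
def utf16_code_units(ch):
    """Return the UTF-16 code units (as ints) used to encode a single code point."""
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# One code point at or below U+FFFF, and one above it.
for ch in ("\u0e01", "\U0001F600"):
    units = utf16_code_units(ch)
    print(f"U+{ord(ch):04X}", [f"0x{u:04X}" for u in units])
```

U+0E01 fits in one code unit, while U+1F600 needs a surrogate pair (0xD83D, 0xDE00), which is why a tool that walks a JavaScript string one "character" at a time sees two entries for it.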

answered Oct 14 '22 15:10

T.J. Crowder

Tags: unicode, zalgo, sanitize, combining-marks
