How to protect against diacritics such as Zalgo text


[Image: the character in question]

The character pictured above was tweeted a few months ago by Mikko Hyppönen, a computer security expert known for his work on computer viruses and his TED talks on computer security. Out of respect for SO, I will only post an image of it, but you get the idea. It's obviously not something you'd want spreading around your website and freaking out visitors.

Upon further inspection, the character appears to be a letter of the Thai alphabet combined with more than 87 diacritics (is there even a limit?!). This got me thinking about security, localization, and how one might handle this sort of input. My searching led me to this question on Stack Overflow, and in turn to a blog post from Michael Kaplan on stripping diacritics. In it, he demonstrates how one can decompose a string into its "base" characters (simplified here for the sake of brevity):

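    // Requires: using System.Globalization; using System.Text;
    StringBuilder sb = new StringBuilder();
    foreach (char c in "façade".Normalize(NormalizationForm.FormD))
    {
        // Skip the combining marks (category Mn) split off by decomposition
        if (char.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            sb.Append(c);
    }
    Response.Write(sb.ToString()); // facade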

I can see how this would be useful in some cases, but in terms of user input, it would strip out ALL diacritics. As Kaplan points out, removing the diacritics in some languages can completely change the meaning of the word. This raises the question: how does one permit some diacritics in user input/output, but exclude extreme cases such as Mikko Hyppönen's über character?

940

asked Aug 15 '12 23:08

Derek Hunziker

1 Answer

> is there even a limit?!

Not intrinsically in Unicode. There is the concept of a 'Stream-Safe' format in UAX #15 that sets a limit of 30 combiners... Unicode strings in general are not guaranteed to be Stream-Safe, but this could certainly be taken as a sign that the Unicode Consortium does not intend to standardise new characters that would require a grapheme cluster longer than that.

30 is still an awful lot. The longest known natural-language grapheme cluster is the Tibetan Hakṣhmalawarayaṁ at 1 base plus 8 combiners, so for now it would be reasonable to normalise to NFD and disallow any sequence of more than 8 combiners in a row.

If you only care about common Western European languages, you can probably bring that down to 2. A reasonable compromise probably lies somewhere between those two limits.
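The approach described above (normalise to NFD, then limit the number of consecutive combiners) could be sketched roughly as follows. This is only an illustration under some assumptions: the class name `CombiningMarkLimiter` and its methods are hypothetical, the choice to count all three Unicode mark categories (Mn, Mc, Me) as "combiners" is mine rather than something stated above, and the truncating variant is an alternative to outright rejection that the answer does not itself propose.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper, not part of any library.
public static class CombiningMarkLimiter
{
    // Assumption: a "combiner" is any character in category Mn, Mc or Me.
    // Limitation: iterating by char means marks outside the BMP (encoded as
    // surrogate pairs) are not counted; they simply reset the run.
    private static bool IsCombiner(char c)
    {
        UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
        return cat == UnicodeCategory.NonSpacingMark
            || cat == UnicodeCategory.SpacingCombiningMark
            || cat == UnicodeCategory.EnclosingMark;
    }

    // Length of the longest run of consecutive combiners after NFD.
    // Note: Normalize throws ArgumentException on invalid Unicode
    // (for example lone surrogates), so callers may want to catch that.
    public static int LongestCombinerRun(string input)
    {
        int longest = 0, current = 0;
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            current = IsCombiner(c) ? current + 1 : 0;
            if (current > longest) longest = current;
        }
        return longest;
    }

    // Validation: true if no run of combiners exceeds maxCombiners.
    public static bool IsWithinLimit(string input, int maxCombiners)
        => LongestCombinerRun(input) <= maxCombiners;

    // Sanitising alternative: keep the first maxCombiners marks of each run,
    // drop the rest, then recompose to NFC.
    public static string TruncateCombinerRuns(string input, int maxCombiners)
    {
        var sb = new StringBuilder();
        int run = 0;
        foreach (char c in input.Normalize(NormalizationForm.FormD))
        {
            run = IsCombiner(c) ? run + 1 : 0;
            if (run <= maxCombiners) sb.Append(c);
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }
}
```

With this sketch, `CombiningMarkLimiter.IsWithinLimit(userInput, 8)` would apply the limit suggested for natural-language text in general, while passing 2 would apply the stricter Western European limit. Whether to reject the input or silently truncate it is a design choice that depends on the application.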

159

answered Oct 14 '22 00:10

bobince

Tags:

c#

unicode

user-input

zalgo

diacritics
