The MSDN article on String.Normalize states simply:
Returns a new string whose binary representation is in a particular Unicode normalization form.
It also sometimes refers to a "Unicode normalization form C."
I'm just wondering, what does that mean? How is this function useful in real life situations?
The precomposed form has a canonical decomposition that makes the two representations canonically equivalent. Normalizing a string essentially means consistently picking one of these equivalent encodings, that is, either all composed or all decomposed. By contrast, unnormalized data may contain both forms.
normalize() is a built-in JavaScript string method that returns the Unicode normalization form of a given string. If the input is not a string, it is first converted to one and then normalized.
The standard also defines a text normalization procedure, called Unicode normalization, that replaces equivalent sequences of characters so that any two texts that are equivalent will be reduced to the same sequence of code points, called the normalization form or normal form of the original text.
One difference between form C and form D is how letters with accents are represented: form C uses a single letter-with-accent codepoint, while form D separates that into a letter and an accent.
For instance, an "à" can be codepoint 224 ("Latin small letter A with grave"), or codepoint 97 ("Latin small letter A") followed by codepoint 768 ("Combining grave accent"). A char-by-char comparison would see these as different. Normalisation lets the comparison succeed.
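To make that concrete, here is a minimal sketch of such a comparison. The class and method names (NormalizationDemo, CanonicallyEqual) are made up for illustration; only string.Normalize, NormalizationForm and string.Equals come from .NET.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: treats two strings as equal when their
    // NFC (form C) representations match code unit by code unit.
    public static bool CanonicallyEqual(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormC),
            b.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
    }
}
```

With composed = "\u00E0" and decomposed = "a\u0300", the expression composed == decomposed is false because == on strings is an ordinal comparison, while NormalizationDemo.CanonicallyEqual(composed, decomposed) is true.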
A side-effect is that this makes it possible to easily create a "remove accents" method.
    using System.Globalization;
    using System.Linq;
    using System.Text;

    public static string RemoveAccents(string input)
    {
        // Normalizing to FormD splits accented letters into letter + accent.
        // The filter then removes those accents (and other non-spacing marks),
        // and a new string is created from the remaining chars.
        return new string(input
            .Normalize(NormalizationForm.FormD)
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray());
    }
It makes sure that Unicode strings can be compared for equality even if they use different code point sequences to represent the same text.
From Unicode Standard Annex #15:
Essentially, the Unicode Normalization Algorithm puts all combining marks in a specified order, and uses rules for decomposition and composition to transform each string into one of the Unicode Normalization Forms. A binary comparison of the transformed strings will then determine equivalence.
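The reordering of combining marks can be seen with a small sketch. The helper below (CombiningOrderDemo.DescribeCodePoints, a name invented for this example) normalizes a string and lists the resulting code points, assuming the input is well-formed UTF-16.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class CombiningOrderDemo
{
    // Hypothetical helper: normalizes s to the given form and returns
    // its code points as text, e.g. "U+0071 U+0323 U+0307".
    public static string DescribeCodePoints(string s, NormalizationForm form)
    {
        string normalized = s.Normalize(form);
        var parts = new List<string>();
        // Step by 2 over surrogate pairs so each code point is listed once.
        for (int i = 0; i < normalized.Length;
             i += char.IsSurrogatePair(normalized, i) ? 2 : 1)
        {
            parts.Add("U+" + char.ConvertToUtf32(normalized, i).ToString("X4"));
        }
        return string.Join(" ", parts);
    }
}
```

For example, "q\u0307\u0323" (dot above, then dot below) and "q\u0323\u0307" (dot below, then dot above) both come out as "U+0071 U+0323 U+0307" under NormalizationForm.FormD, because the dot below has the lower combining class and is sorted first. After that, a binary comparison of the two normalized strings reports them as equal.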