 

What does .NET's String.Normalize do?

Tags: string, .net

The MSDN article on String.Normalize states simply:

Returns a new string whose binary representation is in a particular Unicode normalization form.

It also sometimes refers to "Unicode normalization form C."

I'm just wondering, what does that mean? How is this function useful in real-life situations?

GeReV asked Jul 20 '10 08:07




2 Answers

One difference between form C and form D is how letters with accents are represented: form C uses a single letter-with-accent codepoint, while form D separates that into a letter and an accent.

For instance, an "à" can be codepoint 224 ("Latin small letter A with grave"), or codepoint 97 ("Latin small letter A") followed by codepoint 768 ("Combining grave accent"). A char-by-char comparison would see these as different. Normalising both strings to the same form first lets the comparison succeed.
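As a rough illustration of the comparison problem, here is a minimal sketch. The class name `NormalizationDemo` and the helper `EqualAfterNormalizing` are made up for this example; only `String.Normalize`, `NormalizationForm` and `StringComparison.Ordinal` come from the framework.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Hypothetical helper: normalizes both inputs to form C and then
    // compares them code unit by code unit.
    public static bool EqualAfterNormalizing(string a, string b)
    {
        return string.Equals(
            a.Normalize(NormalizationForm.FormC),
            b.Normalize(NormalizationForm.FormC),
            StringComparison.Ordinal);
    }

    public static void Main()
    {
        string composed = "\u00E0";    // codepoint 224, "à" as a single char
        string decomposed = "a\u0300"; // codepoint 97 followed by codepoint 768

        // The raw strings differ char by char.
        Console.WriteLine(string.Equals(composed, decomposed, StringComparison.Ordinal)); // False

        // After normalization they compare equal.
        Console.WriteLine(EqualAfterNormalizing(composed, decomposed)); // True

        // Form D splits the letter from the accent, form C joins them again.
        Console.WriteLine(composed.Normalize(NormalizationForm.FormD).Length);   // 2
        Console.WriteLine(decomposed.Normalize(NormalizationForm.FormC).Length); // 1
    }
}
```

Either form works for the comparison as long as both sides use the same one; form C is picked here only because it is the default of the parameterless `Normalize()`.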

A side-effect is that this makes it possible to easily create a "remove accents" method.

    using System.Globalization;
    using System.Linq;

    public static string RemoveAccents(string input) {
        // Normalizing to FormD splits accented letters into letter + accent.
        // The filter then removes those accents (and other non-spacing marks),
        // and a new string is created from the remaining chars.
        return new string(input
            .Normalize(System.Text.NormalizationForm.FormD)
            .ToCharArray()
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray());
    }
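A short usage sketch for the method above, wrapped in a hypothetical `AccentDemo` class so that it compiles on its own. The sample words are chosen for illustration. One caveat worth seeing in practice: letters that have no canonical decomposition, such as "Ø", pass through unchanged, so this is not a general "convert to ASCII" routine.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

public static class AccentDemo
{
    // Same approach as the method above: decompose to form D,
    // then drop every char classified as a non-spacing mark.
    public static string RemoveAccents(string input)
    {
        return new string(input
            .Normalize(NormalizationForm.FormD)
            .Where(c => CharUnicodeInfo.GetUnicodeCategory(c) != UnicodeCategory.NonSpacingMark)
            .ToArray());
    }

    public static void Main()
    {
        // "crème brûlée" written with escapes to avoid source-encoding issues
        Console.WriteLine(RemoveAccents("cr\u00E8me br\u00FBl\u00E9e")); // creme brulee

        // "Ångström": both the ring and the diaeresis are combining marks in form D
        Console.WriteLine(RemoveAccents("\u00C5ngstr\u00F6m")); // Angstrom

        // "Ø" (U+00D8) has no canonical decomposition, so it is left as it is
        Console.WriteLine(RemoveAccents("\u00D8resund") == "\u00D8resund"); // True
    }
}
```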
Hans Kesting answered Sep 28 '22 04:09


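It makes sure that Unicode strings can be compared for equality even when the same text is represented by different sequences of code points (for example, a precomposed accented letter versus a base letter followed by a combining mark).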

From Unicode Standard Annex #15:

Essentially, the Unicode Normalization Algorithm puts all combining marks in a specified order, and uses rules for decomposition and composition to transform each string into one of the Unicode Normalization Forms. A binary comparison of the transformed strings will then determine equivalence.
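The "specified order" for combining marks in that quote can be sketched as follows. The example attaches a dot above (U+0307) and a dot below (U+0323) to a "q" in both possible orders. The class `MarkOrderDemo` and the helper `CodePointsFormD` are invented for this illustration, and the helper assumes the input contains only BMP characters, so each `char` is one code point.

```csharp
using System;
using System.Linq;
using System.Text;

public static class MarkOrderDemo
{
    // Hypothetical helper: normalizes to form D and lists the resulting
    // chars as "U+XXXX" (assumes BMP-only input).
    public static string CodePointsFormD(string s)
    {
        string normalized = s.Normalize(NormalizationForm.FormD);
        return string.Join(" ", normalized.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    public static void Main()
    {
        string aboveThenBelow = "q\u0307\u0323"; // dot above first, then dot below
        string belowThenAbove = "q\u0323\u0307"; // dot below first, then dot above

        // Both render the same way, but a binary comparison says they differ.
        Console.WriteLine(string.Equals(aboveThenBelow, belowThenAbove, StringComparison.Ordinal)); // False

        // Normalization reorders the marks, so both give the same sequence.
        Console.WriteLine(CodePointsFormD(aboveThenBelow)); // U+0071 U+0323 U+0307
        Console.WriteLine(CodePointsFormD(belowThenAbove)); // U+0071 U+0323 U+0307

        // IsNormalized reports whether a string is already in the given form.
        Console.WriteLine(aboveThenBelow.IsNormalized(NormalizationForm.FormD)); // False
        Console.WriteLine(belowThenAbove.IsNormalized(NormalizationForm.FormD)); // True
    }
}
```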

Oded answered Sep 28 '22 05:09