How to normalize a fancy-looking Unicode string in C#?

I receive text from a REST API that is styled like this, for example:

  • 𝓗𝓸𝔀 𝓽𝓸 𝓻𝓮𝓶𝓸𝓿𝓮 𝓽𝓱𝓲𝓼 𝓯𝓸𝓷𝓽 𝓯𝓻𝓸𝓶 𝓪 𝓼𝓽𝓻𝓲𝓷𝓰?

  • 𝐻𝑜𝓌 𝓉𝑜 𝓇𝑒𝓂𝑜𝓋𝑒 𝓉𝒽𝒾𝓈 𝒻𝑜𝓃𝓉 𝒻𝓇𝑜𝓂 𝒶 𝓈𝓉𝓇𝒾𝓃𝑔?

  • нσω тσ яємσνє тнιѕ ƒσηт ƒяσм α ѕтяιηg?

This is not italic, bold, or underlined formatting, since the value is a plain string; the styling comes from the characters themselves. Text like this fails my regex ^[a-zA-Z0-9._]*$

I would like to normalize the received string into a standard one so that it passes my regex.

Luigi Saggese asked May 22 '20 16:05


1 Answer

You can use the Unicode compatibility normalization forms (NFKC and NFKD), which use Unicode's own (lossy) character mappings to transform letter-like characters (among other things) into their simplified equivalents.

In Python, for instance:

>>> from unicodedata import normalize
>>> normalize('NFKD','𝓗𝓸𝔀 𝓽𝓸 𝓻𝓮𝓶𝓸𝓿𝓮 𝓽𝓱𝓲𝓼 𝓯𝓸𝓷𝓽 𝓯𝓻𝓸𝓶 𝓪 𝓼𝓽𝓻𝓲𝓷𝓰')
'How to remove this font from a string'

# EDIT: This one wouldn't work
>>> normalize('NFKD','нσω тσ яємσνє тнιѕ ƒσηт ƒяσм α ѕтяιηg?')
'нσω тσ яємσνє тнιѕ ƒσηт ƒяσм α ѕтяιηg?'

EDIT: Note that this only applies to stylistic forms (superscripts, blackletter, full-width, etc.). Your third example uses genuinely different non-Latin letters (Cyrillic and Greek look-alikes), so it can't be decomposed to ASCII.
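To tie this back to the regex in the question, here is a minimal Python sketch of normalizing first and validating afterwards. The helper names normalize_and_check and has_compat_mapping are made up for illustration; only unicodedata.normalize, unicodedata.decomposition and the re module are real library calls. It also shows one way to see why the third example is left untouched: its letters carry no decomposition mapping at all.

```python
import re
import unicodedata

# The pattern from the question; fullmatch() makes the ^ and $ anchors redundant.
ALLOWED = re.compile(r"[a-zA-Z0-9._]*")


def normalize_and_check(text):
    """Return (NFKD-normalized text, whether it matches the question's pattern).

    Hypothetical helper, not part of any library.
    """
    normalized = unicodedata.normalize("NFKD", text)
    return normalized, ALLOWED.fullmatch(normalized) is not None


def has_compat_mapping(ch):
    """True if Unicode defines a decomposition mapping for this character.

    Hypothetical helper; decomposition() returns '' when there is none.
    """
    return unicodedata.decomposition(ch) != ""


print(normalize_and_check("𝓗𝓸𝔀"))  # styled mathematical letters
print(normalize_and_check("нσω"))  # Cyrillic and Greek letters
print(has_compat_mapping("𝓗"), has_compat_mapping("н"))
```

The first call should come back as plain "How" and pass the pattern, while the second comes back unchanged and fails it. Note that the pattern allows neither spaces nor "?", so the full sentences from the question would be rejected even after normalization; that is a property of the regex, not of the normalization step.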

EDIT2: I didn't realize your question was specific to C#. Here's the documentation for String.Normalize, which does just that:

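using System.Text; // NormalizationForm is defined in System.Text

string s1 = "𝓗𝓸𝔀 𝓽𝓸 𝓻𝓮𝓶𝓸𝓿𝓮 𝓽𝓱𝓲𝓼 𝓯𝓸𝓷𝓽 𝓯𝓻𝓸𝓶 𝓪 𝓼𝓽𝓻𝓲𝓷𝓰";
string s2 = s1.Normalize(NormalizationForm.FormKD); // compatibility decomposition (NFKD)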
VLRoyrenn answered Sep 22 '22 18:09