
How to compare Unicode characters that "look alike"?

I ran into a surprising issue.

I load a text file in my application, and I have some logic that compares values containing µ.

I realized that even when the two texts look identical, the comparison returns false.

    Console.WriteLine("μ".Equals("µ")); // False
    Console.WriteLine("µ".Equals("µ")); // True

In the second line, the character µ is copy-pasted, so both operands are the same character.

However, these might not be the only characters that are like this.

Is there any way in C# to compare the characters which look the same but are actually different?

D J asked Dec 19 '13 06:12



2 Answers

They really are different symbols, even though they look the same: the first is the actual Greek letter, with character code 956 (0x3BC), and the second is the micro sign, with character code 181 (0xB5).

References:

  • Unicode Character 'GREEK SMALL LETTER MU' (U+03BC)
  • Unicode Character 'MICRO SIGN' (U+00B5)
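
To check which code points a string actually contains, you can dump them yourself. The following is only a minimal sketch: the `Describe` helper is a name invented for this example, and it assumes the input contains no surrogate pairs (each `char` is treated as one code point).

```csharp
using System;
using System.Linq;

static class Program
{
    // Hypothetical helper: formats each UTF-16 code unit as U+XXXX.
    // Assumes BMP-only input; a surrogate pair would show up as two units.
    public static string Describe(string text) =>
        string.Join(" ", text.Select(c => "U+" + ((int)c).ToString("X4")));

    static void Main()
    {
        Console.WriteLine(Describe("\u03BC")); // U+03BC (GREEK SMALL LETTER MU)
        Console.WriteLine(Describe("\u00B5")); // U+00B5 (MICRO SIGN)
    }
}
```

Passing the text loaded from the file through a helper like this shows at a glance which of the two characters it contains.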

So if you want to compare them and need them to be equal, you have to handle it manually or replace one character with the other before comparing. Alternatively, use the following code:

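    using System;
    using System.Globalization;
    using System.Text;

    public class Program
    {
        public static void Main()
        {
            var s1 = "μ"; // GREEK SMALL LETTER MU (U+03BC)
            var s2 = "µ"; // MICRO SIGN (U+00B5)

            Console.WriteLine(s1.Equals(s2)); // False
            Console.WriteLine(RemoveDiacritics(s1).Equals(RemoveDiacritics(s2))); // True
        }

        static string RemoveDiacritics(string text)
        {
            // FormKC applies compatibility decomposition, which maps U+00B5 to U+03BC.
            var normalizedString = text.Normalize(NormalizationForm.FormKC);
            var stringBuilder = new StringBuilder();

            // Drop any non-spacing (combining) marks that remain after normalization.
            foreach (var c in normalizedString)
            {
                var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
                if (unicodeCategory != UnicodeCategory.NonSpacingMark)
                {
                    stringBuilder.Append(c);
                }
            }

            return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
        }
    }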

And the Demo
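
The "replace one char with another" option mentioned earlier in this answer could look roughly like the sketch below. `FoldMicroSign` is a hypothetical name, and the sample strings are made up for illustration. The helper only handles this one pair of characters, so any other look-alikes would need their own mapping.

```csharp
using System;

static class Program
{
    // Hypothetical helper: maps MICRO SIGN (U+00B5) to GREEK SMALL LETTER MU (U+03BC)
    // and leaves every other character untouched.
    public static string FoldMicroSign(string text) =>
        text.Replace('\u00B5', '\u03BC');

    static void Main()
    {
        string fromFile = "5 \u00B5m"; // made-up sample using the micro sign
        string expected = "5 \u03BCm"; // same text using the Greek letter

        Console.WriteLine(fromFile.Equals(expected));                               // False
        Console.WriteLine(FoldMicroSign(fromFile).Equals(FoldMicroSign(expected))); // True
    }
}
```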

Tony answered Oct 24 '22 09:10


In many cases, you can normalize both of the Unicode characters to a certain normalization form before comparing them, and they should be able to match. Of course, which normalization form you need to use depends on the characters themselves; just because they look alike doesn't necessarily mean they represent the same character. You also need to consider if it's appropriate for your use case — see Jukka K. Korpela's comment.

For this particular situation, if you refer to the links in Tony's answer, you'll see that the table for U+00B5 says:

Decomposition <compat> GREEK SMALL LETTER MU (U+03BC)

This means U+00B5, the second character in your original comparison, can be decomposed to U+03BC, the first character.

So you'll normalize the characters using full compatibility decomposition, with the normalization forms KC or KD. Here's a quick example I wrote up to demonstrate:

    using System;
    using System.Text;

    class Program
    {
        static void Main(string[] args)
        {
            char first = 'μ';
            char second = 'µ';

            // Technically you only need to normalize U+00B5 to obtain U+03BC, but
            // if you're unsure which character is which, you can safely normalize both
            string firstNormalized = first.ToString().Normalize(NormalizationForm.FormKD);
            string secondNormalized = second.ToString().Normalize(NormalizationForm.FormKD);

            Console.WriteLine(first.Equals(second));                     // False
            Console.WriteLine(firstNormalized.Equals(secondNormalized)); // True
        }
    }

For details on Unicode normalization and the different normalization forms refer to System.Text.NormalizationForm and the Unicode spec.
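
To illustrate why the choice of form matters, here is a rough sketch that compares the same pair under canonical and compatibility forms. `EqualsNormalized` is a helper name invented for this example, not part of the framework.

```csharp
using System;
using System.Text;

static class Program
{
    // Hypothetical helper: ordinal comparison after normalizing both inputs
    // to the same normalization form.
    public static bool EqualsNormalized(string a, string b, NormalizationForm form) =>
        string.Equals(a.Normalize(form), b.Normalize(form), StringComparison.Ordinal);

    static void Main()
    {
        const string greekMu = "\u03BC";   // GREEK SMALL LETTER MU
        const string microSign = "\u00B5"; // MICRO SIGN

        // Canonical composition keeps the two characters distinct.
        Console.WriteLine(EqualsNormalized(greekMu, microSign, NormalizationForm.FormC));  // False

        // Compatibility forms map MICRO SIGN to GREEK SMALL LETTER MU.
        Console.WriteLine(EqualsNormalized(greekMu, microSign, NormalizationForm.FormKC)); // True
        Console.WriteLine(EqualsNormalized(greekMu, microSign, NormalizationForm.FormKD)); // True

        // Latin A (U+0041) and Greek capital alpha (U+0391) are not unified by any form.
        Console.WriteLine(EqualsNormalized("\u0041", "\u0391", NormalizationForm.FormKC)); // False
    }
}
```

Keep in mind that compatibility normalization folds other distinctions as well. For example, the ligature U+FB01 becomes the two letters "fi" and the superscript U+00B2 becomes the digit "2", so it may be too aggressive for text where those differences carry meaning.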

BoltClock answered Oct 24 '22 09:10