How to compare Unicode characters that "look alike"?

I ran into a surprising issue.

I loaded a text file in my application, and I have some logic that compares a value containing µ.

I realized that even when the two texts look the same, the comparison returns false.

    Console.WriteLine("μ".Equals("µ")); // returns false
    Console.WriteLine("µ".Equals("µ")); // returns true

In the second line, the character µ was copy-pasted.

However, these might not be the only characters that are like this.

Is there any way in C# to compare the characters which look the same but are actually different?

2 Answers

Because they really are different symbols, even though they look the same. The first is the actual Greek letter, with character code 956 (0x3BC), and the second is the micro sign, with character code 181 (0xB5).


  • Unicode Character 'GREEK SMALL LETTER MU' (U+03BC)
  • Unicode Character 'MICRO SIGN' (U+00B5)
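To see the difference for yourself, you can print the code points directly. This is only a minimal sketch; the class and variable names are made up for illustration, and the characters are written as escapes so the two look-alikes can't be mixed up in the source file.

```csharp
using System;

class CodePoints
{
    static void Main()
    {
        char greekMu = '\u03BC';   // GREEK SMALL LETTER MU
        char microSign = '\u00B5'; // MICRO SIGN

        // Print each character's code point in hex and decimal.
        Console.WriteLine($"U+{(int)greekMu:X4} = {(int)greekMu}");     // U+03BC = 956
        Console.WriteLine($"U+{(int)microSign:X4} = {(int)microSign}"); // U+00B5 = 181

        // char comparison is a plain numeric comparison, so these differ.
        Console.WriteLine(greekMu == microSign);                        // False
    }
}
```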

So if you need them to compare as equal, you have to handle it manually, for example by replacing one character with the other before the comparison. Alternatively, use the following code, which normalizes both strings first:

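    using System;
    using System.Globalization;
    using System.Text;

    class Program
    {
        static void Main()
        {
            var s1 = "μ"; // GREEK SMALL LETTER MU (U+03BC)
            var s2 = "µ"; // MICRO SIGN (U+00B5)

            Console.WriteLine(s1.Equals(s2)); // False
            Console.WriteLine(RemoveDiacritics(s1).Equals(RemoveDiacritics(s2))); // True
        }

        static string RemoveDiacritics(string text)
        {
            // FormKC applies compatibility decomposition, which maps U+00B5 to U+03BC;
            // this step is what makes the two strings equal.
            var normalizedString = text.Normalize(NormalizationForm.FormKC);
            var stringBuilder = new StringBuilder();

            foreach (var c in normalizedString)
            {
                // Skip any combining marks that remain after normalization.
                var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c);
                if (unicodeCategory != UnicodeCategory.NonSpacingMark)
                {
                    stringBuilder.Append(c);
                }
            }

            return stringBuilder.ToString().Normalize(NormalizationForm.FormC);
        }
    }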
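If this one pair of characters is all you care about, the manual replacement mentioned above could look something like this. It's a sketch only: `FoldMicroSign` is a hypothetical helper name and the sample strings are invented, and it deliberately handles nothing except U+00B5.

```csharp
using System;

static class MicroSignFolding
{
    // Hypothetical helper: maps MICRO SIGN (U+00B5) to GREEK SMALL LETTER MU (U+03BC)
    // and leaves every other character untouched.
    public static string FoldMicroSign(string text) => text.Replace('\u00B5', '\u03BC');

    static void Main()
    {
        string fromFile = "10 \u00B5m"; // contains MICRO SIGN
        string expected = "10 \u03BCm"; // contains GREEK SMALL LETTER MU

        Console.WriteLine(fromFile.Equals(expected));                               // False
        Console.WriteLine(FoldMicroSign(fromFile).Equals(FoldMicroSign(expected))); // True
    }
}
```

The upside is that no other character in the text is changed; the downside is that every additional look-alike pair has to be added by hand.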

In many cases, you can normalize both strings to the same Unicode normalization form before comparing them, and they should then match. Which normalization form you need depends on the characters themselves; just because they look alike doesn't necessarily mean they represent the same character. You also need to consider whether normalization is appropriate for your use case; see Jukka K. Korpela's comment.
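To illustrate why the choice of form matters, here is a small sketch (class name and sample strings are my own) that runs the micro sign through a canonical form and a compatibility form, and then shows two other characters that the compatibility form also rewrites, which may or may not be what you want.

```csharp
using System;
using System.Text;

class NormalizationForms
{
    static void Main()
    {
        string micro = "\u00B5"; // MICRO SIGN
        string mu = "\u03BC";    // GREEK SMALL LETTER MU

        // Canonical forms (C, D) leave MICRO SIGN alone, so the strings still differ.
        Console.WriteLine(micro.Normalize(NormalizationForm.FormC) == mu);  // False

        // Compatibility forms (KC, KD) map MICRO SIGN to GREEK SMALL LETTER MU.
        Console.WriteLine(micro.Normalize(NormalizationForm.FormKC) == mu); // True

        // The compatibility forms also rewrite other characters:
        Console.WriteLine("x\u00B2".Normalize(NormalizationForm.FormKC));   // x2   (superscript two becomes a plain digit)
        Console.WriteLine("\uFB01le".Normalize(NormalizationForm.FormKC));  // file (the fi ligature becomes two letters)
    }
}
```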

For this particular situation, if you refer to the links in Tony's answer, you'll see that the table for U+00B5 says:

Decomposition <compat> GREEK SMALL LETTER MU (U+03BC)

This means U+00B5, the second character in your original comparison, can be decomposed to U+03BC, the first character.

So you can normalize the characters using full compatibility decomposition, that is, with normalization form KC or KD. Here's a quick example I wrote up to demonstrate:

    using System;
    using System.Text;

    class Program
    {
        static void Main(string[] args)
        {
            char first = 'μ';
            char second = 'µ';

            // Technically you only need to normalize U+00B5 to obtain U+03BC, but
            // if you're unsure which character is which, you can safely normalize both
            string firstNormalized = first.ToString().Normalize(NormalizationForm.FormKD);
            string secondNormalized = second.ToString().Normalize(NormalizationForm.FormKD);

            Console.WriteLine(first.Equals(second));                     // False
            Console.WriteLine(firstNormalized.Equals(secondNormalized)); // True
        }
    }
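The same idea extends from single characters to whole strings. As a rough sketch, it could be wrapped in a helper like the one below; `CompatibilityEquals` is a hypothetical name, and the choice of FormKC plus an ordinal comparison is my assumption about what a typical caller would want.

```csharp
using System;
using System.Text;

static class CompatibilityComparer
{
    // Hypothetical helper: true when the two strings are equal after
    // compatibility normalization (FormKC), compared ordinally.
    public static bool CompatibilityEquals(string a, string b)
    {
        // Two nulls are equal; a null and a non-null string are not.
        if (a is null || b is null) return object.ReferenceEquals(a, b);

        return string.Equals(
            a.Normalize(NormalizationForm.FormKC),
            b.Normalize(NormalizationForm.FormKC),
            StringComparison.Ordinal);
    }

    static void Main()
    {
        Console.WriteLine(CompatibilityEquals("5 \u00B5s", "5 \u03BCs")); // True
        Console.WriteLine(CompatibilityEquals("5 \u00B5s", "5 ms"));      // False
    }
}
```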

For details on Unicode normalization and the different normalization forms, refer to the documentation for System.Text.NormalizationForm and the Unicode spec.

