I fall into a surprising issue.
I loaded a text file in my application and I have some logic which compares the value having µ.
And I realized that even if the texts are same the compare value is false.
Console.WriteLine("μ".Equals("µ")); // returns false Console.WriteLine("µ".Equals("µ")); // return true
In later line the character µ is copy pasted.
However, these might not be the only characters that are like this.
Is there any way in C# to compare the characters which look the same but are actually different?
Unicode is explicitly defined such as to overlap in that same range with ASCII. Thus, if you look at the character codes in your string, and it contains anything that is higher than 127, the string contains Unicode characters that are not ASCII characters. Note, that ASCII includes only the English alphabet.
𪚥 is the most complex unicode Chinese character by strokes (64).
The code point is a unique number for a character or some symbol such as an accent mark or ligature. Unicode supports more than a million code points, which are written with a "U" followed by a plus sign and the number in hex; for example, the word "Hello" is written U+0048 U+0065 U+006C U+006C U+006F (see hex chart).
However, there are lots of sets of characters that look alike but aren't equivalent under any Unicode normalization form. For example, A (Latin), Α (Greek), and А (Cyrillic). The Unicode website has a confusables.txt file with a list of these, intended to help developers guard against homograph attacks.
Unicode Lookup is an online reference tool to lookup Unicode and HTML special characters, by name and number, and convert between their decimal, hexadecimal, and octal bases. Contains 1,114,112 characters.
In many cases, you can normalize both of the Unicode characters to a certain normalization form before comparing them, and they should be able to match. Of course, which normalization form you need to use depends on the characters themselves; just because they look alike doesn't necessarily mean they represent the same character.
Type any string to search for Unicode characters and HTML/XHTML entities by name. Enter any single character to find details on that character. Type any number to search by codepoint: 123 decimal number. 0371 octal. 0x1D351 hexadecimal.
Because it is really different symbols even they look the same, first is the actual letter and has char code = 956 (0x3BC)
and the second is the micro sign and has 181 (0xB5)
.
References:
So if you want to compare them and you need them to be equal, you need to handle it manually, or replace one char with another before comparison. Or use the following code:
public void Main() { var s1 = "μ"; var s2 = "µ"; Console.WriteLine(s1.Equals(s2)); // false Console.WriteLine(RemoveDiacritics(s1).Equals(RemoveDiacritics(s2))); // true } static string RemoveDiacritics(string text) { var normalizedString = text.Normalize(NormalizationForm.FormKC); var stringBuilder = new StringBuilder(); foreach (var c in normalizedString) { var unicodeCategory = CharUnicodeInfo.GetUnicodeCategory(c); if (unicodeCategory != UnicodeCategory.NonSpacingMark) { stringBuilder.Append(c); } } return stringBuilder.ToString().Normalize(NormalizationForm.FormC); }
And the Demo
In many cases, you can normalize both of the Unicode characters to a certain normalization form before comparing them, and they should be able to match. Of course, which normalization form you need to use depends on the characters themselves; just because they look alike doesn't necessarily mean they represent the same character. You also need to consider if it's appropriate for your use case — see Jukka K. Korpela's comment.
For this particular situation, if you refer to the links in Tony's answer, you'll see that the table for U+00B5 says:
Decomposition <compat> GREEK SMALL LETTER MU (U+03BC)
This means U+00B5, the second character in your original comparison, can be decomposed to U+03BC, the first character.
So you'll normalize the characters using full compatibility decomposition, with the normalization forms KC or KD. Here's a quick example I wrote up to demonstrate:
using System; using System.Text; class Program { static void Main(string[] args) { char first = 'μ'; char second = 'µ'; // Technically you only need to normalize U+00B5 to obtain U+03BC, but // if you're unsure which character is which, you can safely normalize both string firstNormalized = first.ToString().Normalize(NormalizationForm.FormKD); string secondNormalized = second.ToString().Normalize(NormalizationForm.FormKD); Console.WriteLine(first.Equals(second)); // False Console.WriteLine(firstNormalized.Equals(secondNormalized)); // True } }
For details on Unicode normalization and the different normalization forms refer to System.Text.NormalizationForm
and the Unicode spec.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With