I'm using this method to remove accents from my strings:
static string RemoveAccents(string input)
{
string normalized = input.Normalize(NormalizationForm.FormKD);
StringBuilder builder = new StringBuilder();
foreach (char c in normalized)
{
if (char.GetUnicodeCategory(c) !=
UnicodeCategory.NonSpacingMark)
{
builder.Append(c);
}
}
return builder.ToString();
}
but this method leaves đ as đ and doesn't change it to d, even though d is its base char. you can try it with this input string "æøåáâăäĺćçčéęëěíîďđńňóôőöřůúűüýţ"
What's so special in letter đ?
The answer for why it doesn't work is that the statement that "d is its base char" is false. U+0111 (LATIN SMALL LETTER D WITH STROKE) has Unicode category "Letter, Lowercase" and has no decomposition mapping (i.e., it doesn't decompose to "d" followed by a combining mark).
"đ".Normalize(NormalizationForm.FormD)
simply returns "đ"
, which is not stripped out by the loop because it is not a non-spacing mark.
A similar issue will exist for "ø" and other letters for which Unicode provides no decomposition mapping. (And if you're trying to find the "best" ASCII character to represent a Unicode letter, this approach won't work at all for Cyrillic, Greek, Chinese or other non-Latin alphabets; you'll also run into problems if you wanted to transliterate "ß" into "ss", for example. Using a library like UnidecodeSharp may help.)
I have to admit that I'm not sure why this works but it sure seems to
var str = "æøåáâăäĺćçčéęëěíîďđńňóôőöřůúűüýţ";
var noApostrophes = Encoding.ASCII.GetString(Encoding.GetEncoding("Cyrillic").GetBytes(str));
=> "aoaaaaalccceeeeiiddnnooooruuuuyt"
"D with stroke" (Wikipedia) is used in several languages, and appears to be considered a distinct letter in all of them -- and that is why it remains unchanged.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With