I'm trying to remove diacritic characters from a pangram in Polish. I'm using code from Michael Kaplan's blog http://www.siao2.com/2007/05/14/2629747.aspx, however, with no success.
Consider following pangram: "Pchnąć w tę łódź jeża lub ośm skrzyń fig.". Everything works fine but for letter "ł", I still get "ł". I guess the problem is that "ł" is represented as single unicode character and there is no following NonSpacingMark.
Do you have any idea how I can fix it (without relying on custom mapping in some dictionary - I'm looking for some kind of unicode conversion)?
Here is my quick implementation of Polish stoplist with normalization of Polish diacritics.
class StopList
{
private HashSet<String> set = new HashSet<String>();
public void add(String word)
{
word = word.trim().toLowerCase();
word = normalize(word);
set.add(word);
}
public boolean contains(final String string)
{
return set.contains(string) || set.contains(normalize(string));
}
private char normalizeChar(final char c)
{
switch ( c)
{
case 'ą':
return 'a';
case 'ć':
return 'c';
case 'ę':
return 'e';
case 'ł':
return 'l';
case 'ń':
return 'n';
case 'ó':
return 'o';
case 'ś':
return 's';
case 'ż':
case 'ź':
return 'z';
}
return c;
}
private String normalize(final String word)
{
if (word == null || "".equals(word))
{
return word;
}
char[] charArray = word.toCharArray();
char[] normalizedArray = new char[charArray.length];
for (int i = 0; i < normalizedArray.length; i++)
{
normalizedArray[i] = normalizeChar(charArray[i]);
}
return new String(normalizedArray);
}
}
I couldnt find any other solution in the Net. So maybe it will be helpful for someone (?)
Some time ago I've come across this solution, which seems to work fine:
public static string RemoveDiacritics(this string s)
{
string asciiEquivalents = Encoding.ASCII.GetString(
Encoding.GetEncoding("Cyrillic").GetBytes(s)
);
return asciiEquivalents;
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With