Possible Duplicate:
ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ --> n or Remove diacritical marks from unicode chars
How to remove diacritics from strings?
For example, transform á->a, č->c, etc., in a way that works for all languages.
I'm doing full-text search and need to ignore any diacritics in the searched text.
Thanks
Use java.text.Normalizer to handle this for you. Normalizing to the decomposed form (NFD) will separate all of the accent marks from the characters, so you can then strip the marks with a regex.
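To make the "separate" step concrete, here is a minimal sketch of what NFD normalization does to a single accented letter. The class and method names (DecomposeDemo, decompose) are made up for illustration; only java.text.Normalizer comes from the answer above.

```java
import java.text.Normalizer;

final class DecomposeDemo {
    // Returns the NFD (canonical decomposition) form of the input.
    static String decompose(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFD);
    }

    public static void main(String[] args) {
        String composed = "\u00e9"; // e with acute accent as a single code point
        String decomposed = decompose(composed);

        System.out.println(composed.length());            // 1
        System.out.println(decomposed.length());          // 2
        // After NFD the string is a plain 'e' followed by U+0301 (combining acute accent).
        System.out.println(decomposed.equals("e\u0301")); // true
    }
}
```

Once the accent is its own character, removing it is an ordinary regex replacement.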
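If a String str2 is assigned \uFFFF, it holds the highest value a Java char (a 16-bit UTF-16 code unit) can take, not the highest Unicode code point, which is U+10FFFF. To convert a string into UTF-8 bytes, use the getBytes("UTF-8") method.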
\p{InCombiningDiacriticalMarks} is a Unicode block property. In JDK 7 and later, you can also write it using the two-part notation \p{Block=CombiningDiacriticalMarks}, which may be clearer to the reader. The Block property is documented in UAX #44, "The Unicode Character Database".
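As a rough check that the two notations behave the same, the sketch below runs both patterns over the same decomposed input. It assumes Java 7 or later for the \p{Block=...} form; the class name BlockNotationDemo and the helper strip are hypothetical.

```java
import java.text.Normalizer;

final class BlockNotationDemo {
    // Decomposes the input (NFD), then removes everything matched by marksPattern.
    static String strip(String input, String marksPattern) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll(marksPattern, "");
    }

    public static void main(String[] args) {
        // "creme brulee" written with grave, circumflex and acute accents
        String word = "cr\u00e8me br\u00fbl\u00e9e";

        // Block prefix notation, as used in the answers here
        System.out.println(strip(word, "\\p{InCombiningDiacriticalMarks}+"));    // creme brulee
        // Two-part notation, accepted by java.util.regex since Java 7
        System.out.println(strip(word, "\\p{Block=CombiningDiacriticalMarks}+")); // creme brulee
    }
}
```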
On Android API level 9+, you can use the Normalizer class, e.g.:
import java.text.Normalizer;
import java.text.Normalizer.Form;
String normalized = Normalizer.normalize("âbĉdêéè", Form.NFD)
        .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
This would return "abcdeee".
(Keyser's linked answer looks better; it cleans more crap.)
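For the full-text search case in the question, one possible shape is to fold both the stored text and the query before comparing them. This is only a sketch: DiacriticFolding, fold and containsIgnoringDiacritics are invented names, and a real search index would more likely fold the text once at indexing time.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

final class DiacriticFolding {
    // Compiled once; matches runs of characters from the Combining Diacritical Marks block.
    private static final Pattern MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Decomposes the input (NFD) and drops the combining marks.
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return MARKS.matcher(decomposed).replaceAll("");
    }

    // Substring match that ignores diacritics on both sides (still case-sensitive).
    static boolean containsIgnoringDiacritics(String text, String query) {
        return fold(text).contains(fold(query));
    }

    public static void main(String[] args) {
        String name = "Dvo\u0159\u00e1k"; // "Dvorak" with caron on r and acute on a
        System.out.println(fold(name));                                // Dvorak
        System.out.println(containsIgnoringDiacritics(name, "vorak")); // true

        // U+0142 (l with stroke) has no canonical decomposition, so it is left as-is.
        System.out.println(fold("\u0142").equals("\u0142"));           // true
    }
}
```

The last line shows a limit of this approach: letters such as ł or ø are not base letter plus mark in Unicode, so they need an explicit replacement table if they should also match their plain counterparts.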