 

Remove diacritics from string in Java [duplicate]

Possible Duplicate:
ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ --> n or Remove diacritical marks from unicode chars

How to remove diacritics from strings?

For example, transform á->a, č->c, etc., in a way that works for all languages.

I'm doing full-text search and need to ignore any diacritics in the searched text.

Thanks

Pointer Null asked May 22 '12 10:05


People also ask

How do I remove diacritics accents from a string in Java?

Use java.text.Normalizer to handle this for you. Normalizing to form NFD separates the accent marks from their base characters, so the marks can then be removed.

What is uFFFF in Java?

\uFFFF is the largest value a Java char can hold (the highest Unicode code point is U+10FFFF). To encode a string as UTF-8 bytes, use the getBytes("UTF-8") method.

What is InCombiningDiacriticalMarks?

\p{InCombiningDiacriticalMarks} is a Unicode block property. Since JDK 7, you can also write it using the two-part notation \p{Block=CombiningDiacriticalMarks}, which may be clearer to the reader. Block properties are documented in UAX #44: "The Unicode Character Database".


1 Answer

On Android API level 9+ (or Java 6+), you can use the java.text.Normalizer class, e.g.

import java.text.Normalizer;
import java.text.Normalizer.Form;
String normalized = Normalizer.normalize("âbĉdêéè", Form.NFD)
    .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");

This returns "abcdeee".

(Keyser's linked answer looks better; it cleans up more characters.)
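For the full-text search case in the question, the same idea can be wrapped in a small helper. This is only a minimal sketch: the class name DiacriticsDemo and the methods stripDiacritics and matchesIgnoringDiacritics are hypothetical names chosen for illustration, and the case folding with Locale.ROOT is an assumption about what the search should ignore. The string literals use Unicode escapes so the result does not depend on the source file encoding.

```java
import java.text.Normalizer;
import java.util.Locale;
import java.util.regex.Pattern;

final class DiacriticsDemo {
    // Compiled once and reused; matches runs of characters from the
    // Combining Diacritical Marks block (U+0300..U+036F).
    private static final Pattern COMBINING_MARKS =
            Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Hypothetical helper: decompose to NFD, then drop the combining marks.
    static String stripDiacritics(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return COMBINING_MARKS.matcher(decomposed).replaceAll("");
    }

    // Hypothetical helper: substring match that ignores accents and case.
    static boolean matchesIgnoringDiacritics(String text, String query) {
        String t = stripDiacritics(text).toLowerCase(Locale.ROOT);
        String q = stripDiacritics(query).toLowerCase(Locale.ROOT);
        return t.contains(q);
    }

    public static void main(String[] args) {
        // "âbĉdêéè" written with escapes
        System.out.println(stripDiacritics("\u00e2b\u0109d\u00ea\u00e9\u00e8")); // abcdeee
        // "Café Čokoláda" searched with the unaccented query "cokolada"
        System.out.println(
                matchesIgnoringDiacritics("Caf\u00e9 \u010cokol\u00e1da", "cokolada")); // true
    }
}
```

Note that this only removes marks that NFD splits off. Letters with no canonical decomposition, such as ł, pass through unchanged, so a search index may need extra mappings for those.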

Jens answered Oct 28 '22 15:10