
ICU4J: Cyrillic to Latin

I'm trying to convert Cyrillic words to Latin so I can use them in URLs. I use the icu4j Transliterator, but it still produces odd characters like this: Vilʹândimaa. It should be more like viljandimaa. When I copy that URL, these letters turn into useless %.. escape sequences.

Does anybody know how to get Cyrillic to a-z with icu4j?

UPDATE

I can't answer my own question yet, but I found this question, which was very helpful: Converting Symbols, Accent Letters to English Alphabet

ivar asked Apr 28 '11 12:04


1 Answer

Extend the transliterator identifier so that it does what you want. You can strip unwanted characters by adding a Remove transform with a regex-like character-set filter.

For example, consider the string "'Eé математика":

"'E\u00E9 \u043c\u0430\u0442\u0435\u043c\u0430\u0442\u0438\u043a\u0430"

The identifier "Any-Latin; NFD; [^\\p{Alnum}] Remove" transliterates to Latin (the result may still include accents), decomposes each accented character into its base letter plus combining diacritics, and removes everything that isn't alphanumeric. The resulting string is "Eematematika".
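To see what the last two stages contribute on their own, here is a minimal sketch that imitates only "NFD; [^\\p{Alnum}] Remove" using the JDK, without ICU4J. It does not transliterate Cyrillic, so the input is assumed to be Latin already. The class name NfdRemoveSketch and the helper decomposeAndStrip are hypothetical names chosen for illustration, and the final lower-casing is an extra step for URL use, not part of the identifier above. Note that in java.util.regex, \p{Alnum} matches ASCII letters and digits only by default, which may be narrower than the ICU set.

```java
import java.text.Normalizer;
import java.util.Locale;

public class NfdRemoveSketch {

    // Hypothetical helper: imitates only the "NFD; [^\p{Alnum}] Remove" stages.
    // It performs no script transliteration.
    public static String decomposeAndStrip(String latinText) {
        // NFD splits an accented letter into base letter + combining mark.
        String decomposed = Normalizer.normalize(latinText, Normalizer.Form.NFD);
        // Drop everything outside ASCII letters and digits, including the
        // combining marks produced by NFD, apostrophes, and spaces.
        return decomposed.replaceAll("[^\\p{Alnum}]", "");
    }

    public static void main(String[] args) {
        // Latin form of the sample string from this answer.
        System.out.println(decomposeAndStrip("'E\u00E9 matematika")); // Eematematika

        // The transliterated string from the question, lower-cased for a URL.
        String slug = decomposeAndStrip("Vil\u02B9\u00E2ndimaa").toLowerCase(Locale.ROOT);
        System.out.println(slug); // vilandimaa
    }
}
```

Under these assumptions the question's string becomes "vilandimaa" rather than "viljandimaa": stripping removes the accent and the prime mark, but it cannot turn "â" into "ja".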

You can read more on the identifiers under General Transforms on the ICU website.


Example:

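import com.ibm.icu.text.Transliterator;

// Sample input: an apostrophe, an accented Latin letter, and a Cyrillic word.
String cyrillic
       = "'E\u00E9 \u043c\u0430\u0442\u0435\u043c\u0430\u0442\u0438\u043a\u0430";
// Transliterate to Latin, decompose accents, then remove non-alphanumerics.
String id = "Any-Latin; NFD; [^\\p{Alnum}] Remove";
String latin = Transliterator.getInstance(id)
                             .transform(cyrillic);
System.out.println(latin); // Eematematika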

Tested against ICU4J 49.1.

McDowell answered Sep 25 '22 06:09