
Convert Unicode to ASCII without changing the string length (in Java)

What is the best way to convert a string from Unicode to ASCII without changing its length (this is very important in my case)? Characters that convert without problems must also stay at the same positions as in the original string. So an "Ä" must be converted to "A", not to something cryptic that takes more characters.

Edit:
@novalis - Such symbols (for example, those of Asian languages) should just be converted to placeholders. I am not too interested in those words or what they mean.

@MtnViewMark - I must preserve the total number of characters and the positions of all ASCII-representable characters under any circumstances.

Here is some more info: I have some text mining tools that can only process ASCII strings. Most of the text to be processed is in English, but some of it contains non-ASCII characters. I am not interested in those words, but I must be sure that the words I am interested in (those that contain only ASCII characters) are at the same positions after the conversion.

Zardoz asked Jan 19 '10 20:01





1 Answer

As Paul Taylor mentioned, using Normalizer is a problem if the project must compile and run both on pre-1.6 Java and on Java 1.6 and higher. You will run into trouble because Normalizer lives in different packages (sun.text.Normalizer before 1.6, java.text.Normalizer from 1.6 on) and the two classes have different method signatures.

The usual recommendation is to invoke the appropriate Normalizer.normalize() method via reflection. (An example can be found here.)
But if you don't want a reflection mess in your code, you can use the icu4j library. It contains the com.ibm.icu.text.Normalizer class, whose normalize() method does the same job as java.text.Normalizer/sun.text.Normalizer. The ICU library has (or should have) its own implementation of Normalizer, so you can ship the library with your project and the result should not depend on the Java version.
The disadvantage is that the ICU library is quite big.
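
For reference, this is roughly the job that all of these Normalizer variants are asked to do. The sketch below uses only java.text.Normalizer, so it assumes Java 1.6 or later; the class and method names (AccentStripper, stripAccents) are made up for illustration and are not part of any library. It decomposes each accented letter into a base letter plus combining marks (NFD) and then drops the marks. Characters that have no decomposition are left as they are.

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only (Java 1.6+).
final class AccentStripper {
    private AccentStripper() {}

    // Decompose to NFD, then remove all combining marks (Unicode category M).
    static String stripAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "");
    }

    public static void main(String[] args) {
        // "\u00C4" is A with diaeresis, "\u00E9" is e with acute.
        System.out.println(stripAccents("\u00C4rger im Caf\u00E9")); // Arger im Cafe
    }
}
```

Unicode escapes are used in the literals so the example does not depend on the source file encoding.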

If you are using the Normalizer class just to remove accents/diacritics from Strings, there is another way. You can use the Apache Commons Lang library (version 3), whose StringUtils class has a stripAccents() method:

String noAccentsString = org.apache.commons.lang3.StringUtils.stripAccents(s);

The Lang3 library probably uses reflection to invoke the appropriate Normalizer for the running Java version. So the advantage is that you don't have the reflection mess in your own code.
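
Accent stripping alone does not cover the placeholder requirement from the question, because characters with no ASCII base letter (CJK text, for example) are not turned into ASCII. A minimal sketch of a length-preserving variant follows, assuming java.text.Normalizer is available (Java 1.6+). The names AsciiFold and toAscii and the choice of '?' as placeholder are hypothetical. It maps every char of the input to exactly one ASCII char, so String.length() and the positions of all ASCII characters stay the same.

```java
import java.text.Normalizer;

// Hypothetical helper, for illustration only (Java 1.6+).
final class AsciiFold {
    private AsciiFold() {}

    // Returns an all-ASCII string with the same length() as the input.
    static String toAscii(String input, char placeholder) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c < 128) {
                out.append(c); // already ASCII, keep in place
            } else if (Character.isHighSurrogate(c) || Character.isLowSurrogate(c)) {
                out.append(placeholder); // one placeholder per UTF-16 unit
            } else {
                // Decompose this single char; keep the base letter if it is ASCII.
                String nfd = Normalizer.normalize(String.valueOf(c), Normalizer.Form.NFD);
                char base = nfd.charAt(0);
                out.append(base < 128 ? base : placeholder);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "\u00C4" is A with diaeresis; "\u65E5\u672C" are two CJK characters.
        System.out.println(toAscii("\u00C4pfel und \u65E5\u672C", '?')); // Apfel und ??
    }
}
```

Working one char at a time is what keeps the positions fixed. A combining mark that is already separate in the input becomes a placeholder instead of being dropped, and each half of a surrogate pair gets its own placeholder.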

sporak answered Oct 26 '22 04:10
