We have a system where customers, mainly European enter texts (in UTF-8) that has to be distributed to different systems, most of them accepting UTF-8, but now we must also distribute the texts to a US system which only accepts US-Ascii 7-bit
So now we'll need to translate all European characters to the nearest US-Ascii. Is there any Java libraries to help with this task?
Right now we've just started adding to a translation table, where Å (swedish AA)->A and so on and where we don't find any match for an entered character, we'll log it and replace with a question mark and try and fix that for the next release, but it seems very inefficient and somebody else must have done something similair before.
ASCII is a subset of UTF-8, so all ASCII files are already UTF-8 encoded. The bytes in the ASCII file and the bytes that would result from "encoding it to UTF-8" would be exactly the same bytes. There's no difference between them, so there's no need to do anything.
UTF-8 is not a character set but an encoding used with Unicode. It happens to be compatible with ASCII too, because the codes used for multiple byte encodings lie in the part of the ASCII character set that is unused.
In order to convert Unicode to UTF-8 in Java, we use the getBytes() method. The getBytes() method encodes a String into a sequence of bytes and returns a byte array. Declaration - The getBytes() method is declared as follows.
You can do this with the following (from the NFD example in this Core Java Technology Tech Tip):
public static String decompose(String s) {
return java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFD).replaceAll("\\p{InCombiningDiacriticalMarks}+","");
}
The uni2ascii program is written in C, but you could probably convert it to Java with little effort. It contains a large table of approximations (implicitly, in the switch-case statements).
Be aware that there are no universally accepted approximations: Germans want you to replace Ä by AE, Finns and Swedes prefer just A. Your example of Å isn't obvious either: Swedes would probably just drop the ring and use A, but Danes and Norwegians might like the historically more correct AA better.
Instead of creating your own table, you could instead convert the text to normalization form D, where the characters are represented as a base character plus the diacritics (for instance, "á" will be replaced by "a" followed by a combining acute accent). You can then strip everything which is not an ASCII letter.
The tables still exist, but are now the ones from the Unicode standard.
You could also try NFKD instead of NFD, to catch even more cases.
References:
In response to the answer given by Joe Liversedge, the referenced Lucene ISOLatin1AccentFilter no longer exists :
It has been replaced by org.apache.lucene.analysis.ASCIIFoldingFilter :
This class converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists. Characters from the following Unicode blocks are converted; however, only those characters with reasonable ASCII alternatives are converted.
FYI -
This is typically useful in search applications. See the corresponding Lucene ISOLatin1AccentFilter implementation. This isn't really designed for plugging into a random local implementation, but does the trick.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With