
How to convert UTF-8 to US-Ascii in Java

Tags:

java

ascii

utf-8

We have a system where customers, mainly European, enter texts (in UTF-8) that have to be distributed to different systems. Most of them accept UTF-8, but now we must also distribute the texts to a US system that only accepts 7-bit US-ASCII.

So now we need to translate all European characters to their nearest US-ASCII equivalents. Are there any Java libraries to help with this task?

Right now we've just started adding to a translation table, where Å (Swedish AA) -> A and so on. Where we don't find a match for an entered character, we log it, replace it with a question mark, and try to fix that for the next release. This seems very inefficient, and somebody else must have done something similar before.

Ulf Lindback asked Nov 12 '08 20:11



5 Answers

You can do this with the following (from the NFD example in this Core Java Technology Tech Tip):

// Decompose to NFD (base character + combining marks), then drop the combining marks.
public static String decompose(String s) {
    return java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}
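
Note that some characters (ß, ø, æ and others) have no Unicode decomposition, so decompose() leaves them untouched. As a rough sketch, assuming you want the question-mark fallback described in the question, you could wrap it like this. The class name AsciiFallback and the helper toAscii are made up for illustration.

```java
import java.text.Normalizer;

final class AsciiFallback {
    // Same approach as decompose() above: NFD, then drop combining marks.
    static String decompose(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Hypothetical helper: anything still outside 7-bit ASCII becomes '?'.
    static String toAscii(String s) {
        return decompose(s).replaceAll("[^\\x00-\\x7F]", "?");
    }

    public static void main(String[] args) {
        // "Angstrom" written with A-ring and o-umlaut: both decompose.
        System.out.println(toAscii("\u00C5ngstr\u00F6m")); // Angstrom
        // Sharp s has no decomposition, so it hits the fallback.
        System.out.println(toAscii("Stra\u00DFe"));        // Stra?e
        // o-with-stroke has no decomposition either.
        System.out.println(toAscii("S\u00F8ren"));         // S?ren
    }
}
```

The characters that end up as '?' are the ones that would still need a hand-written table entry.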
Simon Lieschke answered Sep 28 '22 11:09


The uni2ascii program is written in C, but you could probably convert it to Java with little effort. It contains a large table of approximations (implicitly, in the switch-case statements).

Be aware that there are no universally accepted approximations: Germans want you to replace Ä by AE, Finns and Swedes prefer just A. Your example of Å isn't obvious either: Swedes would probably just drop the ring and use A, but Danes and Norwegians might like the historically more correct AA better.
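
If you do need per-language conventions, one possible shape is a small override table that is applied before a generic fallback. This is only a sketch and not taken from uni2ascii: the class LocaleAwareFold, the two tables and their entries are hypothetical and just mirror the examples above, the generic fallback is plain NFD stripping, and Map.of assumes Java 9 or later.

```java
import java.text.Normalizer;
import java.util.Map;

final class LocaleAwareFold {
    // Hypothetical override tables; entries only mirror the examples in the text.
    static final Map<Character, String> GERMAN = Map.of('\u00C4', "AE", '\u00E4', "ae");
    static final Map<Character, String> DANISH = Map.of('\u00C5', "AA", '\u00E5', "aa");

    static String fold(String s, Map<Character, String> overrides) {
        // Compose first so that "A + combining ring" matches the precomposed table key.
        String composed = Normalizer.normalize(s, Normalizer.Form.NFC);
        StringBuilder sb = new StringBuilder();
        for (char c : composed.toCharArray()) {
            String replacement = overrides.get(c);
            sb.append(replacement != null ? replacement : String.valueOf(c));
        }
        // Generic fallback for everything without an override: drop the diacritics.
        return Normalizer.normalize(sb, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("B\u00E4r", GERMAN));      // Baer
        System.out.println(fold("B\u00E4r", Map.of()));    // Bar
        System.out.println(fold("\u00C5rhus", DANISH));    // AArhus
        System.out.println(fold("\u00C5rhus", Map.of()));  // Arhus
    }
}
```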

Jouni K. Seppänen answered Sep 28 '22 11:09


Instead of creating your own table, you could instead convert the text to normalization form D, where the characters are represented as a base character plus the diacritics (for instance, "á" will be replaced by "a" followed by a combining acute accent). You can then strip everything which is not an ASCII letter.

The tables still exist, but are now the ones from the Unicode standard.

You could also try NFKD instead of NFD, to catch even more cases.
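
To make the NFD/NFKD difference concrete, here is a small sketch using java.text.Normalizer. The class name NfkdDemo and the helper strip are made up for illustration, and it removes every non-ASCII character rather than only keeping ASCII letters, so spaces and digits survive.

```java
import java.text.Normalizer;

final class NfkdDemo {
    // Normalize with the given form, then remove whatever is still not ASCII.
    static String strip(String s, Normalizer.Form form) {
        return Normalizer.normalize(s, form).replaceAll("[^\\p{ASCII}]", "");
    }

    public static void main(String[] args) {
        // "fiance x2" written with the fi ligature (U+FB01), e-acute and superscript two.
        String s = "\uFB01anc\u00E9 x\u00B2";
        // NFD only handles canonical decompositions: the ligature and superscript are lost.
        System.out.println(strip(s, Normalizer.Form.NFD));  // ance x
        // NFKD also applies compatibility decompositions: the ligature becomes "fi", the superscript "2".
        System.out.println(strip(s, Normalizer.Form.NFKD)); // fiance x2
    }
}
```

Characters without any decomposition (ß, ø and so on) are silently dropped by both variants, so some table or fallback is still needed for them.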

References:

  • http://unicode.org/reports/tr15/
  • http://www.siao2.com/2005/02/19/376617.aspx
  • http://www.siao2.com/2007/05/14/2629747.aspx
CesarB answered Sep 28 '22 12:09


In response to the answer given by Joe Liversedge: the referenced Lucene ISOLatin1AccentFilter no longer exists.

It has been replaced by org.apache.lucene.analysis.ASCIIFoldingFilter, whose documentation says:

This class converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists. Characters from the following Unicode blocks are converted; however, only those characters with reasonable ASCII alternatives are converted.


Matt Storer answered Sep 28 '22 12:09


This is typically useful in search applications. See the corresponding Lucene ISOLatin1AccentFilter implementation. This isn't really designed for plugging into a random local implementation, but does the trick.

Joe Liversedge answered Sep 28 '22 13:09