How to convert UTF-8 to US-Ascii in Java

Tags: java, ascii, utf-8

We have a system where customers, mainly European, enter texts (in UTF-8) that have to be distributed to different systems. Most of them accept UTF-8, but now we must also distribute the texts to a US system which only accepts 7-bit US-ASCII.

So now we need to translate all European characters to the nearest US-ASCII equivalent. Are there any Java libraries to help with this task?

Right now we've just started adding to a translation table, where Å (Swedish AA) -> A and so on. Where we don't find a match for an entered character, we log it, replace it with a question mark, and try to fix that for the next release. This seems very inefficient, and somebody else must have done something similar before.

377

asked Nov 12 '08 20:11

Ulf Lindback

5 Answers

You can do this with the following (from the NFD example in this Core Java Technology Tech Tip):

// Decompose to NFD, then remove the combining diacritical marks (U+0300 to U+036F).
public static String decompose(String s) {
    return java.text.Normalizer.normalize(s, java.text.Normalizer.Form.NFD)
            .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
}

139

answered Sep 28 '22 11:09

Simon Lieschke

The uni2ascii program is written in C, but you could probably convert it to Java with little effort. It contains a large table of approximations (implicitly, in the switch-case statements).

Be aware that there are no universally accepted approximations: Germans want you to replace Ä by AE, Finns and Swedes prefer just A. Your example of Å isn't obvious either: Swedes would probably just drop the ring and use A, but Danes and Norwegians might like the historically more correct AA better.
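One way to accommodate these differing preferences is to apply a small per-locale override table before stripping diacritics. The following is only a sketch: the class name, the `GERMAN` and `DANISH` tables, and the `fold` method are hypothetical, and the table entries beyond Ä -> AE and Å -> AA are illustrative assumptions rather than anything taken from uni2ascii.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Map;

class LocaleAwareFoldSketch {
    // Hypothetical override tables; the entries are illustrative, not exhaustive.
    static final Map<Character, String> GERMAN = new HashMap<>();
    static final Map<Character, String> DANISH = new HashMap<>();
    static {
        GERMAN.put('\u00C4', "AE"); // Ä
        GERMAN.put('\u00D6', "OE"); // Ö
        GERMAN.put('\u00DC', "UE"); // Ü
        GERMAN.put('\u00E4', "ae"); // ä
        GERMAN.put('\u00F6', "oe"); // ö
        GERMAN.put('\u00FC', "ue"); // ü
        DANISH.put('\u00C5', "AA"); // Å
        DANISH.put('\u00E5', "aa"); // å
    }

    // Apply the locale overrides first, then strip remaining diacritics via NFD.
    static String fold(String input, Map<Character, String> overrides) {
        // Compose first so that accented letters are single chars the table can match.
        String composed = Normalizer.normalize(input, Normalizer.Form.NFC);
        StringBuilder sb = new StringBuilder(composed.length());
        for (char c : composed.toCharArray()) {
            String replacement = overrides.get(c);
            if (replacement != null) {
                sb.append(replacement);
            } else {
                sb.append(c);
            }
        }
        return Normalizer.normalize(sb, Normalizer.Form.NFD)
                .replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        System.out.println(fold("\u00C4rger", GERMAN));                    // AErger
        System.out.println(fold("\u00C5rhus", DANISH));                    // AArhus
        System.out.println(fold("\u00C5rhus", new HashMap<>()));           // Arhus
    }
}
```

Passing an empty table gives the "just drop the mark" behaviour that Swedes and Finns would reportedly prefer; which table to use per text is a product decision this sketch does not address.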

answered Sep 28 '22 11:09

Jouni K. Seppänen

Instead of creating your own table, you could convert the text to normalization form D, where the characters are represented as a base character plus the diacritics (for instance, "á" will be replaced by "a" followed by a combining acute accent). You can then strip everything which is not an ASCII character.

The tables still exist, but are now the ones from the Unicode standard.

You could also try NFKD instead of NFD, to catch even more cases.
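A minimal sketch of this approach, assuming NFKD and combining it with the question's own fallback of substituting a question mark: combining marks are dropped, ASCII is kept, and any other leftover character becomes `?` so it can be logged. The class and method names are hypothetical.

```java
import java.text.Normalizer;

class AsciiFoldSketch {
    // Decompose with NFKD, keep 7-bit ASCII, drop combining marks,
    // and replace every other character with '?'.
    static String toAscii(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                out.append((char) cp);           // already US-ASCII
                continue;
            }
            int type = Character.getType(cp);
            boolean isMark = type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK;
            if (!isMark) {
                out.append('?');                 // no ASCII approximation found
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toAscii("\u00C5ngstr\u00F6m")); // Ångström -> Angstrom
        System.out.println(toAscii("\uFB01n"));            // "fi" ligature + n -> fin
        System.out.println(toAscii("\u00D8l"));            // Øl -> ?l
    }
}
```

The last call shows the limit of normalization: letters such as Ø have no Unicode decomposition, so they still need a table entry or end up as `?`.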

References:

http://unicode.org/reports/tr15/
http://www.siao2.com/2005/02/19/376617.aspx
http://www.siao2.com/2007/05/14/2629747.aspx

answered Sep 28 '22 12:09

CesarB

In response to the answer given by Joe Liversedge: the referenced Lucene ISOLatin1AccentFilter no longer exists.

It has been replaced by org.apache.lucene.analysis.ASCIIFoldingFilter:

This class converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists. Characters from the following Unicode blocks are converted; however, only those characters with reasonable ASCII alternatives are converted.


answered Sep 28 '22 12:09

Matt Storer

This is typically useful in search applications. See the corresponding Lucene ISOLatin1AccentFilter implementation. This isn't really designed for plugging into a random local implementation, but does the trick.

answered Sep 28 '22 13:09

Joe Liversedge
