We accept all sorts of national characters in a UTF-8 string on input, and we need to convert them to an ASCII string on output for some legacy use. (We don't accept Chinese or Japanese characters, only European languages.)
We have a small utility to get rid of all the diacritics:
import java.io.UnsupportedEncodingException;
import java.text.Normalizer;

public static final String toBaseCharacters(final String sText) {
    if (sText == null || sText.length() == 0)
        return sText;
    final char[] chars = sText.toCharArray();
    final int iSize = chars.length;
    final StringBuilder sb = new StringBuilder(iSize);
    for (int i = 0; i < iSize; i++) {
        String sLetter = new String(new char[] { chars[i] });
        // NFD (not NFC) decomposes e.g. 'é' into 'e' plus a combining accent,
        // so the first UTF-8 byte below is the plain ASCII base letter.
        sLetter = Normalizer.normalize(sLetter, Normalizer.Form.NFD);
        try {
            byte[] bLetter = sLetter.getBytes("UTF-8");
            sb.append((char) bLetter[0]);
        } catch (UnsupportedEncodingException e) {
            // cannot happen: UTF-8 is always supported
        }
    }
    return sb.toString();
}
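For illustration, this is roughly how the utility behaves (a sketch, not verified output; the exact garbage character depends on how the signed byte is cast to char):

toBaseCharacters("Blaž"); // "Blaz" -- ž decomposes into 'z' + a combining caron
toBaseCharacters("Blaß"); // "Bla" + one garbage char -- ß has no decomposition,
                          // so its first UTF-8 byte (0xC3) is appended as-is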
The question is how to replace the German sharp s (ß) and other characters that get through the above normalization method, such as Đ and đ, with their replacements (in the case of ß the replacement would probably be "ss", and in the case of Đ it would be either "D" or "Dj").
Is there some simple way to do it, without a million .replaceAll() calls?
So for example: Đonardan = Djonardan, Blaß = Blass, and so on.
We could replace all "problematic" characters with a space, but we would like to avoid that, to keep the output as similar to the input as possible.
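For what it's worth, the special cases we know about so far would amount to something like the following table (the field name is made up, just to illustrate the scale of the problem):

import java.util.HashMap;
import java.util.Map;

// Hypothetical pre-replacement table, applied before toBaseCharacters()
// for letters that NFD decomposition cannot break down.
private static final Map<Character, String> SPECIAL_CASES = new HashMap<Character, String>();
static {
    SPECIAL_CASES.put('ß', "ss");
    SPECIAL_CASES.put('Đ', "Dj");
    SPECIAL_CASES.put('đ', "dj");
}

Maintaining such a table by hand is exactly what we would like to avoid.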
Thank you for your answers,
Bozo
You want to use ICU4J. It includes the com.ibm.icu.text.Transliterator class, which apparently can do what you are looking for.
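Something along these lines should work (a minimal sketch, untested; it assumes an ICU4J version that ships the CLDR Latin-ASCII transform, and the class name AsciiFolder is made up):

import com.ibm.icu.text.Transliterator;

public class AsciiFolder {
    // "Any-Latin; Latin-ASCII" first transliterates to Latin script, then
    // folds to ASCII; per the CLDR rules it should expand ß to "ss" and
    // strip the stroke from Đ, giving "D".
    private static final Transliterator TO_ASCII =
            Transliterator.getInstance("Any-Latin; Latin-ASCII");

    public static String toAscii(String input) {
        return input == null ? null : TO_ASCII.transliterate(input);
    }
}

If you need the "Dj" spelling rather than the plain "D", you could still prepend a custom rule or a tiny replacement map for that one character.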
Here's my converter, which uses Lucene:
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.ASCIIFoldingFilter;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Lucene 2.9/3.x API: KeywordTokenizer emits the entire input as a single
// token, and ASCIIFoldingFilter folds it to ASCII (é -> e, ß -> ss, ...).
private final KeywordTokenizer keywordTokenizer = new KeywordTokenizer(new StringReader(""));
private final ASCIIFoldingFilter asciiFoldingFilter = new ASCIIFoldingFilter(keywordTokenizer);
private final TermAttribute termAttribute = (TermAttribute) asciiFoldingFilter.getAttribute(TermAttribute.class);

public String process(String line)
{
    if (line != null)
    {
        try
        {
            // Point the reused tokenizer at the new input.
            keywordTokenizer.reset(new StringReader(line));
            if (asciiFoldingFilter.incrementToken())
            {
                return termAttribute.term();
            }
        }
        catch (IOException e)
        {
            logger.warn("Failed to parse: " + line, e);
        }
    }
    return null;
}
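Usage would look something like this (expected results based on ASCIIFoldingFilter's mapping tables, not verified here):

process("Blaß");     // "Blass"    (ß -> ss)
process("Đonardan"); // "Donardan" (Đ -> D)

Note that the filter folds Đ to a plain "D", so if you specifically want the "Dj" spelling you would still need a small replacement step of your own for that character.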