Converting Symbols, Accent Letters to English Alphabet

Question

Reposting my post from How do I remove diacritics (accents) from a string in .NET?

This method works fine in java (purely for the purpose of removing diacritical marks aka accents).

It basically converts all accented characters into their deAccented counterparts followed by their combining diacritics. Now you can use a regex to strip off the diacritics.

import java.text.Normalizer;
import java.util.regex.Pattern;

public String deAccent(String str) {
    String nfdNormalizedString = Normalizer.normalize(str, Normalizer.Form.NFD); 
    Pattern pattern = Pattern.compile("\p{InCombiningDiacriticalMarks}+");
    return pattern.matcher(nfdNormalizedString).replaceAll("");
}

It's a part of Apache Commons Lang as of ver. 3.0.

org.apache.commons.lang3.StringUtils.stripAccents("Añ");

returns An

Also see http://www.drillio.com/en/software-development/java/removing-accents-diacritics-in-any-language/

Attempting to "convert them all" is the wrong approach to the problem.

Firstly, you need to understand the limitations of what you are trying to do. As others have pointed out, diacritics are there for a reason: they are essentially unique letters in the alphabet of that language with their own meaning / sound etc.: removing those marks is just the same as replacing random letters in an English word. This is before you even go onto consider the Cyrillic languages and other script based texts such as Arabic, which simply cannot be "converted" to English.

If you must, for whatever reason, convert characters, then the only sensible way to approach this it to firstly reduce the scope of the task at hand. Consider the source of the input - if you are coding an application for "the Western world" (to use as good a phrase as any), it would be unlikely that you would ever need to parse Arabic characters. Similarly, the Unicode character set contains hundreds of mathematical and pictorial symbols: there is no (easy) way for users to directly enter these, so you can assume they can be ignored.

By taking these logical steps you can reduce the number of possible characters to parse to the point where a dictionary based lookup / replace operation is feasible. It then becomes a small amount of slightly boring work creating the dictionaries, and a trivial task to perform the replacement. If your language supports native Unicode characters (as Java does) and optimises static structures correctly, such find and replaces tend to be blindingly quick.

This comes from experience of having worked on an application that was required to allow end users to search bibliographic data that included diacritic characters. The lookup arrays (as it was in our case) took perhaps 1 man day to produce, to cover all diacritic marks for all Western European languages.

Since the encoding that turns "the Family" into "tђє Ŧค๓เℓy" is effectively random and not following any algorithm that can be explained by the information of the Unicode codepoints involved, there's no general way to solve this algorithmically.

You will need to build the mapping of Unicode characters into latin characters which they resemble. You could probably do this with some smart machine learning on the actual glyphs representing the Unicode codepoints. But I think the effort for this would be greater than manually building that mapping. Especially if you have a good amount of examples from which you can build your mapping.

To clarify: a few of the substitutions can actually be solved via the Unicode data (as the other answers demonstrate), but some letters simply have no reasonable association with the latin characters which they resemble.

Examples:

"ђ" (U+0452 CYRILLIC SMALL LETTER DJE) is more related to "d" than to "h", but is used to represent "h".
"Ŧ" (U+0166 LATIN CAPITAL LETTER T WITH STROKE) is somewhat related to "T" (as the name suggests) but is used to represent "F".
"ค" (U+0E04 THAI CHARACTER KHO KHWAI) is not related to any latin character at all and in your example is used to represent "a"

String tested : ÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝß

Tested :

Output from Apache Commons Lang3 : AAAAAÆCEEEEIIIIÐNOOOOOØUUUUYß
Output from ICU4j : AAAAAÆCEEEEIIIIÐNOOOOOØUUUUYß
Output from JUnidecode : AAAAAAECEEEEIIIIDNOOOOOOUUUUUss (problem with Ý and another issue)
Output from Unidecode : AAAAAAECEEEEIIIIDNOOOOOOUUUUYss

The last choice is the best.

The original request has been answered already.

However, I am posting the below answer for those who might be looking for generic transliteration code to transliterate any charset to Latin/English in Java.

Naive meaning of tranliteration: Translated string in it's final form/target charset sounds like the string in it's original form. If we want to transliterate any charset to Latin(English alphabets), then ICU4(ICU4J library in java ) will do the job.

Here is the code snippet in java:

    import com.ibm.icu.text.Transliterator; //ICU4J library import

    public static String TRANSLITERATE_ID = "NFD; Any-Latin; NFC";
    public static String NORMALIZE_ID = "NFD; [:Nonspacing Mark:] Remove; NFC";

    /**
    * Returns the transliterated string to convert any charset to latin.
    */
    public static String transliterate(String input) {
        Transliterator transliterator = Transliterator.getInstance(TRANSLITERATE_ID + "; " + NORMALIZE_ID);
        String result = transliterator.transliterate(input);
        return result;
    }

If the need is to convert "òéışöç->oeisoc", you can use this a starting point :

public class AsciiUtils {
    private static final String PLAIN_ASCII =
      "AaEeIiOoUu"    // grave
    + "AaEeIiOoUuYy"  // acute
    + "AaEeIiOoUuYy"  // circumflex
    + "AaOoNn"        // tilde
    + "AaEeIiOoUuYy"  // umlaut
    + "Aa"            // ring
    + "Cc"            // cedilla
    + "OoUu"          // double acute
    ;

    private static final String UNICODE =
     "\u00C0\u00E0\u00C8\u00E8\u00CC\u00EC\u00D2\u00F2\u00D9\u00F9"             
    + "\u00C1\u00E1\u00C9\u00E9\u00CD\u00ED\u00D3\u00F3\u00DA\u00FA\u00DD\u00FD" 
    + "\u00C2\u00E2\u00CA\u00EA\u00CE\u00EE\u00D4\u00F4\u00DB\u00FB\u0176\u0177" 
    + "\u00C3\u00E3\u00D5\u00F5\u00D1\u00F1"
    + "\u00C4\u00E4\u00CB\u00EB\u00CF\u00EF\u00D6\u00F6\u00DC\u00FC\u0178\u00FF" 
    + "\u00C5\u00E5"                                                             
    + "\u00C7\u00E7" 
    + "\u0150\u0151\u0170\u0171" 
    ;

    // private constructor, can't be instanciated!
    private AsciiUtils() { }

    // remove accentued from a string and replace with ascii equivalent
    public static String convertNonAscii(String s) {
       if (s == null) return null;
       StringBuilder sb = new StringBuilder();
       int n = s.length();
       for (int i = 0; i < n; i++) {
          char c = s.charAt(i);
          int pos = UNICODE.indexOf(c);
          if (pos > -1){
              sb.append(PLAIN_ASCII.charAt(pos));
          }
          else {
              sb.append(c);
          }
       }
       return sb.toString();
    }

    public static void main(String args[]) {
       String s = 
         "The result : È,É,Ê,Ë,Û,Ù,Ï,Î,À,Â,Ô,è,é,ê,ë,û,ù,ï,î,à,â,ô,ç";
       System.out.println(AsciiUtils.convertNonAscii(s));
       // output : 
       // The result : E,E,E,E,U,U,I,I,A,A,O,e,e,e,e,u,u,i,i,a,a,o,c
    }
}

The JDK 1.6 provides the java.text.Normalizer class that can be used for this task.

See an example here

Converting Symbols, Accent Letters to English Alphabet

Tags:

java

unicode

diacritics

special-characters

Recent Activity

Donate For Us

Converting Symbols, Accent Letters to English Alphabet

Tags:

java

unicode

diacritics

special-characters

Related questions

Recent Activity

Donate For Us