
Remove diacritical marks (ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ) from Unicode chars

I am looking for an algorithm that can map characters with diacritics (tilde, circumflex, caret, umlaut, caron) to their "simple" character.

For example:

ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ --> n
á --> a
ä --> a
ấ --> a
ṏ --> o

Etc.

  1. I want to do this in Java, although I suspect it should be something Unicode-y and should be doable reasonably easily in any language.

  2. Purpose: to make it easy to search for words with diacritical marks. For example, if I have a database of tennis players and Björn_Borg is entered, I will also store Bjorn_Borg so it can be found when someone enters Bjorn rather than Björn.

flybywire asked Sep 21 '09 07:09



1 Answer

I have done this recently in Java:

    public static final Pattern DIACRITICS_AND_FRIENDS
        = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    private static String stripDiacritics(String str) {
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
        return str;
    }

This will do as you specified:

stripDiacritics("Björn")  = Bjorn 

but it will fail on, for example, Białystok, because the ł character is not a letter plus a diacritic: it has no Unicode decomposition, so normalization leaves it untouched.
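To see why ö is handled and ł is not, it helps to print the code points that NFD normalization produces. The following is a minimal sketch; the class name `DecompositionDemo` and the helper `nfdCodePoints` are made up for illustration and are not part of the code in this answer.

```java
import java.text.Normalizer;

class DecompositionDemo {
    // Hypothetical helper: lists the code points of the NFD form as "U+XXXX" tokens.
    static String nfdCodePoints(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        nfd.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // o-umlaut splits into a base letter and a combining diaeresis.
        System.out.println(nfdCodePoints("\u00F6")); // U+006F U+0308
        // l-with-stroke has no canonical decomposition and stays a single code point.
        System.out.println(nfdCodePoints("\u0142")); // U+0142
    }
}
```

Only the first case leaves a combining mark behind for the regular expression to remove, which is why ł needs separate handling.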

If you want a full-blown string simplifier, you will need a second cleanup round for special characters that are not diacritics. In this map, I have included the most common special characters that appear in our customer names. It is not a complete list, but it will give you an idea of how to extend it. ImmutableMap is just a simple class from google-collections (now part of Guava).

    import java.text.Normalizer;
    import java.util.regex.Pattern;

    import com.google.common.collect.ImmutableMap;

    public class StringSimplifier {
        public static final char DEFAULT_REPLACE_CHAR = '-';
        public static final String DEFAULT_REPLACE = String.valueOf(DEFAULT_REPLACE_CHAR);

        private static final ImmutableMap<String, String> NONDIACRITICS = ImmutableMap.<String, String>builder()
            // Remove characters with no semantics
            .put(".", "")
            .put("\"", "")
            .put("'", "")

            // Keep relevant characters as separators
            .put(" ", DEFAULT_REPLACE)
            .put("]", DEFAULT_REPLACE)
            .put("[", DEFAULT_REPLACE)
            .put(")", DEFAULT_REPLACE)
            .put("(", DEFAULT_REPLACE)
            .put("=", DEFAULT_REPLACE)
            .put("!", DEFAULT_REPLACE)
            .put("/", DEFAULT_REPLACE)
            .put("\\", DEFAULT_REPLACE)
            .put("&", DEFAULT_REPLACE)
            .put(",", DEFAULT_REPLACE)
            .put("?", DEFAULT_REPLACE)
            .put("°", DEFAULT_REPLACE) // Remove ?? is diacritic?
            .put("|", DEFAULT_REPLACE)
            .put("<", DEFAULT_REPLACE)
            .put(">", DEFAULT_REPLACE)
            .put(";", DEFAULT_REPLACE)
            .put(":", DEFAULT_REPLACE)
            .put("_", DEFAULT_REPLACE)
            .put("#", DEFAULT_REPLACE)
            .put("~", DEFAULT_REPLACE)
            .put("+", DEFAULT_REPLACE)
            .put("*", DEFAULT_REPLACE)

            // Replace non-diacritics with their equivalent characters
            .put("\u0141", "l") // Ł as in BIAŁYSTOK
            .put("\u0142", "l") // ł as in Białystok
            .put("ß", "ss")
            .put("æ", "ae")
            .put("ø", "o")
            .put("©", "c")
            .put("\u00D0", "d") // All Ð ð from http://de.wikipedia.org/wiki/%C3%90
            .put("\u00F0", "d")
            .put("\u0110", "d")
            .put("\u0111", "d")
            .put("\u0189", "d")
            .put("\u0256", "d")
            .put("\u00DE", "th") // thorn Þ
            .put("\u00FE", "th") // thorn þ
            .build();

        public static String simplifiedString(String orig) {
            String str = orig;
            if (str == null) {
                return null;
            }
            str = stripDiacritics(str);
            str = stripNonDiacritics(str);
            if (str.length() == 0) {
                // Ugly special case to work around non-existing empty strings
                // in Oracle. Store the original string as the simplified one.
                // It would return an empty string if Oracle could store it.
                return orig;
            }
            return str.toLowerCase();
        }

        private static String stripNonDiacritics(String orig) {
            StringBuilder ret = new StringBuilder();
            String lastchar = null;
            for (int i = 0; i < orig.length(); i++) {
                String source = orig.substring(i, i + 1);
                String replace = NONDIACRITICS.get(source);
                String toReplace = replace == null ? String.valueOf(source) : replace;
                // Collapse runs of separators into a single one
                if (DEFAULT_REPLACE.equals(lastchar) && DEFAULT_REPLACE.equals(toReplace)) {
                    toReplace = "";
                } else {
                    lastchar = toReplace;
                }
                ret.append(toReplace);
            }
            // Drop a trailing separator
            if (ret.length() > 0 && DEFAULT_REPLACE_CHAR == ret.charAt(ret.length() - 1)) {
                ret.deleteCharAt(ret.length() - 1);
            }
            return ret.toString();
        }

        /*
        Special regular expression character ranges relevant for simplification
        -> see http://docstore.mik.ua/orelly/perl/prog3/ch05_04.htm
        InCombiningDiacriticalMarks: special marks that are part of "normal" ä, ö, î etc.
        IsSk: Symbol, Modifier see http://www.fileformat.info/info/unicode/category/Sk/list.htm
        IsLm: Letter, Modifier see http://www.fileformat.info/info/unicode/category/Lm/list.htm
        */
        public static final Pattern DIACRITICS_AND_FRIENDS
            = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

        private static String stripDiacritics(String str) {
            str = Normalizer.normalize(str, Normalizer.Form.NFD);
            str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
            return str;
        }
    }
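If you only want to try the two-pass idea without pulling in the library, here is a rough, standard-library-only sketch. It is not the class above: `SimplifierSketch`, `simplify`, and the four-entry map are hypothetical stand-ins, the separator handling is left out entirely, and `Locale.ROOT` is my own choice to keep lowercasing independent of the default locale.

```java
import java.text.Normalizer;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

class SimplifierSketch {
    // Same character classes as the pattern used in this answer.
    static final Pattern DIACRITICS_AND_FRIENDS =
        Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    // Hypothetical, deliberately tiny subset of the replacement map.
    static final Map<Character, String> NONDIACRITICS = new HashMap<>();
    static {
        NONDIACRITICS.put('\u0141', "l");  // L with stroke, upper case
        NONDIACRITICS.put('\u0142', "l");  // l with stroke, lower case
        NONDIACRITICS.put('\u00DF', "ss"); // sharp s
        NONDIACRITICS.put('\u00F8', "o");  // o with stroke
    }

    static String simplify(String s) {
        // Pass 1: decompose, then drop the combining marks.
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        String stripped = DIACRITICS_AND_FRIENDS.matcher(nfd).replaceAll("");
        // Pass 2: replace letters that have no decomposition.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < stripped.length(); i++) {
            char c = stripped.charAt(i);
            String replacement = NONDIACRITICS.get(c);
            out.append(replacement == null ? String.valueOf(c) : replacement);
        }
        return out.toString().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(simplify("Bj\u00F6rn"));      // bjorn
        System.out.println(simplify("Bia\u0142ystok"));  // bialystok
        System.out.println(simplify("Stra\u00DFe"));     // strasse
    }
}
```

The first pass alone would leave Białystok unchanged apart from case, so the map lookup in the second pass is what turns ł into l.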
Andreas Petersson answered Sep 22 '22 14:09
