I am looking for an algorithm that can map characters with diacritics (tilde, circumflex, caret, umlaut, caron) to their "simple" base character.
For example:
ń ǹ ň ñ ṅ ņ ṇ ṋ ṉ ̈ ɲ ƞ ᶇ ɳ ȵ --> n
á --> a
ä --> a
ấ --> a
ṏ --> o
Etc.
I want to do this in Java, although I suspect it should be something Unicode-y and should be doable reasonably easily in any language.
Purpose: to make it easy to search for words with diacritical marks. For example, if I have a database of tennis players and Björn_Borg is entered, I will also store Bjorn_Borg so that the entry can be found when someone types Bjorn instead of Björn.
I have done this recently in Java:
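    import java.text.Normalizer;
    import java.util.regex.Pattern;

    public static final Pattern DIACRITICS_AND_FRIENDS
        = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

    private static String stripDiacritics(String str) {
        // NFD splits e.g. "ö" into "o" followed by a combining diaeresis
        str = Normalizer.normalize(str, Normalizer.Form.NFD);
        // Drop the combining marks (plus modifier letters and modifier symbols)
        str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
        return str;
    }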
This does what you asked for:
    stripDiacritics("Björn") = Bjorn
but it fails on, for example, Białystok, because the ł character is not a base letter plus a diacritic. It has no Unicode decomposition, so normalization leaves it unchanged.
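To see why ł slips through, it can help to print the code points that NFD normalization produces. The following is a minimal sketch, not part of the original answer; the class name `NfdDemo` and the helper `nfdCodePoints` are made up for illustration.

```java
import java.text.Normalizer;

class NfdDemo {
    // Hypothetical helper: returns the code points of the NFD form as "U+XXXX" tokens.
    static String nfdCodePoints(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        StringBuilder sb = new StringBuilder();
        nfd.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // o with diaeresis decomposes into base letter + combining mark
        System.out.println(nfdCodePoints("\u00F6")); // prints "U+006F U+0308"
        // l with stroke has no decomposition and stays a single code point
        System.out.println(nfdCodePoints("\u0142")); // prints "U+0142"
    }
}
```

Since nothing is split off from ł, the regex has no combining mark to remove, which is why a separate replacement table is needed for such letters.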
If you want a full-blown string simplifier, you will need a second cleanup round for special characters that are not diacritics. In this map, I have included the most common special characters that appear in our customer names. It is not a complete list, but it should give you an idea of how to extend it. ImmutableMap is just a simple class from google-collections (now part of Guava).
    import java.text.Normalizer;
    import java.util.regex.Pattern;
    import com.google.common.collect.ImmutableMap;

    public class StringSimplifier {
        public static final char DEFAULT_REPLACE_CHAR = '-';
        public static final String DEFAULT_REPLACE = String.valueOf(DEFAULT_REPLACE_CHAR);

        private static final ImmutableMap<String, String> NONDIACRITICS = ImmutableMap.<String, String>builder()
            // Remove crap strings with no semantics
            .put(".", "")
            .put("\"", "")
            .put("'", "")

            // Keep relevant characters as separation
            .put(" ", DEFAULT_REPLACE)
            .put("]", DEFAULT_REPLACE)
            .put("[", DEFAULT_REPLACE)
            .put(")", DEFAULT_REPLACE)
            .put("(", DEFAULT_REPLACE)
            .put("=", DEFAULT_REPLACE)
            .put("!", DEFAULT_REPLACE)
            .put("/", DEFAULT_REPLACE)
            .put("\\", DEFAULT_REPLACE)
            .put("&", DEFAULT_REPLACE)
            .put(",", DEFAULT_REPLACE)
            .put("?", DEFAULT_REPLACE)
            .put("°", DEFAULT_REPLACE) // Remove ?? is diacritic?
            .put("|", DEFAULT_REPLACE)
            .put("<", DEFAULT_REPLACE)
            .put(">", DEFAULT_REPLACE)
            .put(";", DEFAULT_REPLACE)
            .put(":", DEFAULT_REPLACE)
            .put("_", DEFAULT_REPLACE)
            .put("#", DEFAULT_REPLACE)
            .put("~", DEFAULT_REPLACE)
            .put("+", DEFAULT_REPLACE)
            .put("*", DEFAULT_REPLACE)

            // Replace non-diacritics with their equivalent characters
            .put("\u0141", "l") // BiaLystock
            .put("\u0142", "l") // Bialystock
            .put("ß", "ss")
            .put("æ", "ae")
            .put("ø", "o")
            .put("©", "c")
            .put("\u00D0", "d") // All Ð ð from http://de.wikipedia.org/wiki/%C3%90
            .put("\u00F0", "d")
            .put("\u0110", "d")
            .put("\u0111", "d")
            .put("\u0189", "d")
            .put("\u0256", "d")
            .put("\u00DE", "th") // thorn Þ
            .put("\u00FE", "th") // thorn þ
            .build();

        public static String simplifiedString(String orig) {
            String str = orig;
            if (str == null) {
                return null;
            }
            str = stripDiacritics(str);
            str = stripNonDiacritics(str);
            if (str.length() == 0) {
                // Ugly special case to work around non-existing empty strings
                // in Oracle. Store original crapstring as simplified.
                // It would return an empty string if Oracle could store it.
                return orig;
            }
            return str.toLowerCase();
        }

        private static String stripNonDiacritics(String orig) {
            StringBuilder ret = new StringBuilder();
            String lastchar = null;
            for (int i = 0; i < orig.length(); i++) {
                String source = orig.substring(i, i + 1);
                String replace = NONDIACRITICS.get(source);
                String toReplace = replace == null ? String.valueOf(source) : replace;
                // Collapse runs of separators into a single one
                if (DEFAULT_REPLACE.equals(lastchar) && DEFAULT_REPLACE.equals(toReplace)) {
                    toReplace = "";
                } else {
                    lastchar = toReplace;
                }
                ret.append(toReplace);
            }
            // Drop a trailing separator
            if (ret.length() > 0 && DEFAULT_REPLACE_CHAR == ret.charAt(ret.length() - 1)) {
                ret.deleteCharAt(ret.length() - 1);
            }
            return ret.toString();
        }

        /*
         * Special regular expression character ranges relevant for simplification
         * -> see http://docstore.mik.ua/orelly/perl/prog3/ch05_04.htm
         * InCombiningDiacriticalMarks: special marks that are part of "normal" ä, ö, î etc.
         * IsSk: Symbol, Modifier, see http://www.fileformat.info/info/unicode/category/Sk/list.htm
         * IsLm: Letter, Modifier, see http://www.fileformat.info/info/unicode/category/Lm/list.htm
         */
        public static final Pattern DIACRITICS_AND_FRIENDS
            = Pattern.compile("[\\p{InCombiningDiacriticalMarks}\\p{IsLm}\\p{IsSk}]+");

        private static String stripDiacritics(String str) {
            str = Normalizer.normalize(str, Normalizer.Form.NFD);
            str = DIACRITICS_AND_FRIENDS.matcher(str).replaceAll("");
            return str;
        }
    }
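The lookup described in the question (store a simplified key next to the original name, and simplify the query the same way) could look roughly like the sketch below. This is an illustration under assumptions, not the code above: the class `NameIndex` and its methods `key`, `add` and `find` are hypothetical names, an in-memory map stands in for the database column, and only the combining-marks block is stripped to keep the example short and free of external libraries.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

class NameIndex {
    // Assumption: stripping only the combining-marks block is enough for this demo.
    private static final Pattern MARKS =
        Pattern.compile("\\p{InCombiningDiacriticalMarks}+");

    // Simplified key -> original names that share it
    private final Map<String, List<String>> byKey = new HashMap<>();

    // Hypothetical helper: decompose, drop combining marks, lowercase.
    static String key(String s) {
        String nfd = Normalizer.normalize(s, Normalizer.Form.NFD);
        return MARKS.matcher(nfd).replaceAll("").toLowerCase(Locale.ROOT);
    }

    // Store the original name under its simplified key.
    void add(String name) {
        byKey.computeIfAbsent(key(name), k -> new ArrayList<>()).add(name);
    }

    // Simplify the query the same way before looking it up.
    List<String> find(String query) {
        return byKey.getOrDefault(key(query), Collections.emptyList());
    }

    public static void main(String[] args) {
        NameIndex index = new NameIndex();
        index.add("Bj\u00F6rn_Borg");
        // Both spellings resolve to the same stored entry.
        System.out.println(index.find("Bjorn_Borg").size());        // prints 1
        System.out.println(index.find("Bj\u00F6rn_Borg").size());   // prints 1
    }
}
```

The point of the design is that the same simplification runs on both the stored value and the search input, so it does not matter which spelling the user types.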