I am reading some text files in a Java program and would like to replace some Unicode characters with ASCII approximations. These files will eventually be broken into sentences that are fed to OpenNLP. OpenNLP does not recognize Unicode characters and gives improper results on a number of symbols (it tokenizes "girl's" as "girl" and "'s" but if it is a Unicode quote it is treated as a single token)..
For example, the source sentence may contain the Unicode directional quotation U2018 (‘) and I would like to convert that to U0027 ('). Eventually I will be stripping the remaining Unicode.
I understand that I am losing information, and I know that I could write regular expressions to convert each of these symbols, but I am asking if there is code I can reuse to convert some of these symbols.
This is what I could, but I'm sure I will make mistakes/miss things/etc.:
// double quotation (") replacements.add(new Replacement(Pattern.compile("[\u201c\u201d\u201e\u201f\u275d\u275e]"), "\"")); // single quotation (') replacements.add(new Replacement(Pattern.compile("[\u2018\u2019\u201a\u201b\u275b\u275c]"), "'"));
replacements is a custom class that I later run over and apply the replacements.
for (Replacement replacement : replacements) { text = replacement.pattern.matcher(text).replaceAll(r.replacement); }
As you can see, I had to find:
I found a pretty extensive table that maps Unicode punctuation to their closest ASCII equivalents.
Here's more info: Map Symbols & Punctuation to ASCII.
I followed @marek-stoj's link and created a Scala application that cleans unicode out of strings while maintaining the string length. It remove diacritics (accents) and uses the map suggested by @marek-stoj to convert non-Ascii unicode characters to their ascii approximations.
import java.text.Normalizer object Asciifier { def apply(string: String) = { var cleaned = string for ((unicode, ascii) <- substitutions) { cleaned = cleaned.replaceAll(unicode, ascii) } // convert diacritics to a two-character form (NFD) // http://docs.oracle.com/javase/tutorial/i18n/text/normalizerapi.html cleaned = Normalizer.normalize(cleaned, Normalizer.Form.NFD) // remove all characters that combine with the previous character // to form a diacritic. Also remove control characters. // http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html cleaned.replaceAll("[\\p{InCombiningDiacriticalMarks}\\p{Cntrl}]", "") // size must not change require(cleaned.size == string.size) cleaned } val substitutions = Set( (0x00AB, '"'), (0x00AD, '-'), (0x00B4, '\''), (0x00BB, '"'), (0x00F7, '/'), (0x01C0, '|'), (0x01C3, '!'), (0x02B9, '\''), (0x02BA, '"'), (0x02BC, '\''), (0x02C4, '^'), (0x02C6, '^'), (0x02C8, '\''), (0x02CB, '`'), (0x02CD, '_'), (0x02DC, '~'), (0x0300, '`'), (0x0301, '\''), (0x0302, '^'), (0x0303, '~'), (0x030B, '"'), (0x030E, '"'), (0x0331, '_'), (0x0332, '_'), (0x0338, '/'), (0x0589, ':'), (0x05C0, '|'), (0x05C3, ':'), (0x066A, '%'), (0x066D, '*'), (0x200B, ' '), (0x2010, '-'), (0x2011, '-'), (0x2012, '-'), (0x2013, '-'), (0x2014, '-'), (0x2015, '-'), (0x2016, '|'), (0x2017, '_'), (0x2018, '\''), (0x2019, '\''), (0x201A, ','), (0x201B, '\''), (0x201C, '"'), (0x201D, '"'), (0x201E, '"'), (0x201F, '"'), (0x2032, '\''), (0x2033, '"'), (0x2034, '\''), (0x2035, '`'), (0x2036, '"'), (0x2037, '\''), (0x2038, '^'), (0x2039, '<'), (0x203A, '>'), (0x203D, '?'), (0x2044, '/'), (0x204E, '*'), (0x2052, '%'), (0x2053, '~'), (0x2060, ' '), (0x20E5, '\\'), (0x2212, '-'), (0x2215, '/'), (0x2216, '\\'), (0x2217, '*'), (0x2223, '|'), (0x2236, ':'), (0x223C, '~'), (0x2264, '<'), (0x2265, '>'), (0x2266, '<'), (0x2267, '>'), (0x2303, '^'), (0x2329, '<'), (0x232A, '>'), (0x266F, '#'), (0x2731, '*'), (0x2758, '|'), (0x2762, '!'), (0x27E6, '['), (0x27E8, '<'), (0x27E9, '>'), (0x2983, '{'), (0x2984, '}'), (0x3003, '"'), (0x3008, '<'), (0x3009, '>'), (0x301B, ']'), (0x301C, '~'), (0x301D, '"'), (0x301E, '"'), (0xFEFF, ' ')).map { case (unicode, ascii) => (unicode.toChar.toString, ascii.toString) } }
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With