Is there a way to transliterate characters between charsets in Java? Something similar to the Unix command (or the equivalent PHP function):
iconv -f UTF-8 -t ASCII//TRANSLIT < some_doc.txt > new_doc.txt
Preferably it would operate on strings and have nothing to do with files.
I know you can change encodings with the String constructor, but that doesn't handle transliteration of characters that aren't in the resulting charset.
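To make the limitation concrete, here is a minimal sketch of what a plain re-encoding does. The class and method names (LossyEncodingDemo, reencodeAsAscii) are made up for illustration; only the standard String and StandardCharsets calls are real API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
final class LossyEncodingDemo {

    /** Round-trips text through US-ASCII, as a plain re-encoding would. */
    static String reencodeAsAscii(String text) {
        // getBytes(Charset) replaces every unmappable character with the
        // charset's replacement byte, which is '?' for US-ASCII.
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        return new String(ascii, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // "caf" followed by U+00E9 (e with acute accent)
        System.out.println(reencodeAsAscii("caf\u00E9")); // prints "caf?"
    }
}
```

The accented letter comes back as a question mark rather than a plain "e", which is exactly the behaviour I want to avoid.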
I'm not aware of any libraries that do exactly what iconv purports to do (which doesn't seem very well defined). However, you can use "normalization" in Java to do things like remove accents from characters. This process is well defined by Unicode standards.
I think NFKD (compatibility decomposition) followed by a filtering of non-ASCII characters might get you close to what you want. Obviously, this is a lossy process; you can never recover all of the information that was in the original string, so be careful.
import java.text.Normalizer;

/* Decompose original "accented" string to basic characters. */
String decomposed = Normalizer.normalize(accented, Normalizer.Form.NFKD);
/* Build a new String with only ASCII characters. */
StringBuilder buf = new StringBuilder(decomposed.length());
for (int idx = 0; idx < decomposed.length(); ++idx) {
    char ch = decomposed.charAt(idx);
    /* Combining marks and every other non-ASCII character are dropped. */
    if (ch < 128) {
        buf.append(ch);
    }
}
String filtered = buf.toString();
With the filtering used here, you might render some strings unreadable. For example, a string of Chinese characters would be filtered away completely because none of them have an ASCII representation (this is more like iconv's //IGNORE).
Overall, it would be safer to build your own lookup table of valid character substitutions, or at least of combining characters (accents and things) that are safe to strip. The best solution depends on the range of input characters you expect to handle.
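As a rough sketch of that idea, the class below combines a small substitution table with stripping of combining marks. Everything in it is an assumption for illustration: the class name AsciiFolder, the fold method, the handful of table entries, and the choice of '?' as a placeholder for characters with no known substitute. A real table would need to be built for the input you actually expect.

```java
import java.text.Normalizer;
import java.util.Map;

// Hypothetical helper, not part of any library.
final class AsciiFolder {

    // Illustrative entries only: characters that NFKD does not decompose.
    private static final Map<Integer, String> SUBSTITUTIONS = Map.of(
            0x00C6, "AE",  // LATIN CAPITAL LETTER AE
            0x00E6, "ae",  // LATIN SMALL LETTER AE
            0x00D8, "O",   // LATIN CAPITAL LETTER O WITH STROKE
            0x00F8, "o",   // LATIN SMALL LETTER O WITH STROKE
            0x00DF, "ss",  // LATIN SMALL LETTER SHARP S
            0x2019, "'",   // RIGHT SINGLE QUOTATION MARK
            0x2013, "-");  // EN DASH

    /** Folds input to ASCII; unknown characters become '?' instead of vanishing. */
    static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFKD);
        StringBuilder out = new StringBuilder(decomposed.length());
        for (int i = 0; i < decomposed.length(); ) {
            int cp = decomposed.codePointAt(i);
            i += Character.charCount(cp);
            String mapped = SUBSTITUTIONS.get(cp);
            if (mapped != null) {
                out.append(mapped);          // explicit, known-safe substitution
            } else if (cp < 128) {
                out.append((char) cp);       // already ASCII
            } else if (isCombiningMark(cp)) {
                // accents and similar marks are safe to strip
            } else {
                out.append('?');             // no known substitute: keep a visible marker
            }
        }
        return out.toString();
    }

    private static boolean isCombiningMark(int cp) {
        int type = Character.getType(cp);
        return type == Character.NON_SPACING_MARK
                || type == Character.COMBINING_SPACING_MARK
                || type == Character.ENCLOSING_MARK;
    }
}
```

Emitting a placeholder rather than silently dropping characters keeps the output length roughly honest, so a string of Chinese characters becomes a run of question marks instead of an empty string. Whether that is preferable depends on what you do with the result.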
One solution is to execute iconv as an external process. It will certainly offend purists, and it depends on iconv being present on the system, but it works and does exactly what you want:
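import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public static String utfToAscii(String input) throws IOException {
    Process p = new ProcessBuilder("iconv", "-f", "UTF-8", "-t", "ASCII//TRANSLIT").start();
    /* Send the input as UTF-8 explicitly; the platform default charset may differ. */
    /* Note: all input is written before any output is read, so very large inputs could block. */
    try (Writer out = new OutputStreamWriter(p.getOutputStream(), StandardCharsets.UTF_8)) {
        out.write(input);
    }
    StringBuilder result = new StringBuilder();
    String ls = System.lineSeparator();
    try (BufferedReader in = new BufferedReader(
            new InputStreamReader(p.getInputStream(), StandardCharsets.US_ASCII))) {
        String line;
        while ((line = in.readLine()) != null) {
            /* A line separator is appended after every line, including the last. */
            result.append(line).append(ls);
        }
    }
    try {
        p.waitFor();
    } catch (InterruptedException e) {
        /* Restore the interrupt flag instead of silently swallowing it. */
        Thread.currentThread().interrupt();
    }
    return result.toString();
}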