We accept all sorts of national characters in a UTF-8 string on input, and we need to convert them to an ASCII string on output for some legacy use. (We don't accept Chinese or Japanese characters, only European languages.)
We have a small utility to get rid of all the diacritics:
import java.io.UnsupportedEncodingException;
import java.text.Normalizer;

public static final String toBaseCharacters(final String sText) {
    if (sText == null || sText.length() == 0)
        return sText;
    final char[] chars = sText.toCharArray();
    final int iSize = chars.length;
    final StringBuilder sb = new StringBuilder(iSize);
    for (int i = 0; i < iSize; i++) {
        String sLetter = new String(new char[] { chars[i] });
        // NFD (not NFC) decomposes e.g. 'é' into 'e' plus a combining accent,
        // so the first UTF-8 byte below is the plain ASCII base letter.
        sLetter = Normalizer.normalize(sLetter, Normalizer.Form.NFD);
        try {
            byte[] bLetter = sLetter.getBytes("UTF-8");
            sb.append((char) bLetter[0]);
        } catch (UnsupportedEncodingException e) {
            // cannot happen: UTF-8 is always supported
        }
    }
    return sb.toString();
}
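For illustration, this is roughly how the utility behaves (a sketch, not verified output; the exact garbage character depends on how the signed byte is cast to char):

toBaseCharacters("Blaž"); // "Blaz" -- ž decomposes into 'z' + a combining caron
toBaseCharacters("Blaß"); // "Bla" + one garbage char -- ß has no decomposition,
                          // so its first UTF-8 byte (0xC3) is appended as-is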
The question is how to replace the German sharp s (ß) and other characters that get through the above normalization method, such as Đ and đ, with their replacements (in the case of ß the replacement would probably be "ss", and in the case of Đ it would be either "D" or "Dj").
Is there some simple way to do it, without a million .replaceAll() calls?
So for example: Đonardan = Djonardan, Blaß = Blass, and so on.
We could replace all "problematic" characters with a space, but we would like to avoid that, to keep the output as similar to the input as possible.
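For what it's worth, the special cases we know about so far would amount to something like the following table (the field name is made up, just to illustrate the scale of the problem):

import java.util.HashMap;
import java.util.Map;

// Hypothetical pre-replacement table, applied before toBaseCharacters()
// for letters that NFD decomposition cannot break down.
private static final Map<Character, String> SPECIAL_CASES = new HashMap<Character, String>();
static {
    SPECIAL_CASES.put('ß', "ss");
    SPECIAL_CASES.put('Đ', "Dj");
    SPECIAL_CASES.put('đ', "dj");
}

Maintaining such a table by hand is exactly what we would like to avoid.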
Thank you for your answers,
Bozo
You want to use ICU4J. It includes the com.ibm.icu.text.Transliterator class, which apparently can do what you are looking for.
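Something along these lines should work (a minimal sketch, untested; it assumes an ICU4J version that ships the CLDR Latin-ASCII transform, and the class name AsciiFolder is made up):

import com.ibm.icu.text.Transliterator;

public class AsciiFolder {
    // "Any-Latin; Latin-ASCII" first transliterates to Latin script, then
    // folds to ASCII; per the CLDR rules it should expand ß to "ss" and
    // strip the stroke from Đ, giving "D".
    private static final Transliterator TO_ASCII =
            Transliterator.getInstance("Any-Latin; Latin-ASCII");

    public static String toAscii(String input) {
        return input == null ? null : TO_ASCII.transliterate(input);
    }
}

If you need the "Dj" spelling rather than the plain "D", you could still prepend a custom rule or a tiny replacement map for that one character.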
Here's my converter, which uses Lucene:
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.ASCIIFoldingFilter;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Lucene 2.9/3.x API: KeywordTokenizer emits the entire input as a single
// token, and ASCIIFoldingFilter folds it to ASCII (é -> e, ß -> ss, ...).
private final KeywordTokenizer keywordTokenizer = new KeywordTokenizer(new StringReader(""));
private final ASCIIFoldingFilter asciiFoldingFilter = new ASCIIFoldingFilter(keywordTokenizer);
private final TermAttribute termAttribute = (TermAttribute) asciiFoldingFilter.getAttribute(TermAttribute.class);

public String process(String line)
{
    if (line != null)
    {
        try
        {
            // Point the reused tokenizer at the new input.
            keywordTokenizer.reset(new StringReader(line));
            if (asciiFoldingFilter.incrementToken())
            {
                return termAttribute.term();
            }
        }
        catch (IOException e)
        {
            logger.warn("Failed to parse: " + line, e);
        }
    }
    return null;
}
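Usage would look something like this (expected results based on ASCIIFoldingFilter's mapping tables, not verified here):

process("Blaß");     // "Blass"    (ß -> ss)
process("Đonardan"); // "Donardan" (Đ -> D)

Note that the filter folds Đ to a plain "D", so if you specifically want the "Dj" spelling you would still need a small replacement step of your own for that character.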