
How to replace/remove 4(+)-byte characters from a UTF-8 string in Java?

Because MySQL 5.1 does not support 4-byte UTF-8 sequences, I need to replace or drop the 4-byte sequences in these strings.

I'm looking for a clean way to replace these characters.

Replacing the characters with a question mark, as the Apache libraries do, is fine for this case, although an ASCII equivalent would be nicer, of course.

N.B. The input is from external sources (e-mail names) and upgrading the database is not a solution at this point in time.

pvgoddijn asked Feb 13 '12 12:02


2 Answers

We ended up implementing the following method in Java for this problem. It replaces every character whose code point is higher than that of the last 3-byte UTF-8 character (U+FFFF).

The offset calculations make sure we always stay on Unicode code point boundaries.

public final class CharUtils {

    // U+FFFF is the highest code point that UTF-8 encodes in 3 bytes.
    public static final String LAST_3_BYTE_UTF_CHAR = "\uFFFF";
    // U+FFFD is the Unicode replacement character.
    public static final String REPLACEMENT_CHAR = "\uFFFD";

    public static String toValid3ByteUTF8String(String s) {
        final int length = s.length();
        StringBuilder b = new StringBuilder(length);
        for (int offset = 0; offset < length; ) {
            final int codepoint = s.codePointAt(offset);

            if (codepoint > LAST_3_BYTE_UTF_CHAR.codePointAt(0)) {
                // Supplementary code point: needs 4 bytes in UTF-8, so replace it.
                b.append(REPLACEMENT_CHAR);
            } else if (Character.isValidCodePoint(codepoint)) {
                b.appendCodePoint(codepoint);
            } else {
                b.append(REPLACEMENT_CHAR);
            }
            // Advance by 1 or 2 chars so the offset stays on a code point boundary.
            offset += Character.charCount(codepoint);
        }
        return b.toString();
    }
}
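
To see why the loop advances by Character.charCount rather than by one, it can help to look at how many UTF-8 bytes and Java chars a code point occupies. The following is only a minimal sketch: the class name Utf8Lengths and its two helper methods are hypothetical names made up for illustration and are not part of the method above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class Utf8Lengths {

    // Number of bytes the given code point occupies when encoded as UTF-8.
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the string contains a code point above U+FFFF (4 bytes in UTF-8).
    static boolean hasFourByteChars(String s) {
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        // 'A', e-acute, the euro sign, and an emoji (a supplementary code point).
        int[] samples = {0x41, 0xE9, 0x20AC, 0x1F600};
        for (int cp : samples) {
            System.out.printf("U+%04X: %d UTF-8 byte(s), %d Java char(s)%n",
                    cp, utf8Length(cp), Character.charCount(cp));
        }
    }
}
```

Only the last sample takes 4 bytes in UTF-8, and it is also the only one stored as two Java chars (a surrogate pair), which is why stepping through the string one char at a time would split it.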
pvgoddijn answered Sep 22 '22 19:09


Another simple solution is to use the regular expression [^\u0000-\uFFFF]. For example, in Java:

// Strings are immutable, so the result has to be assigned.
text = text.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");
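
If the replacement runs often, the same regular expression could be compiled once and reused. This is a hedged sketch rather than a definitive implementation: the class SupplementaryStripper and the method replaceNonBmp are hypothetical names, and it relies on java.util.regex matching by code point, so a surrogate pair is treated as a single character above U+FFFF.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
final class SupplementaryStripper {

    // Matches any code point above U+FFFF, i.e. one that needs 4 bytes in UTF-8.
    private static final Pattern NON_BMP = Pattern.compile("[^\\u0000-\\uFFFF]");

    // Returns a copy of text with each 4-byte character replaced by replacement.
    static String replaceNonBmp(String text, String replacement) {
        // quoteReplacement keeps '$' and '\' in the replacement literal.
        return NON_BMP.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String emoji = new String(Character.toChars(0x1F600));
        String input = "smile " + emoji + " end";
        System.out.println(replaceNonBmp(input, "?")); // prints "smile ? end"
    }
}
```

Passing "?" mirrors the question-mark behaviour mentioned in the question; passing an empty string drops the characters instead.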
slawek answered Sep 21 '22 19:09