How to replace/remove 4(+)-byte characters from a UTF-8 string in Java?

Question

Because MySQL 5.1 does not support 4-byte UTF-8 sequences, I need to replace or drop those sequences in my strings before storing them.

I'm looking for a clean way to replace these characters.

Replacing the characters with a question mark, as the Apache libraries do, is fine for this case, although an ASCII equivalent would of course be nicer.

N.B. The input comes from external sources (e-mail names), and upgrading the database is not an option at this point in time.
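
For context, UTF-8 needs 4 bytes exactly for code points above U+FFFF (the supplementary characters, which include most emoji). The following is a minimal sketch for checking whether a string is affected; the class name FourByteCheck and the method hasFourByteChars are hypothetical and not part of the original question.

```java
import java.nio.charset.StandardCharsets;

class FourByteCheck {
    // True if any code point in s lies above U+FFFF and so needs 4 bytes in UTF-8.
    static boolean hasFourByteChars(String s) {
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        String plain = "Zo\u00EB";              // only BMP characters
        String emoji = "Zo\u00EB \uD83D\uDE00"; // ends with U+1F600 as a surrogate pair
        System.out.println(hasFourByteChars(plain));
        System.out.println(hasFourByteChars(emoji));
        // Number of UTF-8 bytes used by the single code point U+1F600
        System.out.println("\uD83D\uDE00".getBytes(StandardCharsets.UTF_8).length);
    }
}
```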

pvgoddijn · Accepted Answer

We ended up implementing the following method in Java for this problem. Basically, it replaces every character whose code point is higher than that of the last 3-byte UTF-8 character (U+FFFF).

The offset calculations make sure we advance by whole Unicode code points, so a surrogate pair is never split.

public static final String LAST_3_BYTE_UTF_CHAR = "\uFFFF";
public static final String REPLACEMENT_CHAR = "\uFFFD"; 

public static String toValid3ByteUTF8String(String s)  {
    final int length = s.length();
    StringBuilder b = new StringBuilder(length);
    for (int offset = 0; offset < length; ) {
       final int codepoint = s.codePointAt(offset);

       // Code points above U+FFFF need 4 bytes in UTF-8, so replace them
       if (codepoint > CharUtils.LAST_3_BYTE_UTF_CHAR.codePointAt(0)) {
           b.append(CharUtils.REPLACEMENT_CHAR);
       } else {
           if (Character.isValidCodePoint(codepoint)) {
               b.appendCodePoint(codepoint);
           } else {
               b.append(CharUtils.REPLACEMENT_CHAR);
           }
       }
       offset += Character.charCount(codepoint);
    }
    return b.toString();
}
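
Since the question asks about dropping as well as replacing, here is a hedged variant of the same loop that takes the replacement as a parameter. The class name ThreeByteUtf8 and the method filter are assumptions made for illustration, not the answerer's code; passing an empty string drops the characters.

```java
class ThreeByteUtf8 {
    // Copies s, substituting `replacement` for every code point above U+FFFF.
    // Pass "" to drop such characters instead of replacing them.
    static String filter(String s, String replacement) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                out.append(replacement);
            } else {
                out.appendCodePoint(cp);
            }
            // Advance by 1 or 2 chars so a surrogate pair is consumed as a unit.
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String name = "Ann \uD83D\uDE00 Lee";
        System.out.println(filter(name, "\uFFFD")); // replace with U+FFFD
        System.out.println(filter(name, "?"));      // replace with a question mark
        System.out.println(filter(name, ""));       // drop entirely
    }
}
```

Note that dropping leaves the surrounding characters untouched, so the two spaces around the removed emoji both remain.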

slawek · Answer

Another simple solution is to use the regular expression [^\u0000-\uFFFF]. For example, in Java:

text.replaceAll("[^\u0000-\uFFFF]", "\uFFFD");
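
Because Java strings are immutable, replaceAll returns a new string and the result has to be kept. A minimal sketch of the regex approach with a precompiled pattern follows; the class name RegexFilter and the method replaceNonBmp are hypothetical. It relies on java.util.regex matching by code point, so a surrogate pair is treated as one character outside the range.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RegexFilter {
    // Matches any single code point outside U+0000..U+FFFF.
    private static final Pattern NON_BMP = Pattern.compile("[^\\u0000-\\uFFFF]");

    // Returns a copy of text with each such code point replaced.
    // quoteReplacement keeps '$' and '\' in the replacement literal.
    static String replaceNonBmp(String text, String replacement) {
        return NON_BMP.matcher(text).replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String text = "a\uD83D\uDE00b";
        System.out.println(replaceNonBmp(text, "\uFFFD"));
        System.out.println(replaceNonBmp(text, ""));
    }
}
```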
