
Checking whether UTF-8 data is 3-byte or 4-byte Unicode

In my database I get the error

com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column

I use Java and MySQL 5. As far as I know, 4-byte UTF-8 characters are legal in Java but illegal in MySQL 5's utf8 character set. I think this may be causing my problem, so I want to check what kind of data I have. Here's my question: how can I check whether my UTF-8 data contains only characters of up to 3 bytes, or also 4-byte characters?

akuzma asked Feb 20 '13 13:02



3 Answers

UTF-8 encodes everything in the Basic Multilingual Plane (BMP, i.e. U+0000 to U+FFFF inclusive) in 1-3 bytes; anything above U+FFFF takes 4 bytes. Therefore, you just need to check whether everything in your string is in the BMP.

In Java, that means checking whether any char (which is a UTF-16 code unit) is a high or low surrogate character, as Java will use surrogate pairs to encode non-BMP characters:

public static boolean isEntirelyInBasicMultilingualPlane(String text) {
    for (int i = 0; i < text.length(); i++) {
        if (Character.isSurrogate(text.charAt(i))) {
            return false;
        }
    }
    return true;
}
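
If you also want to know how many bytes each character takes and where the first 4-byte character sits, the check can be extended along these lines. This is only a sketch: the class name `Utf8Width` and both method names are made up for illustration, and it assumes the string is well-formed UTF-16 (no unpaired surrogates).

```java
final class Utf8Width {
    /** Returns how many bytes UTF-8 needs for a single Unicode code point. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;      // ASCII
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;   // rest of the BMP
        return 4;                            // supplementary planes
    }

    /** Returns the char index of the first 4-byte character, or -1 if there is none. */
    static int indexOfFirstFourByteChar(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (utf8Length(cp) == 4) {
                return i;
            }
            i += Character.charCount(cp);    // advance by 1 or 2 chars
        }
        return -1;
    }

    public static void main(String[] args) {
        // a, U+00E9, U+20AC, U+1F600 (written as a surrogate pair)
        String sample = "a\u00E9\u20AC\uD83D\uDE00";
        sample.codePoints().forEach(cp ->
            System.out.printf("U+%04X -> %d byte(s)%n", cp, utf8Length(cp)));
        System.out.println("first 4-byte char at index " + indexOfFirstFourByteChar(sample));
    }
}
```

Knowing the index makes it easier to log or report which part of the input MySQL would reject.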
Jon Skeet answered Oct 17 '22 05:10


If you do not want to support characters beyond the BMP, you can just strip them before handing the string to MySQL:

public static String withNonBmpStripped( String input ) {
    if( input == null ) throw new IllegalArgumentException("input");
    return input.replaceAll("[^\\u0000-\\uFFFF]", "");
}

If you want to support characters beyond the BMP, you need MySQL 5.5+ (utf8mb4 was added in 5.5.3) and you need to change everything that's utf8 to utf8mb4 (collations, charsets ...). You also need support for this in the driver, which I am not familiar with. Handling these characters in Java is also a pain because each one is spread over 2 chars (a surrogate pair) and thus needs special handling in many operations.
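
To illustrate the "2 chars" point, here is a small sketch of how `length()` and the code point count diverge, and how a string can be cut without splitting a surrogate pair. The class `CodePointDemo` and its `truncate` method are hypothetical names used only for this example.

```java
final class CodePointDemo {
    /** Truncates text to at most maxCodePoints code points without splitting a surrogate pair. */
    static String truncate(String text, int maxCodePoints) {
        if (text.codePointCount(0, text.length()) <= maxCodePoints) {
            return text;
        }
        // offsetByCodePoints converts a code point count into a char index
        int end = text.offsetByCodePoints(0, maxCodePoints);
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "ab\uD83D\uDE00c"; // a, b, U+1F600, c
        System.out.println(s.length());                      // 5 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 4 (code points)
        System.out.println(truncate(s, 3).length());         // 4 (a, b, plus the full pair)
    }
}
```

A plain `substring(0, 3)` on the same string would cut the pair in half and leave an unpaired surrogate at the end.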

Esailija answered Oct 17 '22 05:10


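The best approach I found for getting rid of non-BMP characters in Java is to replace each one with U+FFFD (the Unicode replacement character) rather than deleting it:

String cleaned = inputString.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");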
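
For comparison, the same replacement can be done without a regex by walking the string code point by code point. This is a different technique from the one-liner above, shown only as a sketch; `NonBmpReplacer` and `replaceNonBmp` are invented names.

```java
final class NonBmpReplacer {
    /** Replaces every supplementary (non-BMP) code point with the given BMP char. */
    static String replaceNonBmp(String input, char replacement) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                out.append(replacement);   // one replacement per surrogate pair
            } else {
                out.append((char) cp);     // BMP char is copied unchanged
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String cleaned = replaceNonBmp("a\uD83D\uDE00b", '\uFFFD');
        System.out.println(cleaned.length()); // 3: 'a', U+FFFD, 'b'
    }
}
```

Passing the replacement as a parameter lets the caller choose U+FFFD, `'?'`, or any other BMP character.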
verglor answered Oct 17 '22 03:10