In my database I get the error
com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column
I use Java and MySQL 5. As far as I know, 4-byte UTF-8 characters are legal in Java but not supported by MySQL 5's utf8 character set. I think that may be causing my problem, and I want to check my data. So here's my question: how can I check whether my UTF-8 data contains only characters of up to 3 bytes, or also 4-byte ones?
Java bytes are signed. If a byte is non-negative (its highest bit is 0), it is an ASCII character: if ( myByte >= 0 ) return myByte; Code points greater than 127 are encoded as several bytes. So if a byte is negative, it is probably part of a multi-byte UTF-8 sequence for a character whose code point is greater than 127.
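The sign check above can be extended to read the sequence length from the lead byte. The following is a minimal sketch, not a validator: the class and method names are made up for illustration, and it only inspects the bit pattern of a single byte, so it will accept some lead bytes (such as 0xC0 or 0xF5) that never occur in well-formed UTF-8.

```java
// Hypothetical helper: classifies one byte of UTF-8 data by its high bits.
final class Utf8LeadByte {

    /**
     * Returns the length of the UTF-8 sequence that a lead byte announces,
     * or -1 if the byte is a continuation byte or has no lead-byte pattern.
     */
    static int sequenceLength(byte b) {
        int u = b & 0xFF;          // view the signed Java byte as 0..255
        if (u < 0x80) return 1;    // 0xxxxxxx: ASCII, single byte
        if (u < 0xC0) return -1;   // 10xxxxxx: continuation byte, not a lead byte
        if (u < 0xE0) return 2;    // 110xxxxx: lead byte of a 2-byte sequence
        if (u < 0xF0) return 3;    // 1110xxxx: lead byte of a 3-byte sequence
        if (u < 0xF8) return 4;    // 11110xxx: lead byte of a 4-byte sequence
        return -1;                 // 11111xxx: never used by UTF-8
    }
}
```

Scanning a byte array and looking for any byte where this returns 4 would flag data that a 3-byte-only column cannot store, assuming the array really holds UTF-8.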
The Difference Between Unicode and UTF-8
Unicode is a character set. UTF-8 is an encoding. Unicode is a list of characters, each with a unique number (its code point).
UTF-8 is an encoding system for Unicode. It can translate any Unicode character to a matching unique binary string, and can also translate the binary string back to a Unicode character. This is the meaning of “UTF”, or “Unicode Transformation Format.”
UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes. The first 128 Unicode code points are encoded as 1 byte in UTF-8.
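To see the 1 to 4 byte range in practice, one option is to encode a few sample strings and count the bytes. This is only an illustrative sketch; the class and method names are invented here, and the sample characters are arbitrary picks from each length class.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: measures how many bytes a string needs in UTF-8.
final class Utf8Lengths {

    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // U+0041, prints 1
        System.out.println(utf8Length("\u00E9"));       // U+00E9, prints 2
        System.out.println(utf8Length("\u20AC"));       // U+20AC, prints 3
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600 as a surrogate pair, prints 4
    }
}
```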
UTF-8 encodes everything in the basic multilingual plane (i.e. U+0000 to U+FFFF inclusive) in 1-3 bytes. Therefore, you just need to check whether everything in your string is in the BMP.
In Java, that means checking whether any char (which is a UTF-16 code unit) is a high or low surrogate, as Java uses surrogate pairs to encode non-BMP characters:
public static boolean isEntirelyInBasicMultilingualPlane(String text) {
    for (int i = 0; i < text.length(); i++) {
        if (Character.isSurrogate(text.charAt(i))) {
            return false;
        }
    }
    return true;
}
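A possible variant of the same check works on code points instead of chars. This is a sketch under the assumption that Java 8 or later is available; the class and method names are hypothetical. One difference from the surrogate check: a lone, unpaired surrogate is not a supplementary code point, so this version does not flag it.

```java
// Hypothetical helper: detects characters that need 4 bytes in UTF-8.
final class FourByteCheck {

    /**
     * Returns true if the text contains at least one supplementary code point
     * (U+10000 and above), which is exactly the set UTF-8 encodes in 4 bytes.
     */
    static boolean containsFourByteCharacter(String text) {
        // codePoints() merges each valid surrogate pair into one int code point.
        return text.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }
}
```

For example, containsFourByteCharacter("\u20AC") is false because U+20AC is in the BMP, while containsFourByteCharacter("a\uD83D\uDE00") is true.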
If you do not want to support characters beyond the BMP, you can just strip them before handing the string to MySQL:
public static String withNonBmpStripped(String input) {
    if (input == null) throw new IllegalArgumentException("input");
    // Java regexes match by code point, so each surrogate pair is removed as a whole.
    return input.replaceAll("[^\\u0000-\\uFFFF]", "");
}
If you want to support characters beyond the BMP, you need MySQL 5.5+ (utf8mb4 was introduced in 5.5.3) and you need to change everything that is utf8 to utf8mb4 (collations, character sets, ...). You also need support for this in the JDBC driver, which I am not familiar with. Handling these characters in Java is also a pain because each one is spread over 2 chars and thus needs special handling in many operations.
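As an illustration of that special handling, here is a small sketch of walking a string by code point rather than by char. The class and method names are invented for this example; codePointAt and Character.charCount are standard JDK methods.

```java
// Hypothetical demo: iterating by code point so surrogate pairs stay together.
final class CodePointWalk {

    /** Counts characters as Unicode code points, not UTF-16 code units. */
    static int countCodePoints(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);     // reads both chars of a surrogate pair
            i += Character.charCount(cp);  // advance by 1 for BMP, 2 for non-BMP
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b";             // 'a', U+1F600, 'b'
        System.out.println(s.length());           // prints 4 (UTF-16 code units)
        System.out.println(countCodePoints(s));   // prints 3 (code points)
    }
}
```

The built-in s.codePointCount(0, s.length()) gives the same count; the explicit loop is shown because the same pattern applies to any per-character processing.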
The best approach I have found for non-BMP characters in Java is to replace each one with U+FFFD (the Unicode replacement character) instead of silently removing it:
inputString.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");
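Wrapped in a method, the replacement could look like the sketch below. The class and method names are hypothetical. Because Java regexes match by code point, a surrogate pair becomes a single U+FFFD, and the result then needs at most 3 bytes per character in UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper around the replaceAll call shown above.
final class NonBmpReplacer {

    /** Replaces every code point above U+FFFF with U+FFFD. */
    static String replaceNonBmp(String input) {
        return input.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");
    }

    public static void main(String[] args) {
        String cleaned = replaceNonBmp("a\uD83D\uDE00b");
        System.out.println(cleaned.length());                                // prints 3
        System.out.println(cleaned.getBytes(StandardCharsets.UTF_8).length); // prints 5
    }
}
```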