Checking UTF-8 data type 3-byte, or 4-byte Unicode

In my database I get the error

com.mysql.jdbc.MysqlDataTruncation: Data truncation: Data too long for column

I use Java and MySQL 5. As far as I know, 4-byte UTF-8 sequences are legal in Java but not in MySQL 5's utf8 character set. I think this may be causing my problem, so I want to check my data. My question: how can I check whether my UTF-8 data contains only characters of up to 3 bytes, or also 4-byte ones?

asked Feb 20 '13 13:02

akuzma

3 Answers

UTF-8 encodes everything in the Basic Multilingual Plane (BMP, i.e. U+0000 to U+FFFF inclusive) in 1-3 bytes; only code points above U+FFFF need 4 bytes. Therefore, you just need to check whether everything in your string is in the BMP.

In Java, that means checking whether any char (which is a UTF-16 code unit) is a high or low surrogate character, as Java will use surrogate pairs to encode non-BMP characters:

public static boolean isEntirelyInBasicMultilingualPlane(String text) {
    for (int i = 0; i < text.length(); i++) {
        if (Character.isSurrogate(text.charAt(i))) {
            return false;
        }
    }
    return true;
}
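
If you also want to know how wide the widest character is, rather than just a yes/no answer, the byte ranges described above can be sketched as follows. This is a minimal illustration, assuming Java 8+ for `String.codePoints()`; the class and method names (`Utf8Width`, `utf8Length`, `maxUtf8Width`) are hypothetical and not part of the original answer.

```java
final class Utf8Width {
    // Number of bytes UTF-8 uses to encode a single Unicode code point.
    static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;     // U+0000..U+007F
        if (codePoint < 0x800) return 2;    // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;  // U+0800..U+FFFF (rest of the BMP)
        return 4;                           // U+10000 and above (outside the BMP)
    }

    // Width in bytes of the widest character in the text; 0 for an empty string.
    // Note: an unpaired surrogate is reported as its own code point and counted as 3.
    static int maxUtf8Width(String text) {
        return text.codePoints().map(Utf8Width::utf8Length).max().orElse(0);
    }

    public static void main(String[] args) {
        System.out.println(maxUtf8Width("abc"));           // ASCII only
        System.out.println(maxUtf8Width("\u20AC"));        // euro sign, in the BMP
        System.out.println(maxUtf8Width("\uD83D\uDE00"));  // U+1F600, outside the BMP
    }
}
```

A result of 4 means the string contains at least one character outside the BMP, which is the same condition the surrogate check above detects.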

answered Oct 17 '22 05:10

Jon Skeet

If you do not want to support characters beyond the BMP, you can just strip them from the string before handing it to MySQL:

public static String withNonBmpStripped( String input ) {
    if( input == null ) throw new IllegalArgumentException("input");
    return input.replaceAll("[^\\u0000-\\uFFFF]", "");
}

If you want to support characters beyond the BMP, you need MySQL 5.5+ and you need to change everything that's utf8 to utf8mb4 (collations, charsets, ...). The JDBC driver also has to support this, which is something I am not familiar with. Handling these characters in Java is also a pain because each one is spread over 2 chars (a surrogate pair) and thus needs special handling in many operations.
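
To illustrate the kind of special handling meant here, the sketch below shows how `length()` and `substring()` count UTF-16 code units rather than characters, and one way to cut a string without splitting a surrogate pair. The class `SurrogateDemo` and the helper `truncateToCodePoints` are hypothetical names introduced only for this example.

```java
final class SurrogateDemo {
    // Keeps at most maxCodePoints characters, never cutting a surrogate pair in half.
    static String truncateToCodePoints(String text, int maxCodePoints) {
        if (text.codePointCount(0, text.length()) <= maxCodePoints) {
            return text;
        }
        // Translate a code point count into a char index.
        int end = text.offsetByCodePoints(0, maxCodePoints);
        return text.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600 (a surrogate pair), 'b'

        System.out.println(s.length());                      // counts chars (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // counts code points

        // A naive substring(0, 2) would end in a lone high surrogate;
        // the helper keeps the pair together.
        String safe = truncateToCodePoints(s, 2);
        System.out.println(safe.length());
    }
}
```

For the sample string, `length()` is 4 while the code point count is 3, which is why index-based operations need care once non-BMP characters are allowed.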

answered Oct 17 '22 05:10

Esailija

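The best approach I have found for dealing with non-BMP characters in Java is to replace each of them with U+FFFD (the Unicode replacement character) instead of silently dropping them:

String replaced = inputString.replaceAll("[^\\u0000-\\uFFFF]", "\uFFFD");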

answered Oct 17 '22 03:10

verglor

Tags: java, mysql, character-encoding, unicode, utf-8