I have a text area on a website where users can write anything. The problem occurs when a user copy-pastes text that contains non-UTF-8 characters and submits it to the server.
Java handles it successfully, since it supports UTF-16, but my MySQL table supports UTF-8 and so the insertion fails.
I am trying to implement something in the business logic itself to remove any characters that are not suitable for UTF-8 encoding.
Currently I am using this code:
new String(java.nio.charset.Charset.forName("UTF-8").encode(myString).array());
But it replaces the characters not suitable for UTF-8 with other obscure characters, which does not look good to the end user either. Could someone shed some light on a possible way to tackle this in Java code?
EDIT: For example, these are the exceptions I got while inserting such values:
java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\x8A\x0D\x0A...' for column
java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\x80\xF0\x9F...' for column
UTF-8 is not a character set, it's a character encoding, just like UTF-16.
UTF-8 can encode any Unicode character, and therefore any Unicode text, as a sequence of bytes, so there is no such thing as a character not suitable for UTF-8.
You're using a String constructor that takes only a byte array (String(byte[] bytes)), which according to the javadocs:
Constructs a new String by decoding the specified array of bytes using the platform's default charset.
It uses the platform's default charset to interpret the bytes (to convert the bytes to characters). Do not use this. Instead, when converting a byte array to a String, specify the encoding explicitly with the String(byte[] bytes, Charset charset) constructor.
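As a minimal sketch of that advice, the round trip below names UTF-8 explicitly on both the encode and the decode side. The class and method names (ExplicitCharsetDemo, roundTrip, roundTripViaBuffer) are made up for illustration. The second method also avoids calling array() on the ByteBuffer from the question, because the backing array can be longer than the encoded content.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {

    // Encode to UTF-8 bytes and decode them again, naming the charset both times.
    static String roundTrip(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    // Same round trip through a ByteBuffer. Only the bytes between position and
    // limit are valid, so copy exactly remaining() bytes instead of using array().
    static String roundTripViaBuffer(String text) {
        ByteBuffer buffer = StandardCharsets.UTF_8.encode(text);
        byte[] utf8Bytes = new byte[buffer.remaining()];
        buffer.get(utf8Bytes);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "smile " followed by U+1F60A, written as a surrogate pair
        String original = "smile \uD83D\uDE0A";
        System.out.println(original.equals(roundTrip(original)));
        System.out.println(original.equals(roundTripViaBuffer(original)));
    }
}
```

Both methods should hand back a string equal to the input, emoji included, regardless of the platform's default charset.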
If you have issues with certain characters, that is most likely because different character sets or encodings are used on the server side and on the client side (browser + HTML). Make sure you use UTF-8 everywhere; do not mix encodings and do not rely on the platform's default encoding.
Some reading on how to achieve this:
How to get UTF-8 working in Java webapps?
Maybe the answer using CharsetDecoder in this question helps. You could change the CodingErrorAction to REPLACE and set a replacement string ("?" in my example). The decoder then outputs that replacement for invalid byte sequences. In this example, a UTF-8 decoder capability and stress test file is read and decoded:
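import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Decoder that substitutes "?" instead of throwing on bad input
CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder();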
utf8Decoder.onMalformedInput(CodingErrorAction.REPLACE);
utf8Decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
utf8Decoder.replaceWith("?");
// Read stress file
Path path = Paths.get("<path>/UTF-8-test.txt");
byte[] data = Files.readAllBytes(path);
ByteBuffer input = ByteBuffer.wrap(data);
// UTF-8 decoding
CharBuffer output = utf8Decoder.decode(input);
// Char buffer to string
String outputString = output.toString();
System.out.println(outputString);
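For trying this out without the stress test file, here is a small self-contained sketch that feeds the same decoder configuration an in-memory byte array. The class and method names (LenientUtf8Decoder, decodeLeniently) and the sample bytes are assumptions for illustration only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class LenientUtf8Decoder {

    // Decode bytes as UTF-8, substituting "?" for each invalid byte sequence.
    static String decodeLeniently(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith("?");
        try {
            return decoder.decode(ByteBuffer.wrap(data)).toString();
        } catch (CharacterCodingException e) {
            // Not expected with REPLACE configured; rethrown unchecked for simplicity.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8
        byte[] data = {'a', (byte) 0xFF, 'b'};
        System.out.println(decodeLeniently(data)); // prints "a?b"
    }
}
```

Note that this only replaces byte sequences that are not valid UTF-8. A well-formed 4-byte sequence such as the emoji bytes in the question's exception is valid and passes through unchanged, so on its own this does not avoid the MySQL error.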
You will run into this problem when the MySQL column is encoded with the old utf8 character set, which uses at most 3 bytes per character, and the value contains a 4-byte character.
The actual solution is to use utf8mb4 instead of utf8 in MySQL.
Otherwise here is my dirty workaround to remove all 4-byte chars:
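public String removeUtf8Mb4(String text) {
    StringBuilder result = new StringBuilder(text.length());
    // Walk the string by code point rather than by char, so surrogate pairs stay together.
    for (int i = 0; i < text.length(); ) {
        int codePoint = text.codePointAt(i);
        // Code points above U+FFFF are exactly the ones that need 4 bytes in UTF-8.
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            result.appendCodePoint(codePoint);
        }
        i += Character.charCount(codePoint);
    }
    return result.toString();
}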
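If you would rather detect such input than silently strip it, a check along the following lines might help. This is only a sketch, and the names (Utf8Mb4Check, needsUtf8Mb4, utf8Hex) are made up for illustration. The hex helper prints the UTF-8 bytes in the same \xNN style as the MySQL error message, which makes it easy to confirm that the rejected value from the question starts with a 4-byte character.

```java
import java.nio.charset.StandardCharsets;

class Utf8Mb4Check {

    // True if any code point lies above U+FFFF, i.e. needs 4 bytes in UTF-8.
    static boolean needsUtf8Mb4(String text) {
        return text.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    // Hex dump of the UTF-8 bytes, formatted like MySQL's "Incorrect string value" message.
    static String utf8Hex(String text) {
        StringBuilder hex = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            hex.append(String.format("\\x%02X", b));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // U+1F60A written as a surrogate pair
        String smiley = "\uD83D\uDE0A";
        System.out.println(utf8Hex(smiley));            // prints \xF0\x9F\x98\x8A
        System.out.println(needsUtf8Mb4(smiley));       // prints true
        System.out.println(needsUtf8Mb4("plain text")); // prints false
    }
}
```

The bytes printed for the smiley match the start of the first exception in the question, so that value is well-formed UTF-8 that the 3-byte utf8 column simply cannot store.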