
Remove characters not suitable for UTF-8 encoding from a String

I have a text area on a website where users can write anything. A problem occurs when a user copies and pastes text that contains non-UTF-8 characters and submits it to the server.

Java handles it successfully, since it supports UTF-16, but my MySQL table supports UTF-8, so the insertion fails.

I was trying to find a way, in the business logic itself, to remove any characters that are not suitable for UTF-8 encoding.

Currently I am using this code:

new String(java.nio.charset.Charset.forName("UTF-8").encode(myString).array());

But it replaces the characters not suitable for UTF-8 with other obscure characters, which does not look good to the end user either. Could someone please shed some light on a possible way to tackle this in Java code?

EDIT: For example, these are the exceptions I got when inserting such values:

java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\x8A\x0D\x0A...' for column

java.sql.SQLException: Incorrect string value: '\xF0\x9F\x98\x80\xF0\x9F...' for column
Abhi asked Jan 06 '15 08:01


3 Answers

UTF-8 is not a character set, it's a character encoding, just like UTF-16.

UTF-8 can encode any Unicode character, and therefore any Unicode text, as a sequence of bytes, so there is no such thing as a character not suitable for UTF-8.
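To make this concrete, here is a minimal sketch (the class and method names are made up for illustration) that encodes the emoji U+1F60A and prints its UTF-8 bytes. It encodes without trouble as four bytes, F0 9F 98 8A, which are the same bytes MySQL reports at the start of the first exception in the question.

```java
import java.nio.charset.StandardCharsets;

class Utf8BytesDemo {
    // Returns the UTF-8 encoding of s as an uppercase hex string.
    static String utf8Hex(String s) {
        StringBuilder hex = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // Mask to 0..255 so negative bytes print as two hex digits.
            hex.append(String.format("%02X", b & 0xFF));
        }
        return hex.toString();
    }

    public static void main(String[] args) {
        // U+1F60A written as a UTF-16 surrogate pair.
        String emoji = "\uD83D\uDE0A";
        System.out.println(utf8Hex(emoji)); // prints F09F988A
    }
}
```

So the character itself is valid UTF-8; it simply needs four bytes.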

You're using the String constructor that takes only a byte array (String(byte[] bytes)), which according to the Javadoc:

Constructs a new String by decoding the specified array of bytes using the platform's default charset.

It uses the default charset of the platform to interpret the bytes (to convert the bytes to characters). Do not use this. Instead when converting a byte array to String, specify the encoding you wish to use explicitly with the String(byte[] bytes, Charset charset) constructor.
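A small sketch of what naming the charset on both sides looks like (ExplicitCharsetDemo and roundTrip are hypothetical names, not part of the question's code). It uses String.getBytes(Charset) rather than Charset.encode(...).array(), because as far as I know array() returns the whole backing array of the buffer, which can include unused trailing bytes.

```java
import java.nio.charset.StandardCharsets;

class ExplicitCharsetDemo {
    // Encode to UTF-8 and decode back, naming the charset explicitly
    // on both sides instead of relying on the platform default.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // An accented letter plus an emoji (U+1F60A as a surrogate pair).
        String original = "caf\u00E9 \uD83D\uDE0A";
        System.out.println(original.equals(roundTrip(original))); // prints true
    }
}
```

With the charset given explicitly, the result no longer depends on the machine the code runs on.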

If you have issues with certain characters, that is most likely because different character sets or encodings are used on the server side and on the client side (browser + HTML). Make sure you use UTF-8 everywhere, do not mix encodings, and do not use the default encoding of the platform.

Some reading on how to achieve this:

How to get UTF-8 working in Java webapps?

icza answered Sep 21 '22 10:09


Maybe the answer using CharsetDecoder from this question helps. You could change the CodingErrorAction to REPLACE and set a replacement string ("?" in my example). The decoder then outputs the given replacement string for each invalid byte sequence. In this example, a UTF-8 decoder capability and stress test file is read and decoded:

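import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Decoder that substitutes "?" instead of throwing on bad input
CharsetDecoder utf8Decoder = Charset.forName("UTF-8").newDecoder();
utf8Decoder.onMalformedInput(CodingErrorAction.REPLACE);
utf8Decoder.onUnmappableCharacter(CodingErrorAction.REPLACE);
utf8Decoder.replaceWith("?");

// Read stress file
Path path = Paths.get("<path>/UTF-8-test.txt");
byte[] data = Files.readAllBytes(path);
ByteBuffer input = ByteBuffer.wrap(data);

// UTF-8 decoding
CharBuffer output = utf8Decoder.decode(input);

// Char buffer to string
String outputString = output.toString();

System.out.println(outputString);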
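For trying the same idea without a test file, here is a self-contained sketch that decodes an in-memory byte array (DecoderReplaceDemo and decodeLenient are names I made up). The byte 0xFF can never occur in valid UTF-8, so it is replaced with "?". Keep in mind that this only affects byte sequences that are not valid UTF-8; a valid 4-byte character such as the emoji from the question passes through unchanged.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class DecoderReplaceDemo {
    // Decodes data as UTF-8, substituting "?" for invalid byte sequences.
    static String decodeLenient(byte[] data) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE)
                .replaceWith("?");
        return decoder.decode(ByteBuffer.wrap(data)).toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        // 'a', an invalid byte, then 'b'
        byte[] broken = { 'a', (byte) 0xFF, 'b' };
        System.out.println(decodeLenient(broken)); // prints a?b
    }
}
```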
gclaussn answered Sep 21 '22 10:09


You will run into this problem when the MySQL column uses the old utf8 charset, which stores at most 3 bytes per character, and the value contains a 4-byte character.

The actual solution is to use utf8mb4 instead of utf8 in MySQL.

Otherwise, here is my dirty workaround to remove all 4-byte chars:

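import java.nio.charset.StandardCharsets;
import java.util.StringTokenizer;

public String removeUtf8Mb4(String text) {
    StringBuilder result = new StringBuilder();
    // Using the text itself as the delimiter set, with returnDelims = true,
    // makes the tokenizer hand back each character as its own token.
    StringTokenizer st = new StringTokenizer(text, text, true);
    while (st.hasMoreTokens()) {
        String current = st.nextToken();
        // Measure the UTF-8 length explicitly; a bare getBytes() would use
        // the platform default charset and could give a different count.
        if (current.getBytes(StandardCharsets.UTF_8).length <= 3) {
            result.append(current);
        }
    }
    return result.toString();
}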
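A variation on the same workaround, sketched with code points instead of StringTokenizer (the class and method names are only illustrative). It relies on the fact that exactly the supplementary characters, U+10000 and above, take 4 bytes in UTF-8, so dropping them leaves only characters that fit in 3 bytes. It needs Java 8 or later for String.codePoints().

```java
class SupplementaryStripper {
    // Returns text with every code point above U+FFFF removed.
    static String stripSupplementary(String text) {
        StringBuilder result = new StringBuilder(text.length());
        text.codePoints()
            .filter(cp -> !Character.isSupplementaryCodePoint(cp))
            .forEach(result::appendCodePoint);
        return result.toString();
    }

    public static void main(String[] args) {
        // "hi ", the emoji U+1F60A as a surrogate pair, then "!"
        String input = "hi \uD83D\uDE0A!";
        System.out.println(stripSupplementary(input)); // prints hi !
    }
}
```

Like the original, this silently discards what the user typed, so it is still only a stopgap compared with switching the column to utf8mb4.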
Roman K answered Sep 23 '22 10:09