What you have is EXTRATERRESTRIAL ALIEN (U+1F47D)
and BROKEN HEART (U+1F494)
which
are not in the basic multilingual plane. They cannot be even represented in java as one char, "👽💔".length() == 4
. They are definitely not null characters and one will see squares if you are not using fonts that support them.
MySQL's utf8
only supports basic multilingual plane, and you need to use utf8mb4
instead:
For a supplementary character, utf8 cannot store the character at all, while utf8mb4 requires four bytes to store it. Since utf8 cannot store the character at all, you do not have any supplementary characters in utf8 columns and you need not worry about converting characters or losing data when upgrading utf8 data from older versions of MySQL.
So to support these characters, your MySQL needs to be 5.5+ and you need to use utf8mb4
everywhere. Connection encoding needs to be utf8mb4
, character set needs to be utf8mb4
and collaction needs to be utf8mb4
. For java it's still just "utf-8"
, but MySQL needs a distinction.
I don't know what driver you are using but a driver agnostic way to set connection charset is to send the query:
SET NAMES 'utf8mb4'
Right after making the connection.
See also this for Connector/J:
14.14: How can I use 4-byte UTF8, utf8mb4 with Connector/J?
To use 4-byte UTF8 with Connector/J configure the MySQL server with character_set_server=utf8mb4. Connector/J will then use that setting as long as characterEncoding has not been set in the connection string. This is equivalent to autodetection of the character set.
Adjust your columns and database as well:
var1 varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci NOT NULL
Again, your MySQL version needs to be relatively up-to-date for utf8mb4 support.
Weirdly, I found that REMOVING &characterEncoding=UTF-8
from the JDBC url
did the trick for me with similar issues.
Based on my properties,
jdbc_url=jdbc:mysql://localhost:3306/dbName?useUnicode=true
I think this supports what @Esailija has said above, i.e. my MySQL, which is indeed 5.5, is figuring out its own favorite flavor of UTF-8 encoding.
(Note, I'm also specifying the InputStream
I'm reading from as UTF-8
in the java code, which probably doesn't hurt)...
All in all, to save symbols that require 4 bytes you need to update characher-set and collation for utf8mb4
:
alter table <some_table> convert to character set utf8mb4 collate utf8mb4_unicode_ci
On my development enviromnt for #2 I prefer to set parameters on command line when starting the server:
mysqld --character-set-server=utf8mb4 --collation-server=utf8mb4_unicode_ci
btw, pay attention to Connector/J behavior with SET NAMES 'utf8mb4'
:
Do not issue the query set names with Connector/J, as the driver will not detect that the character set has changed, and will continue to use the character set detected during the initial connection setup.
And avoid setting characterEncoding
parameter in connection url as it will override configured server encoding:
To override the automatically detected encoding on the client side, use the characterEncoding property in the URL used to connect to the server.
How I solved my problem.
I had
?useUnicode=true&characterEncoding=UTF-8
In my hibernate jdbc connection url and I changed the string datatype to longtext in database, which was varchar before.
I faced the same issue and solved it by setting the Collation to utf8_general_ci for each column.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With