I have an application that allows users to persist strings to a database and those strings may contain emojis. The problem I have is an emoji such as 😊
will get stored in MySQL as ðŸ˜Å
When I retrieve this string using a PHP MySQL client and render it in a web browser, it renders fine probably because the Content-Type
is set to UTF-8. When I try to read the string in node.js, I get back what I think is the ISO8859-1 encoding a literal ðŸ˜Å
. The charset on the table is set to latin1
and that's where I'm getting ISO8859-1 from.
What's the right way to encode the string in node.js so that I can see the emoji and not the encoding set by MySQL when I console.log
the string?
ðŸ˜Å
is Mojibake for 😊
. Interpreting the former as latin1, you get hex F09F988A
, which is the UTF-8
hex for that Emoji.
(Note: UTF-8
outside MySQL is equivalent to utf8mb4
inside MySQL.)
In MySQL, you must have the column/table declared with CHARACTER SET utf8mb4
. You must also state that the data being stored/fetched is encoded utf8mb4
. Note: utf8
will not suffice.
Do a SELECT HEX(col) FROM ...
to see if you get that hex for that Emoji. If that is the case and the column is currently latin1
, then part of the fix is to carefully convert the column to utf8mb4. That is, you have CHARACTER SET latin1
, but have UTF-8 bytes in it; this will leave bytes alone while fixing charset. Assuming the column is already VARCHAR(111) CHARACTER SET latin1 NOT NULL
, then do this 2-step ALTER:
ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;
Virtually any other conversion mechanism will make a worse mess.
As for establishing the connection correctly, it goes something like this for node.js:
var connection = mysql.createConnection({ ... , charset : 'utf8mb4'});
You do not need, and should not convert encoding. Just use the right protocols. If you send the HTML page in UTF-8, the browser will send the data back to your server in UTF-8.
Then you want to store the data to your database which is in latin1
, that won't work at all. You must convert your database to UTF-8 as well. That includes the database, the tables, and eventually the columns themselves. Also make sure that your database client is configured to connect in UTF-8, because the client itself has to declare its encoding.
Once you have the whole data-flux in UTF-8, everything will work flawlessly.
Server -> GET HTML -> POST -> Server -> SQL Client -> Database -> Table -> Column
It is recommended to use iconv
(A simple ISO-8859-1 to UTF-8 conversion)
From this gist
var iconv = require('iconv');
function toUTF8(body) {
// convert from iso-8859-1 to utf-8
var ic = new iconv.Iconv('iso-8859-1', 'utf-8');
var buf = ic.convert(body);
return buf.toString('utf-8');
}
here if you pass anything in ISO-8859-1 , it will return it's UTF-8.
for example,
toUTF8("😊");
will return 😊
I have found a super dirty way to convert it back:
const isoToUtfTable = {
'ð': 0xf0,
'Ÿ': 0x9f,
'˜': 0x98,
'Å ': 0x8a
};
function convertISO8859ToUtf8(s) {
const buf = new Uint8Array([...s].map(c => isoToUtfTable[c]));
return String.fromCharCode(...buf)
}
function decode_utf8(s) {
return decodeURIComponent(escape(s));
}
console.log(decode_utf8(convertISO8859ToUtf8('😊')))
Now you simply need to complete the isoToUtfTable
table (it's small, see https://en.wikipedia.org/wiki/ISO/IEC_8859-1).
Maybe try to look at node-iconv.
const iconv = new Iconv('ISO-8859-2', 'UTF-8');
const buffer = iconv.convert(something);
console.log(buffer);
console.log(buffer.toString('UTF8'));
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With