Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Node.js encode ISO8859-1 to UTF-8

I have an application that allows users to persist strings to a database and those strings may contain emojis. The problem I have is an emoji such as 😊 will get stored in MySQL as 😊

When I retrieve this string using a PHP MySQL client and render it in a web browser, it renders fine probably because the Content-Type is set to UTF-8. When I try to read the string in node.js, I get back what I think is the ISO8859-1 encoding a literal 😊. The charset on the table is set to latin1 and that's where I'm getting ISO8859-1 from.

What's the right way to encode the string in node.js so that I can see the emoji and not the encoding set by MySQL when I console.log the string?

like image 430
randombits Avatar asked Mar 30 '17 01:03

randombits


5 Answers

😊 is Mojibake for 😊. Interpreting the former as latin1, you get hex F09F988A, which is the UTF-8 hex for that Emoji.

(Note: UTF-8 outside MySQL is equivalent to utf8mb4 inside MySQL.)

In MySQL, you must have the column/table declared with CHARACTER SET utf8mb4. You must also state that the data being stored/fetched is encoded utf8mb4. Note: utf8 will not suffice.

Do a SELECT HEX(col) FROM ... to see if you get that hex for that Emoji. If that is the case and the column is currently latin1, then part of the fix is to carefully convert the column to utf8mb4. That is, you have CHARACTER SET latin1, but have UTF-8 bytes in it; this will leave bytes alone while fixing charset. Assuming the column is already VARCHAR(111) CHARACTER SET latin1 NOT NULL, then do this 2-step ALTER:

ALTER TABLE tbl MODIFY COLUMN col VARBINARY(111) NOT NULL;
ALTER TABLE tbl MODIFY COLUMN col VARCHAR(111) CHARACTER SET utf8mb4 NOT NULL;

Virtually any other conversion mechanism will make a worse mess.

As for establishing the connection correctly, it goes something like this for node.js:

var connection = mysql.createConnection({ ... , charset : 'utf8mb4'});
like image 83
Rick James Avatar answered Oct 20 '22 18:10

Rick James


You do not need, and should not convert encoding. Just use the right protocols. If you send the HTML page in UTF-8, the browser will send the data back to your server in UTF-8.

Then you want to store the data to your database which is in latin1, that won't work at all. You must convert your database to UTF-8 as well. That includes the database, the tables, and eventually the columns themselves. Also make sure that your database client is configured to connect in UTF-8, because the client itself has to declare its encoding.

Once you have the whole data-flux in UTF-8, everything will work flawlessly.

Server -> GET HTML -> POST -> Server -> SQL Client -> Database -> Table -> Column

like image 30
Guillaume F. Avatar answered Oct 20 '22 19:10

Guillaume F.


It is recommended to use iconv(A simple ISO-8859-1 to UTF-8 conversion)

From this gist

var iconv = require('iconv');

function toUTF8(body) {
  // convert from iso-8859-1 to utf-8
  var ic = new iconv.Iconv('iso-8859-1', 'utf-8');
  var buf = ic.convert(body);
  return buf.toString('utf-8');
}

here if you pass anything in ISO-8859-1 , it will return it's UTF-8.

for example,

toUTF8("😊");

will return 😊

like image 3
Sagar V Avatar answered Oct 20 '22 17:10

Sagar V


I have found a super dirty way to convert it back:

    const isoToUtfTable = {
      'ð': 0xf0,
      'Ÿ': 0x9f,
      '˜': 0x98,
      'Å ': 0x8a
    };
    
    function convertISO8859ToUtf8(s) {
      const buf = new Uint8Array([...s].map(c => isoToUtfTable[c]));
      return String.fromCharCode(...buf)
    }
    
    function decode_utf8(s) {
      return decodeURIComponent(escape(s));
    }
    
    console.log(decode_utf8(convertISO8859ToUtf8('😊')))

Now you simply need to complete the isoToUtfTable table (it's small, see https://en.wikipedia.org/wiki/ISO/IEC_8859-1).

like image 2
zerkms Avatar answered Oct 20 '22 19:10

zerkms


Maybe try to look at node-iconv.

const iconv = new Iconv('ISO-8859-2', 'UTF-8');
const buffer = iconv.convert(something);
console.log(buffer);
console.log(buffer.toString('UTF8'));
like image 1
Samuel Tulach Avatar answered Oct 20 '22 19:10

Samuel Tulach