
Node.js unicode issue with HTTP response body

The response body of an HTTP request made with the native 'http' module shows question marks in place of Unicode characters. Here's the basic snippet of code that I'm running.

var http = require('http');

// http.createClient() has been removed from Node.js; http.get() sends the same GET request.
http.get({ host: 'www.google.it', port: 80, path: '/' }, function (response) {
  response.setEncoding('utf8');
  response.on('data', function (chunk) {
    console.log(chunk);
  });
});

The response contains a word that starts with "Pubblicit". Its last letter shows up as a question mark for me: the word should be Pubblicità, but it is displayed as Pubblicit?.

I have also tried outputting the data using .toString():

console.log(chunk.toString());

or

console.log(chunk.toString('utf8'));

But I'm getting the same results.

Any idea?

Luca Matteis asked Nov 04 '11 10:11


2 Answers

I set response.setEncoding('binary'); and it works. No idea why though.

Reference: http://groups.google.com/group/nodejs/browse_thread/thread/3bd3935b1f42a5f4?pli=1

Luca Matteis answered Nov 08 '22 00:11


The reason may be that, if the request does not send a User-Agent header that Google recognizes as UTF-8 capable, Google responds with an HTML document whose Content-Type declares ISO-8859-1 (for old browsers or bots, perhaps; I don't know). In that case, decoding the response buffer as "binary" is correct.

But if we decode an ISO-8859-1 encoded buffer as UTF-8, the byte 0xE0 (à) is read as the lead byte of a 3-byte sequence. The bytes that follow do not complete that sequence, so the character is malformed and an unexpected character (which one depends on the environment) is displayed instead.
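To make the difference concrete, here is a minimal sketch that decodes the same bytes both ways. The byte array is hand-written for illustration (it is the ISO-8859-1 encoding of "Pubblicità", not a captured Google response), and the variable names are my own.

```javascript
// Bytes of "Pubblicità" as an ISO-8859-1 server would send them:
// every character is one byte, and "à" is the single byte 0xE0.
var latin1Bytes = Buffer.from([0x50, 0x75, 0x62, 0x62, 0x6c, 0x69, 0x63, 0x69, 0x74, 0xe0]);

// Decoded as UTF-8: 0xE0 announces a 3-byte sequence that never completes,
// so the decoder cannot produce "à" and substitutes a replacement character.
var asUtf8 = latin1Bytes.toString('utf8');

// Decoded as latin1 ('binary' is an alias for it in Node.js): each byte maps
// to the code point with the same value, so 0xE0 becomes "à".
var asLatin1 = latin1Bytes.toString('latin1');

console.log(asUtf8);
console.log(asLatin1);
```

This is also why setEncoding('binary') appeared to fix the output in the other answer: it decodes one byte per character, which happens to match ISO-8859-1.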

We could try "Mozilla/5.0" as the User-Agent value. Good luck.
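A rough sketch of that idea, assuming the server declares its charset in the Content-Type header. The helper names (charsetOf, bufferEncodingFor, fetchDecoded) are hypothetical, and only the two charsets discussed in this thread are handled. Rather than trusting that the User-Agent changes the encoding, it decodes with whatever charset the response actually declares.

```javascript
var http = require('http');

// Extract the charset parameter from a Content-Type header value, if present.
function charsetOf(contentType) {
  var match = /charset\s*=\s*"?([^";\s]+)/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : null;
}

// Map an HTTP charset label to a Node.js Buffer encoding.
// Assumption: anything other than ISO-8859-1 is treated as UTF-8.
function bufferEncodingFor(charset) {
  return charset === 'iso-8859-1' ? 'latin1' : 'utf8';
}

// Hypothetical helper: request the page with a browser-like User-Agent and
// decode the whole body once, using the charset the server declared.
function fetchDecoded(host, callback) {
  var options = { host: host, path: '/', headers: { 'User-Agent': 'Mozilla/5.0' } };
  http.get(options, function (response) {
    var charset = charsetOf(response.headers['content-type']);
    var chunks = [];
    // No setEncoding() here, so each chunk is a raw Buffer.
    response.on('data', function (chunk) { chunks.push(chunk); });
    response.on('end', function () {
      callback(null, Buffer.concat(chunks).toString(bufferEncodingFor(charset)));
    });
  }).on('error', callback);
}
```

It could be called as fetchDecoded('www.google.it', function (err, body) { console.log(err || body); }). Collecting the raw buffers and decoding once at the end also avoids splitting a multi-byte character across two chunks.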

user943702 answered Nov 07 '22 22:11