
Node.js: dealing with � characters (character encoding)

I'm having difficulty with character encoding while trying to scrape the following URL:

http://www.google.com/movies?near=Montreal&date=0

My code looks like this:

var http = require('http');
var url = require('url');
var Iconv  = require('iconv').Iconv;

var location = 'montreal';

var googleMovies = url.parse("http://www.google.com/movies?near=" + location);

var req = http.request(googleMovies, function(response) {
    var str = '';
    response.on('data', function(chunk) {
        str += chunk;
    });
    response.on('end', function() {

        var iconv = new Iconv('latin1', 'UTF-8');
        str = iconv.convert(str).toString();

        console.log(str);
    });
});
req.end()

I first tried without these two lines:

    var iconv = new Iconv('latin1', 'UTF-8');
    str = iconv.convert(str).toString();

but that produced the � characters.

I've tested the source listed above on this page:

http://nlp.fi.muni.cz/projects/chared/

and it seems to detect it as latin1, but that detection could be wrong.

Tomasz Rakowski asked Oct 27 '14 02:10




2 Answers

The � characters come from the concatenation:

response.on('data', function(chunk) {
    str += chunk;
});

This implicitly converts each chunk to a String using the default encoding, utf8. Any byte sequences in the Buffers that aren't valid UTF-8 are lost and replaced by � at this point, before iconv ever sees them.
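To make the loss concrete, here is a minimal sketch that assumes the body really is latin1. The byte values and variable names are made up for illustration; they are not taken from the actual Google response.

```javascript
// 'Montréal' encoded as latin1 (ISO-8859-1): é is the single byte 0xE9.
// These bytes are an illustrative assumption, not real response data.
var latin1Bytes = Buffer.from([0x4d, 0x6f, 0x6e, 0x74, 0x72, 0xe9, 0x61, 0x6c]);

// Same implicit conversion as `str += chunk`: Buffer#toString() defaults to utf8.
var viaConcat = '' + latin1Bytes;

// 0xE9 followed by 'a' is not a valid UTF-8 sequence, so it becomes U+FFFD (�).
var hasReplacementChar = viaConcat.indexOf('\uFFFD') !== -1;

// The original byte is gone: re-encoding gives the UTF-8 bytes of U+FFFD, not 0xE9.
var roundTrip = Buffer.from(viaConcat, 'utf8');
var byteSurvived = roundTrip.indexOf(0xe9) !== -1;

console.log(hasReplacementChar, byteSurvived);
```

Once the byte has been replaced, no later call to convert() can recover it, which is why the conversion has to happen on the raw Buffers.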

You'll want to leave the chunks as Buffers until after the convert(). They can be collected in an Array and combined with Buffer.concat().

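// Collect the raw Buffers; do not turn them into a String yet.
var chunks = [];

response.on('data', function (chunk) {
    chunks.push(chunk);
});

response.on('end', function () {
    // Join the bytes first, then convert from latin1 to UTF-8 in one pass.
    var iconv = new Iconv('latin1', 'UTF-8');
    var str = iconv.convert(Buffer.concat(chunks)).toString();
    console.log(str);
});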
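If the source encoding really is latin1, recent Node.js versions can also decode it without iconv, because Buffer supports a built-in 'latin1' encoding. This is a swapped-in alternative to the iconv call above, shown as a sketch with made-up chunk contents; other source encodings would still need a library such as iconv.

```javascript
// Two chunks as they might arrive from the socket (illustrative bytes only);
// together they spell 'Montréal' in latin1, where é is the single byte 0xE9.
var chunks = [
    Buffer.from([0x4d, 0x6f, 0x6e, 0x74, 0x72]),
    Buffer.from([0xe9, 0x61, 0x6c])
];

// Join the raw bytes first, then decode exactly once with the source encoding.
var body = Buffer.concat(chunks);
var decoded = body.toString('latin1');

console.log(decoded);
```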
Jonathan Lonowski answered Nov 15 '22 01:11


If you set your User-Agent to that of a desktop browser, both the meta tag in the HTML and the Content-Type response header will declare the charset as UTF-8 instead of latin1, so no conversion is needed. Example:

var dest = url.parse('http://www.google.com/movies?near=montreal');
dest.headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36',
};

http.get(dest, function(response) {
  var str = '';

  // Let the stream decode UTF-8 itself, so each 'data' chunk is already a String.
  response.setEncoding('utf8');

  response.on('data', function(chunk) {
    str += chunk;
  }).on('end', function() {
    console.log(str);
  });
});
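The setEncoding('utf8') call matters even when the response is UTF-8, because a multi-byte character can be split across two chunks. The sketch below shows the difference using Node's string_decoder module directly; the chunk boundaries and byte values are assumptions chosen for illustration.

```javascript
var StringDecoder = require('string_decoder').StringDecoder;

// 'é' in UTF-8 is two bytes (0xC3 0xA9); here they land in different chunks.
var first = Buffer.from([0x4d, 0x6f, 0x6e, 0x74, 0x72, 0xc3]);
var second = Buffer.from([0xa9, 0x61, 0x6c]);

// Naive concatenation decodes each chunk on its own and corrupts the split character.
var naive = '' + first + second;

// A StringDecoder holds back the incomplete sequence until the rest arrives.
var decoder = new StringDecoder('utf8');
var safe = decoder.write(first) + decoder.write(second) + decoder.end();

console.log(naive === 'Montr\u00e9al', safe === 'Montr\u00e9al');
```

With setEncoding('utf8') the stream does this buffering for you, so `str += chunk` is safe in that case.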
mscdex answered Nov 14 '22 23:11