Node.js: dealing with � characters caused by character encoding

Tags: node.js, character-encoding

I'm having difficulties dealing with character encoding. I'm trying to scrape the following url:

http://www.google.com/movies?near=Montreal&date=0

My code looks like this:

var http = require('http');
var url = require('url');
var Iconv  = require('iconv').Iconv;

var location = 'montreal';

var googleMovies = url.parse("http://www.google.com/movies?near=" + location);

var req = http.request(googleMovies, function(response) {
    var str = '';
    response.on('data', function(chunk) {
        str += chunk;
    });
    response.on('end', function() {

        var iconv = new Iconv('latin1', 'UTF-8');
        str = iconv.convert(str).toString();

        console.log(str);
    });
});
req.end()

I've first tried without:

    var iconv = new Iconv('latin1', 'UTF-8');
    str = iconv.convert(str).toString();

but that was causing the � characters.

I've tested the source listed above on this page:

http://nlp.fi.muni.cz/projects/chared/

and it seems to detect the encoding as latin1, but that detection could be wrong.

asked Oct 27 '14 02:10

Tomasz Rakowski

2 Answers

The � characters come from the concatenation:

response.on('data', function(chunk) {
    str += chunk;
});

Appending a Buffer to a string implicitly converts each chunk to a String using the default encoding, utf8. Any byte sequences in the Buffers that are not valid UTF-8 are lost at this point and replaced by �.
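The effect can be reproduced without any network request. The sketch below uses hand-picked bytes (an assumption for illustration, not data from the actual response) to show two ways the concatenation can go wrong: latin1 bytes decoded as UTF-8, and a valid UTF-8 character split across two chunks.

```javascript
// Illustrative bytes only; not taken from the actual response.
var latin1Bytes = Buffer.from([0x63, 0x61, 0x66, 0xe9]); // "café" encoded as latin1

// Implicit string conversion decodes as UTF-8; 0xE9 on its own is not valid UTF-8.
var mangled = '' + latin1Bytes;
console.log(mangled === 'caf\ufffd'); // true

// Decoding the same bytes as latin1 keeps the character.
var intact = latin1Bytes.toString('latin1');
console.log(intact === 'caf\u00e9'); // true

// Even valid UTF-8 breaks if a multi-byte character is split across chunks.
var firstChunk = Buffer.from([0xc3]);  // first byte of "é" in UTF-8
var secondChunk = Buffer.from([0xa9]); // second byte of "é" in UTF-8
var split = '' + firstChunk + secondChunk;
var joined = Buffer.concat([firstChunk, secondChunk]).toString('utf8');
console.log(split === '\ufffd\ufffd', joined === '\u00e9'); // true true
```

Once a byte has been turned into U+FFFD the original value is gone, which is why converting the string afterwards cannot repair it.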

You'll want to leave the chunks as Buffers until after the convert(). They can be collected in an Array and combined with Buffer.concat().

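var chunks = [];

// Keep each chunk as a raw Buffer; do not append it to a string.
response.on('data', function (chunk) {
    chunks.push(chunk);
});

response.on('end', function () {
    // Join the raw bytes first, then convert from latin1 to UTF-8 once.
    var iconv = new Iconv('latin1', 'UTF-8');
    var str = iconv.convert(Buffer.concat(chunks)).toString();
    console.log(str);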
});
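If the response really is latin1 (ISO-8859-1), Node's built-in Buffer decoding may be enough on its own. This is a swapped-in alternative to the iconv call above, not part of the original answer; `decodeLatin1Chunks` is a hypothetical helper name and the sample chunks are made up. Note that Node's 'latin1' is plain ISO-8859-1, so pages that are actually windows-1252 would still need a converter such as iconv.

```javascript
// Hypothetical helper: join the raw chunks first, then decode once as latin1.
function decodeLatin1Chunks(chunks) {
    return Buffer.concat(chunks).toString('latin1');
}

// Made-up chunks standing in for what the 'data' events would deliver.
var sampleChunks = [
    Buffer.from([0x4d, 0x6f, 0x6e, 0x74, 0x72]), // "Montr"
    Buffer.from([0xe9, 0x61, 0x6c])              // "éal" in latin1
];

console.log(decodeLatin1Chunks(sampleChunks)); // Montréal
```

Inside the 'end' handler this would be called as decodeLatin1Chunks(chunks) in place of the two iconv lines.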

answered Nov 15 '22 01:11

Jonathan Lonowski

If you set your User-Agent to that of a desktop browser, the meta tag in the HTML and the Content-Type in the response headers will have the charset set to UTF-8 instead of latin1. Example:

var dest = url.parse('http://www.google.com/movies?near=montreal');
// Send a desktop browser User-Agent along with the request.
dest.headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36'
};

http.get(dest, function(response) {
  var str = '';

  // Let the stream decode UTF-8 itself, so characters split across chunks stay intact.
  response.setEncoding('utf8');

  response.on('data', function(chunk) {
    str += chunk;
  }).on('end', function() {
    console.log(str);
  });
});
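Rather than assuming which charset the server picked, the Content-Type header can be inspected before decoding. The following is a minimal sketch; `charsetFromContentType` is a hypothetical helper, and the regular expression is a simplification rather than a full header parser.

```javascript
// Hypothetical helper: pull the charset parameter out of a Content-Type value.
// Simplified parsing; returns null when no charset is present.
function charsetFromContentType(contentType) {
    var match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '');
    return match ? match[1].toLowerCase() : null;
}

console.log(charsetFromContentType('text/html; charset=UTF-8'));      // utf-8
console.log(charsetFromContentType('text/html; charset=ISO-8859-1')); // iso-8859-1
console.log(charsetFromContentType('text/html'));                     // null
```

In the response callback it could be called as charsetFromContentType(response.headers['content-type']), with the result deciding whether to decode as UTF-8 or to convert from another encoding.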

answered Nov 14 '22 23:11

mscdex
