I'm having difficulty with character encoding. I'm trying to scrape the following URL:
http://www.google.com/movies?near=Montreal&date=0
My code looks like this:
var http = require('http');
var url = require('url');
var Iconv = require('iconv').Iconv;

var location = 'montreal';
var googleMovies = url.parse("http://www.google.com/movies?near=" + location);

var req = http.request(googleMovies, function(response) {
    var str = '';
    response.on('data', function(chunk) {
        str += chunk;
    });
    response.on('end', function() {
        var iconv = new Iconv('latin1', 'UTF-8');
        str = iconv.convert(str).toString();
        console.log(str);
    });
});
req.end();
I first tried without these two lines:
var iconv = new Iconv('latin1', 'UTF-8');
str = iconv.convert(str).toString();
but that was causing the � characters.
I've tested the source listed above on this page:
http://nlp.fi.muni.cz/projects/chared/
and it seems to detect the encoding as latin1, but that detection could be wrong.
The � characters come from the concatenation:
response.on('data', function(chunk) {
    str += chunk;
});
This converts each chunk to a String with the default encoding of utf8. Any sequences in the Buffers that aren't valid as UTF-8 will be lost and replaced by � at this point.
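To see the loss in isolation, here is a minimal sketch that needs no network access. The sample bytes are an assumption chosen for illustration ("café" encoded as latin1), not data taken from the Google response.

```javascript
// 0xE9 is "é" in latin1, but on its own it is not a valid UTF-8 sequence.
var latin1Chunk = Buffer.from([0x63, 0x61, 0x66, 0xe9]); // "café" in latin1

// String concatenation implicitly calls chunk.toString(), which decodes as utf8.
var coerced = '' + latin1Chunk;

// Decoding the untouched bytes with the matching encoding keeps the character.
var decoded = latin1Chunk.toString('latin1');

console.log(JSON.stringify(coerced)); // the last character is U+FFFD
console.log(JSON.stringify(decoded)); // "café"
```

Once the byte has been replaced by U+FFFD, no later conversion can recover it, which is why running iconv on the concatenated string does not help.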
You'll want to leave the chunks as Buffers until after the convert(). They can be collected in an Array and combined with Buffer.concat().
var chunks = [];
response.on('data', function (chunk) {
    chunks.push(chunk);
});
response.on('end', function () {
    var iconv = new Iconv('latin1', 'UTF-8');
    var str = iconv.convert(Buffer.concat(chunks)).toString();
    console.log(str);
});
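Collecting Buffers also matters when the body really is UTF-8, because a multi-byte character can be split across two chunks. The sketch below simulates that with two hand-made chunks (an assumption for illustration: "café" in UTF-8, cut in the middle of the two-byte "é") and compares per-chunk string concatenation with Buffer.concat().

```javascript
// "é" is 0xC3 0xA9 in UTF-8; here the chunk boundary falls between the two bytes.
var chunkA = Buffer.from([0x63, 0x61, 0x66, 0xc3]);
var chunkB = Buffer.from([0xa9]);

// Naive approach: each chunk is decoded on its own, so both halves are invalid.
var naive = '';
[chunkA, chunkB].forEach(function (chunk) {
    naive += chunk;
});

// Buffered approach: join the raw bytes first, then decode once.
var joined = Buffer.concat([chunkA, chunkB]).toString('utf8');

console.log(JSON.stringify(naive));  // "caf" followed by two U+FFFD characters
console.log(JSON.stringify(joined)); // "café"
```

Decoding once over the complete byte sequence avoids the problem regardless of where the chunk boundaries land.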
If you set your User-Agent to that of a desktop browser, the meta tag in the HTML and the Content-Type in the response headers will have the charset set to UTF-8 instead of latin1. Example:
var dest = url.parse('http://www.google.com/movies?near=montreal');
dest.headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.104 Safari/537.36'
};

http.get(dest, function(response) {
    var str = '';
    // Let the stream decode UTF-8 itself so that characters split
    // across chunk boundaries are reassembled before 'data' fires.
    response.setEncoding('utf8');
    response.on('data', function(chunk) {
        str += chunk;
    }).on('end', function() {
        console.log(str);
    });
});
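Rather than guessing the encoding, you can check what the server declares in the Content-Type header. The following is a rough sketch: charsetFromContentType is a hypothetical helper written for this answer, not part of Node.js, and it assumes the header follows the usual "type/subtype; charset=name" shape.

```javascript
// Hypothetical helper: extract the charset parameter from a Content-Type value.
// Returns the charset in lower case, or null when none is declared.
function charsetFromContentType(contentType) {
    var match = /charset\s*=\s*"?([^";\s]+)"?/i.exec(contentType || '');
    return match ? match[1].toLowerCase() : null;
}

console.log(charsetFromContentType('text/html; charset=ISO-8859-1')); // iso-8859-1
console.log(charsetFromContentType('text/html; charset=UTF-8'));      // utf-8
console.log(charsetFromContentType('text/html'));                     // null
```

Inside the response callback you could call it as charsetFromContentType(response.headers['content-type']) and only run the iconv conversion when the result is not utf-8.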