I'm fetching this page with this request library in Node.js, and parsing the body using cheerio.
Calling $.html() on the parsed response body reveals that the title element for the page is:
<title>Le Relais de l'Entrec?te</title>
... when it should be:
<title>Le Relais de l'Entrecôte</title>
I've tried setting the options for the request library to include encoding: 'utf8', but that didn't seem to change anything.
How do I preserve these characters?
You can use iconv (or better iconv-lite) for the conversion itself, but to detect the encoding you should check out the charset and jschardet modules. Here's an example of them both in action:
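var request = require('request'),
    charset = require('charset'),
    jschardet = require('jschardet'),
    Iconv = require('iconv').Iconv;
// encoding: 'binary' keeps one character per raw byte, so the body can be re-decoded below
request.get({url: 'http://www.example.com', encoding: 'binary'}, function(err, res, body) {
    if (err) throw err;
    // Prefer the declared charset (header or <meta> tag); fall back to statistical detection
    var enc = (charset(res.headers, body) || jschardet.detect(body).encoding || 'utf8').toLowerCase();
    var raw = Buffer.from(body, 'binary'); // back to the original bytes
    if (enc !== 'utf8' && enc !== 'utf-8') {
        var iconv = new Iconv(enc, 'UTF-8//TRANSLIT//IGNORE');
        body = iconv.convert(raw).toString('utf8');
    } else {
        body = raw.toString('utf8');
    }
    console.log(body);
});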
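To see why the character is lost in the first place, here is a minimal sketch that uses only Node's built-in Buffer, with no request, cheerio, or iconv involved. It assumes the page is served as ISO-8859-1, which the question does not confirm, and the helper name charsetFromContentType is hypothetical, not part of any library mentioned above.

```javascript
// Sketch using only Node.js built-ins; the ISO-8859-1 assumption is illustrative.
// In ISO-8859-1, 'ô' (U+00F4) is the single byte 0xF4.
const latin1Bytes = Buffer.from("<title>Le Relais de l'Entrec\u00f4te</title>", 'latin1');

// Decoding those bytes as UTF-8 cannot interpret 0xF4 on its own,
// so the decoder substitutes U+FFFD (often displayed as '?').
const wrong = latin1Bytes.toString('utf8');

// Decoding with the encoding the bytes were actually written in restores the character.
const right = latin1Bytes.toString('latin1');

// Hypothetical helper: pull the charset parameter out of a Content-Type header value.
function charsetFromContentType(contentType) {
  const match = /charset\s*=\s*["']?([^;"'\s]+)/i.exec(contentType || '');
  return match ? match[1].toLowerCase() : null;
}

console.log(wrong.includes('\uFFFD'));
console.log(right);
console.log(charsetFromContentType('text/html; charset=ISO-8859-1'));
```

Buffer only understands a handful of encodings (utf8, latin1, and a few others), which is why the answer reaches for iconv or iconv-lite once the detected charset is anything else.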