
How to encode/decode charset encoding in NodeJS?

I have this code:

request({ url: 'http://www.myurl.com/' }, function(error, response, html) {
  if (!error && response.statusCode == 200) {
    console.log($('title', html).text());
  }
});

But the websites that I'm crawling can use different charsets (UTF-8, ISO-8859-1, etc.). How can I detect the charset of each page and always convert the HTML to the right encoding (UTF-8)?

Thanks, and sorry for my English ;)

William asked Feb 11 '26 19:02


1 Answer

A website can declare its character encoding in the Content-Type HTTP header or in a Content-Type meta tag inside the returned HTML, e.g.:

<meta http-equiv="Content-Type" content="text/html; charset=latin1"/>
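To make those two sources concrete, here is a minimal sketch of reading a declared charset by hand. The helper names (charsetFromContentType, charsetFromMeta, declaredCharset) are made up for illustration and are not part of any module; the regular expressions only cover the common forms and are an assumption about what real pages look like.

```javascript
// Hypothetical helpers: pull a charset label out of a Content-Type header
// value, or out of a <meta> tag in the markup.
function charsetFromContentType(value) {
  // Matches e.g. "text/html; charset=ISO-8859-1" or charset="utf-8"
  const match = /charset\s*=\s*["']?([^\s"';]+)/i.exec(value || '');
  return match ? match[1].toLowerCase() : null;
}

function charsetFromMeta(html) {
  // Covers both <meta charset="..."> and the http-equiv form,
  // because both contain "charset=" inside a <meta ...> tag.
  const match = /<meta[^>]*?charset\s*=\s*["']?([^\s"'/>;]+)/i.exec(html || '');
  return match ? match[1].toLowerCase() : null;
}

function declaredCharset(headers, html) {
  // Assumed priority: the HTTP header wins over the in-document declaration.
  return charsetFromContentType(headers['content-type']) || charsetFromMeta(html);
}
```

For the meta tag above, charsetFromMeta returns 'latin1'; when neither source declares anything, declaredCharset returns null and you need a fallback.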

You can use the charset module to automatically check both of these for you. Not all websites or servers will specify an encoding though, so you'll want to fall back to detecting the charset from the data itself. The jschardet module can help you with that.
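As a rough illustration of what "detecting from the data" means, the sketch below checks whether the raw bytes are valid UTF-8 and otherwise falls back to latin1. This is a much cruder heuristic than jschardet, which does statistical detection across many charsets; the function names and the latin1 fallback are assumptions made for this example only.

```javascript
// Hypothetical helper: true when the bytes decode as UTF-8 without errors.
function isValidUtf8(buffer) {
  // Invalid sequences decode to U+FFFD, so re-encoding them changes the bytes.
  return Buffer.from(buffer.toString('utf8'), 'utf8').equals(buffer);
}

// Assumed fallback policy: valid UTF-8 wins, anything else is treated as latin1.
function guessCharset(buffer) {
  return isValidUtf8(buffer) ? 'utf-8' : 'latin1';
}
```

Note that pure ASCII input passes the UTF-8 check as well, which is harmless because ASCII bytes mean the same thing in both encodings.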

Once you've worked out the charset you can use the iconv module to do the actual conversion. Here's a full example:

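var request = require('request');
var charset = require('charset');
var jschardet = require('jschardet');
var Iconv = require('iconv').Iconv;

request({url: 'http://www.myurl.com/', encoding: 'binary'}, function(error, response, html) {
    if (error || response.statusCode != 200) return;
    // Prefer the declared charset; otherwise guess it from the raw bytes.
    var enc = charset(response.headers, html) || (jschardet.detect(html).encoding || 'utf-8').toLowerCase();
    var buf = Buffer.from(html, 'binary');
    if (enc != 'utf-8' && enc != 'utf8') {
        buf = new Iconv(enc, 'UTF-8//TRANSLIT//IGNORE').convert(buf);
    }
    html = buf.toString('utf-8');
    console.log($('title', html).text());
});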
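
If you only care about the two charsets named in the question, Node's built-in Buffer can do the conversion without an extra module. The sketch below assumes you already have the response body as raw bytes in a Buffer; decodeBody is a hypothetical helper name, and any other charset still needs a conversion library such as iconv.

```javascript
// Hypothetical helper: turn raw response bytes into a JavaScript string
// using only the encodings Node's Buffer understands natively.
function decodeBody(bytes, charset) {
  const label = (charset || 'utf-8').toLowerCase();
  if (label === 'utf-8' || label === 'utf8') {
    return bytes.toString('utf8');
  }
  if (label === 'iso-8859-1' || label === 'latin1') {
    // Each byte maps directly to the code point with the same value.
    return bytes.toString('latin1');
  }
  // Anything else is out of scope for this sketch.
  throw new Error('Unsupported charset: ' + label);
}

// "café" encoded as ISO-8859-1: the final byte 0xE9 is not valid UTF-8 on its own.
const latin1Bytes = Buffer.from([0x63, 0x61, 0x66, 0xe9]);
const text = decodeBody(latin1Bytes, 'ISO-8859-1');
const utf8Bytes = Buffer.from(text, 'utf8'); // the same text re-encoded as UTF-8
```

Decoding those same bytes as UTF-8 would replace the last byte with U+FFFD, which is why the body has to be fetched as raw bytes before any decoding happens.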

Ben Dowling answered Feb 15 '26 12:02
