Node.js scrape encoding?

Tags: node.js, encoding, unicode

I'm fetching this page with the request library in Node.js and parsing the body using cheerio.

Calling $.html() on the parsed response body reveals that the title element for the page is:

<title>Le Relais de l'Entrec?te</title>

... when it should be:

<title>Le Relais de l'Entrecôte</title>

I've tried setting the options for the request library to include encoding: 'utf8', but that didn't seem to change anything.

How do I preserve these characters?

458

asked Sep 07 '12 23:09

neezer

1 Answer

You can use iconv (or, better, iconv-lite) for the conversion itself, but to detect the encoding you should check out the charset and jschardet modules. Here's an example of them both in action:

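var request = require('request'),
    charset = require('charset'),
    jschardet = require('jschardet'),
    Iconv = require('iconv').Iconv;

// encoding: 'binary' returns the body as a string with one character per byte
request.get({url: 'http://www.example.com', encoding: 'binary'}, function(err, res, body) {
    if (err) throw err;

    // Prefer the declared charset (headers or <meta>), else guess from the bytes;
    // jschardet may report no encoding at all, so fall back to UTF-8
    var enc = (charset(res.headers, body) || jschardet.detect(body).encoding || 'utf8').toLowerCase();
    var buf = Buffer.from(body, 'binary'); // recover the raw bytes

    if (enc === 'utf8' || enc === 'utf-8') {
        body = buf.toString('utf8');
    } else {
        var iconv = new Iconv(enc, 'UTF-8//TRANSLIT//IGNORE');
        body = iconv.convert(buf).toString('utf8');
    }

    console.log(body);
});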
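
To see why the character breaks in the first place, here is a minimal sketch that uses only Node's built-in Buffer, with no request, charset, jschardet, or iconv involved. It assumes the page is served as ISO-8859-1, which the question does not confirm; the byte values below are written by hand for illustration. It shows that decoding such bytes as UTF-8 loses the ô, that decoding them with the right encoding recovers it, and that a 'binary' string can be turned back into the original bytes, which is what the example above relies on.

```javascript
// Bytes of "Entrecôte" as an ISO-8859-1 server would send them
// (assumption: in that encoding ô is the single byte 0xF4).
const latin1Bytes = Buffer.from([0x45, 0x6e, 0x74, 0x72, 0x65, 0x63, 0xf4, 0x74, 0x65]);

// Decoding as UTF-8 goes wrong: 0xF4 followed by "t" is not a valid UTF-8
// sequence, so the decoder substitutes U+FFFD (the replacement character).
const wrong = latin1Bytes.toString('utf8');

// Decoding with the encoding the bytes were written in recovers the character.
const right = latin1Bytes.toString('latin1');

// A 'binary' string holds one character per byte, so converting it back
// to a Buffer restores the original bytes without loss.
const binaryString = latin1Bytes.toString('binary');
const restored = Buffer.from(binaryString, 'binary');

console.log(wrong.includes('\uFFFD')); // whether the replacement character appeared
console.log(right);
console.log(restored.equals(latin1Bytes));
```

Buffer only knows a handful of encodings (utf8, latin1, utf16le and a few others), so for anything else, such as windows-1252 or Shift_JIS, a converter like iconv or iconv-lite is still needed.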

134

answered Sep 19 '22 15:09

Ben Dowling
