 

How to most efficiently parse a web page using Node.js

I need to parse a simple web page and get data from html, such as "src", "data-attr", etc. How can I do this most efficiently using Node.js? If it helps, I'm using Node.js 0.8.x.

P.S. This is the site I'm parsing. I want to get a list of current tracks and make my own html5 app for listen on mobile devices.

Asked by NiLL on Sep 13 '12.

People also ask

Is Nodejs good for web scraping?

Web scraping is the process of extracting data from a website in an automated way, and Node.js can be used for it. Even though other languages and frameworks are more popular for web scraping, Node.js can be utilized well to do the job too.

How can I improve node JS API performance?

Caching is one of the most common ways of improving Node.js performance. Caching can be done for both client-side and server-side web applications, but server-side caching is usually the preferred choice for Node.js optimization, since it lets you cache assets such as JavaScript files, CSS stylesheets, and rendered HTML pages.
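The server-side caching described above can be sketched with nothing but the standard library. This is a minimal in-memory cache with a time-to-live; the names (SimpleCache and so on) are illustrative, not from any library, and a real app would more likely use something like Redis or memcached.

```javascript
// Minimal in-memory cache sketch with a time-to-live (TTL), ES5-style
// to match Node 0.8-era code. Entries past their TTL are evicted on read.
function SimpleCache(ttlMs) {
  this.ttlMs = ttlMs;
  this.store = {};
}

SimpleCache.prototype.set = function (key, value) {
  this.store[key] = { value: value, expires: Date.now() + this.ttlMs };
};

SimpleCache.prototype.get = function (key) {
  var entry = this.store[key];
  if (!entry) return undefined;
  if (Date.now() > entry.expires) {
    delete this.store[key];
    return undefined;
  }
  return entry.value;
};

// Usage: cache a scraped page body for 60 seconds so repeated requests
// for the same URL skip the network round-trip.
var cache = new SimpleCache(60000);
cache.set('http://example.com/', '<html>...</html>');
console.log(cache.get('http://example.com/')); // '<html>...</html>'
```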


2 Answers

I have done this a lot. You'll want to use PhantomJS if the website you're scraping relies heavily on JavaScript. Note that PhantomJS is not Node.js; it's a completely different JavaScript runtime. You can integrate it through phantomjs-node or node-phantom, but they are both kinda hacky, so YMMV with those. Avoid anything to do with jsdom: it'll cause you headaches, and this includes Zombie.js.

What you should use is Cheerio in conjunction with Request. This will be sufficient for most web pages.

I wrote a blog post on using Cheerio with Request: Quick and Dirty Screen Scraping with Node.js. But again, if the page is JavaScript intensive, use PhantomJS in conjunction with CasperJS.

Hope this helps.

Snippet using Request and Cheerio:

var request = require('request')
  , cheerio = require('cheerio');

var searchTerm = 'screen+scraping';
var url = 'http://www.bing.com/search?q=' + searchTerm;

request(url, function (err, resp, body) {
  var $ = cheerio.load(body);
  var links = $('.sb_tlst h3 a'); // use your CSS selector here
  $(links).each(function (i, link) {
    console.log($(link).text() + ':\n  ' + $(link).attr('href'));
  });
});
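Since the question specifically asks for attribute values like "src" and "data-attr": with Cheerio you'd use $(el).attr('src') as in the snippet above. As a dependency-free fallback, here is a sketch that pulls attribute values out of simple, well-formed markup with a regular expression. Regex-based HTML parsing is brittle and no substitute for a real parser like Cheerio; this is only meant for quick one-off extraction, and the function name extractAttr is made up for this example.

```javascript
// Dependency-free sketch: collect the values of one attribute on one tag.
// Brittle compared to Cheerio; assumes double-quoted, well-formed markup.
function extractAttr(html, tag, attr) {
  var re = new RegExp('<' + tag + '\\b[^>]*\\b' + attr + '="([^"]*)"', 'g');
  var match, values = [];
  while ((match = re.exec(html)) !== null) {
    values.push(match[1]);
  }
  return values;
}

// Usage: grab src and a data attribute from a snippet of markup.
var html = '<img src="a.png" data-attr="one"><img src="b.png" data-attr="two">';
console.log(extractAttr(html, 'img', 'src'));       // [ 'a.png', 'b.png' ]
console.log(extractAttr(html, 'img', 'data-attr')); // [ 'one', 'two' ]
```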
Answered by JP Richardson on Sep 17 '22.


You could try PhantomJS. Here's the documentation for using it for screen scraping.
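The basic screen-scraping pattern with PhantomJS looks like the sketch below. Note that this is a PhantomJS script, run with the phantomjs binary rather than node; the URL is a placeholder, and the filename scrape.js is just an example.

```javascript
// Run with: phantomjs scrape.js  (not with node)
var page = require('webpage').create();

page.open('http://example.com/', function (status) {
  if (status !== 'success') {
    console.log('Failed to load page');
    phantom.exit(1);
  }
  // page.evaluate runs inside the loaded page, after its JavaScript has
  // executed, so dynamically inserted elements are visible here.
  var title = page.evaluate(function () {
    return document.title;
  });
  console.log(title);
  phantom.exit();
});
```

Because evaluate runs after the page's own scripts, this handles the JavaScript-heavy pages that a plain Request + Cheerio approach cannot.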

Answered by jabclab on Sep 16 '22.