
Crawling with Node.js

Tags:

node.js

Complete Node.js noob here, so don't judge me...

I have a simple requirement: crawl a web site, find all the product pages, and save some data from each product page.

Easier said than done.

Looking at Node.js samples, I can't find anything similar.

There's the request scraper:

var request = require('request');
var jsdom = require('jsdom');

request({uri: 'http://www.google.com'}, function (error, response, body) {
  if (!error && response.statusCode == 200) {
    var window = jsdom.jsdom(body).createWindow();
    jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jquery) {
      // jQuery is now loaded on the jsdom window created from 'body'
      jquery('.someClass').each(function () { /* Your custom logic */ });
    });
  }
});

But I can't figure out how to make it call itself once it scrapes the root page, or how to populate an array of URLs that it needs to scrape. Roughly, I'm imagining something like the sketch below.
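Something like this hand-rolled queue, where each scraped page pushes newly found URLs back onto it (untested sketch; the 'a.product' selector and the example.com start URL are placeholders, and relative hrefs would still need resolving):

var request = require('request');
var jsdom = require('jsdom');

var queue = ['http://www.example.com/'];  // start with the root page
var seen = {};                            // so we don't crawl a URL twice

function crawlNext() {
  var url = queue.shift();
  if (!url) return;                       // queue drained, crawl finished
  if (seen[url]) return crawlNext();
  seen[url] = true;

  request({uri: url}, function (error, response, body) {
    if (error || response.statusCode != 200) return crawlNext();
    var window = jsdom.jsdom(body).createWindow();
    jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jquery) {
      // push every product link found on this page back onto the queue
      jquery('a.product').each(function () {
        queue.push(jquery(this).attr('href'));
      });
      // ...extract and save whatever data this page holds...
      crawlNext();                        // then move on to the next URL
    });
  });
}

crawlNext();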

Then there's the http-agent way:

var sys = require('sys');
var jsdom = require('jsdom');
var httpAgent = require('http-agent');

var agent = httpAgent.create('www.google.com', ['finance', 'news', 'images']);

agent.addListener('next', function (err, agent) {
  var window = jsdom.jsdom(agent.body).createWindow();
  jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jquery) {
    // jQuery is now loaded on the jsdom window created from 'agent.body'
    jquery('.someClass').each(function () { /* Your Custom Logic */ });

    agent.next();
  });
});

agent.addListener('stop', function (agent) {
  sys.puts('the agent has stopped');
});

agent.start();

It takes an array of locations, but once you've started it with that array, you can't add more locations to it, so it can't work through all the product pages.

And I can't even get Apricot working; for some reason I'm getting an error.

So, how do I modify any of the above examples (or anything not listed above) to scrape a site, find all the product pages, pull some data out of them (the jQuery '.someClass' example should do the trick), and save it to a DB?

Thanks!

asked Mar 20 '11 by R0b0tn1k


2 Answers

Personally, I use node.io to scrape websites: https://github.com/chriso/node.io

More details about scraping can be found in its wiki!
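As a rough illustration, a node.io job looks something like this (adapted from the patterns in the wiki; the input URL and the '.someClass' selector are placeholders):

// products.js -- run with: node.io products
var nodeio = require('node.io');

exports.job = new nodeio.Job({timeout: 10}, {
    input: ['http://www.example.com/products'],   // placeholder start page
    run: function (url) {
        this.getHtml(url, function (err, $) {
            if (err) this.exit(err);
            var items = [];
            // '.someClass' stands in for whatever marks the product data
            $('.someClass').each(function (el) {
                items.push(el.text);
            });
            this.emit(items);
        });
    }
});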


answered Sep 30 '22 by Sandro Munda


I've had pretty good success crawling and scraping with CasperJS. It's a pretty nice library built on top of PhantomJS. I like it because it's fairly succinct: callbacks can be chained as foo.then(), which is super simple to understand, and I can even use jQuery, since PhantomJS is a headless WebKit browser. For example, the following would instantiate an instance of Casper and push all links on an archive page to an array called 'links'.

var casper = require("casper").create();

var numberOfLinks = 0;
var currentLink = 0;
var links = [];
var buildPage, capture, selectLink, grabContent, writeContent;

casper.start("http://www.yoursitehere.com/page_to/scrape/", function() {
    numberOfLinks = this.evaluate(function() {
        return __utils__.findAll('.nav-selector a').length;
    });
    this.echo(numberOfLinks + " items found");

    // cause jquery makes it easier
    casper.page.injectJs('/PATH/TO/jquery.js');
});


// Capture links
capture = function() {
    links = this.evaluate(function() {
        var link = [];
        jQuery('.nav-selector a').each(function() {
            link.push($(this).attr('href'));
        });
        return link;
    });
    this.then(selectLink); // selectLink (and the other helpers) are defined in the full gist below
};

You can then use the fs module (or whatever else you like, really) to push your data into XML, CSV, or any other format. The BBC photo scraping example was exceptionally helpful when I built my scraper.
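For instance, here's a minimal sketch of dumping the links array to a CSV-ish file at the end of the run (the filename is arbitrary; note that under CasperJS, require('fs') gives you PhantomJS's fs module rather than Node's):

var fs = require('fs');   // PhantomJS's fs when running under CasperJS

casper.run(function() {
    // one href per line; 'links.csv' is just an example path
    fs.write('links.csv', links.join('\n'), 'w');
    this.exit();
});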

This is a 10,000-foot view of what Casper can do. It has a very potent and broad API. I dig it, in case you couldn't tell :).

My full scraping example is here: https://gist.github.com/imjared/5201405.

answered Sep 30 '22 by imjared