Complete Node.js noob, so dont judge me...
I have a simple requirement. Crawl a web site, find all the product pages, and save some data from the product pages.
Simpler said then done.
Looking at Node.js samples, i cant find something similar.
There a request scraper:
request({uri:'http://www.google.com'}, function (error, response, body) {
if (!error && response.statusCode == 200) {
var window = jsdom.jsdom(body).createWindow();
jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jquery) {
// jQuery is now loaded on the jsdom window created from 'body'
jQuery('.someClass').each(function () { /* Your custom logic */ });
});
}
});
But i cant figure out how to call it self once it scrapes the root page, or to populate an array or url's that it needs to scrape.
Then there's the http agent way:
var agent = httpAgent.create('www.google.com', ['finance', 'news', 'images']);
agent.addListener('next', function (err, agent) {
var window = jsdom.jsdom(agent.body).createWindow();
jsdom.jQueryify(window, 'path/to/jquery.js', function (window, jquery) {
// jQuery is now loaded on the jsdom window created from 'agent.body'
jquery('.someClass').each(function () { /* Your Custom Logic */ });
agent.next();
});
});
agent.addListener('stop', function (agent) {
sys.puts('the agent has stopped');
});
agent.start();
Which takes an array of locations, but then again, once you get it started with an array, you cant add more locations to it to go through all the product pages.
And i cant even get Apricot working, for some reason i'm getting an error.
So, how do i modify any of the above examples (or anything not listed above) to scrape a site, find all the product pages, find some data in there (the jquery.someclass example should do the trick) and that save that to a db?
Thanks!
Personally, I use Node IO to scrape some websites. https://github.com/chriso/node.io
More details about scraping can be found in the wiki !
I've had pretty good success crawling and scraping with Casperjs. It's a pretty nice library built on top of Phantomjs. I like it because it's fairly succinct. Callbacks can be executed as foo.then() which is super-simple to understand and I even can use jQuery since Phantomjs is an implementation of webkit. For example, the following would instantiate an instance of Casper and push all links on an archive page to an array called 'links'.
var casper = require("casper").create();
var numberOfLinks = 0;
var currentLink = 0;
var links = [];
var buildPage, capture, selectLink, grabContent, writeContent;
casper.start("http://www.yoursitehere.com/page_to/scrape/", function() {
numberOfLinks = this.evaluate(function() {
return __utils__.findAll('.nav-selector a').length;
});
this.echo(numberOfLinks + " items found");
// cause jquery makes it easier
casper.page.injectJs('/PATH/TO/jquery.js');
});
// Capture links
capture = function() {
links = this.evaluate(function() {
var link = [];
jQuery('.nav-selector a').each(function() {
link.push($(this).attr('href'));
});
return link;
});
this.then(selectLink);
};
You can then use node fs (or whatever else you want, really) to push your data into XML, CSV, or whatever you want. The example for scraping BBC photos was exceptionally helpful when I built my scraper.
This is a view from 10,000 feet of what casper can do. It has a very potent and broad API. I dig it, in case you couldn't tell :).
My full scraping example is here: https://gist.github.com/imjared/5201405.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With