I would like to know: is there something like Scrapy for Node.js? If not, what do you think of simply downloading the page and parsing it with cheerio? Is there a better way?
Web scraping is the process of extracting data from a website in an automated way, and Node.js can be used for it. Even though other languages and frameworks are more popular for web scraping, Node.js can do the job well too.
Scrapy is a Python framework for large scale web scraping. It gives you all the tools you need to efficiently extract data from websites, process them as you want, and store them in your preferred structure and format.
Scrapy is essentially a library that adds asynchronous IO to Python. The reason we don't have something quite like it for Node is that all IO in Node is already asynchronous (unless you explicitly make it otherwise).
Here's what a Scrapy-style script might look like in Node; notice that the URLs are processed concurrently:
const cheerio = require('cheerio');
const axios = require('axios');
const startUrls = ['http://www.google.com/', 'http://www.amazon.com/', 'http://www.wikipedia.com/']
// this might be called a "middleware" in scrapy.
const get = async url => {
  const response = await axios.get(url)
  return cheerio.load(response.data)
}
// and this is roughly Scrapy's item pipeline.
const output = item => {
  console.log(item)
}
// here is parse which is the initial scrapy callback
const parse = async url => {
  const $ = await get(url)
  output({url, title: $('title').text()})
}
// and here is the main execution
startUrls.map(url => parse(url))
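That final `.map` fires all requests at once but never waits for them, so errors go unhandled and you never learn when the run is done. If you want to wait for every page and see which URLs failed, you can wrap it in `Promise.allSettled`. A minimal sketch reusing `startUrls` and `parse` from above (the `main` helper is just an illustrative name, not part of any library):

// Sketch: await every parse() and report per-URL failures.
const main = async () => {
  const results = await Promise.allSettled(startUrls.map(url => parse(url)))
  results.forEach((result, i) => {
    if (result.status === 'rejected') {
      console.error('Failed to scrape', startUrls[i], result.reason)
    }
  })
}

main()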
Exactly the same thing? No. But both powerful and simple? Yes: crawler. Quick example:
var Crawler = require("crawler");
var c = new Crawler({
    maxConnections : 10,
    // This will be called for each crawled page
    callback : function (error, res, done) {
        if (error) {
            console.log(error);
        } else {
            var $ = res.$;
            // $ is Cheerio by default
            // a lean implementation of core jQuery designed specifically for the server
            console.log($("title").text());
        }
        done();
    }
});
// Queue just one URL, with default callback
c.queue('http://www.amazon.com');
// Queue a list of URLs
c.queue(['http://www.google.com/','http://www.yahoo.com']);
// Queue URLs with custom callbacks & parameters
c.queue([{
    uri: 'http://parishackers.org/',
    jQuery: false,
    // The global callback won't be called
    callback: function (error, res, done) {
        if (error) {
            console.log(error);
        } else {
            console.log('Grabbed', res.body.length, 'bytes');
        }
        done();
    }
}]);
// Queue some HTML code directly without grabbing (mostly for tests)
c.queue([{
    html: '<p>This is a <strong>test</strong></p>'
}]);
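The snippets above only fetch the URLs you queue by hand. A Scrapy-like crawl of a whole site also follows the links it discovers, and since the callback hands you Cheerio and you can keep calling `queue()`, that is easy to bolt on. A rough sketch, assuming `res.options.uri` holds the URL that was just fetched; the `seen` set and `baseUrl` are my own names for keeping the crawl on one site, not part of the crawler API:

var Crawler = require("crawler");
var URL = require('url');

var baseUrl = 'http://www.example.com/';
var seen = new Set();

var spider = new Crawler({
    maxConnections : 10,
    callback : function (error, res, done) {
        if (error) {
            console.log(error);
        } else {
            var $ = res.$;
            console.log(res.options.uri, '->', $('title').text());
            // Queue every same-site link we have not visited yet.
            $('a[href]').each(function () {
                var link = URL.resolve(res.options.uri, $(this).attr('href'));
                if (link.indexOf(baseUrl) === 0 && !seen.has(link)) {
                    seen.add(link);
                    spider.queue(link);
                }
            });
        }
        done();
    }
});

seen.add(baseUrl);
spider.queue(baseUrl);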
I haven't seen a solution for crawling/indexing whole websites in Node that is as strong as Scrapy in Python, so I personally still use Scrapy for crawling whole sites.
But for scraping data from individual pages there is CasperJS, and it is a very nice solution. It also works on AJAX-heavy websites, e.g. AngularJS pages, which Scrapy cannot handle on its own because it does not execute JavaScript. So for scraping data from one or a few pages I prefer CasperJS.
Cheerio is much faster than CasperJS, but it does not work with AJAX pages and does not give you the same well-structured code. So I prefer CasperJS even where the cheerio package would do.
CoffeeScript example:
casper.start 'https://reports.something.com/login', ->
    this.fill 'form',
        username: params.username
        password: params.password
    , true
casper.thenOpen queryUrl, {method: 'POST', data: queryData}, ->
    this.click 'input'
casper.then ->
    get = (number) =>
        value = this.fetchText("tr[bgcolor='#AFC5E4'] > td:nth-of-type(#{number})").trim()
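CasperJS scripts can also be written in plain JavaScript, which some people find easier to read than CoffeeScript. The same flow would look roughly like this (a sketch only: `params`, `queryUrl` and `queryData` are placeholders carried over from the snippet above, not real values):

var casper = require('casper').create();

casper.start('https://reports.something.com/login', function () {
    // Fill and submit the login form (placeholder credentials).
    this.fill('form', {
        username: params.username,
        password: params.password
    }, true);
});

casper.thenOpen(queryUrl, { method: 'POST', data: queryData }, function () {
    this.click('input');
});

casper.then(function () {
    // Read the text of the n-th cell in the highlighted row.
    var get = function (number) {
        return this.fetchText("tr[bgcolor='#AFC5E4'] > td:nth-of-type(" + number + ")").trim();
    }.bind(this);
});

casper.run();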