 

Scrapy-like tool for Node.js? [closed]

I would like to know if there is something like Scrapy for Node.js. If not, what do you think of simply downloading the page and parsing it with cheerio? Is there a better way?

asked Oct 30 '14 by user2422940

People also ask

Is Node.js good for web scraping?

Web scraping is the process of extracting data from a website in an automated way, and Node.js can be used for web scraping. Even though other languages and frameworks are more popular for web scraping, Node.js can be utilized well to do the job too.

Can Scrapy render JavaScript?

You can execute JavaScript in Scrapy with ScrapingBee, a web scraping API that handles headless browsers and proxies for you. ScrapingBee uses the latest headless Chrome version and supports JavaScript scripts. You can simply install the scrapy-scrapingbee middleware with pip.

Is Scrapy a Python framework?

Scrapy is a Python framework for large scale web scraping. It gives you all the tools you need to efficiently extract data from websites, process them as you want, and store them in your preferred structure and format.


3 Answers

Scrapy is a Python framework built around asynchronous IO. The reason we don't have something like that for Node is that in Node all IO is already asynchronous (unless you need it not to be).

Here's what a Scrapy-style script might look like in Node; notice that the URLs are processed concurrently.

const cheerio = require('cheerio');
const axios = require('axios');

const startUrls = ['http://www.google.com/', 'http://www.amazon.com/', 'http://www.wikipedia.com/']

// this might be called a "middleware" in scrapy.
const get = async url => {
  const response = await axios.get(url)
  return cheerio.load(response.data)
}

// this too.
const output = item => {
  console.log(item)
}

// here is parse which is the initial scrapy callback
const parse = async url => {
  const $ = await get(url)
  output({url, title: $('title').text()})
}

// and here is the main execution
startUrls.map(url => parse(url))
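If you also want to know when every start URL has finished (the last line above fires them and forgets), one option, as a small sketch not taken from the original answer, is to collect the promises with Promise.all instead:

// Sketch, not from the original answer: replaces the fire-and-forget
// startUrls.map(...) line above so you can await completion of all requests.
Promise.all(startUrls.map(parse))
  .then(() => console.log('all start URLs processed'))
  .catch(err => console.error('one of the requests failed:', err.message))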
answered Oct 18 '22 by pguardiario

Exactly the same thing? No. But both powerful and simple? Yes: crawler. Quick example:

var Crawler = require("crawler");

var c = new Crawler({
    maxConnections : 10,
    // This will be called for each crawled page
    callback : function (error, res, done) {
        if(error){
            console.log(error);
        }else{
            var $ = res.$;
            // $ is Cheerio by default:
            // a lean implementation of core jQuery designed specifically for the server
            console.log($("title").text());
        }
        done();
    }
});

// Queue just one URL, with default callback
c.queue('http://www.amazon.com');

// Queue a list of URLs
c.queue(['http://www.google.com/','http://www.yahoo.com']);

// Queue URLs with custom callbacks & parameters
c.queue([{
    uri: 'http://parishackers.org/',
    jQuery: false,

    // The global callback won't be called
    callback: function (error, res, done) {
        if(error){
            console.log(error);
        }else{
            console.log('Grabbed', res.body.length, 'bytes');
        }
        done();
    }
}]);

// Queue some HTML code directly without grabbing (mostly for tests)
c.queue([{
    html: '<p>This is a <strong>test</strong></p>'
}]);
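If you also need to know when the whole queue has been worked off, the crawler instance emits a drain event once every queued request has been processed. A minimal sketch, assuming the same Crawler instance c from the example above:

// Sketch (assumes the Crawler instance `c` created above):
// 'drain' fires once the queue is empty and all callbacks have run.
c.on('drain', function () {
    console.log('queue is empty, all pages crawled');
});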
answered Oct 18 '22 by Mike

I haven't seen a solution for crawling / indexing whole websites as strong as Scrapy in Python, so personally I use Python's Scrapy for crawling websites.

But for scraping data from pages there is CasperJS in the Node.js world. It is a very cool solution, and it also works for AJAX websites, e.g. AngularJS pages. Python's Scrapy cannot parse AJAX pages. So for scraping data from one or a few pages I prefer to use CasperJS.

Cheerio is really faster than CasperJS, but it doesn't work with AJAX pages and it doesn't give the code as good a structure as CasperJS does. So I prefer CasperJS even when you could use the cheerio package.

CoffeeScript example:

# log in through the report site's login form (the third argument `true` submits it)
casper.start 'https://reports.something.com/login', ->
  this.fill 'form',
    username: params.username
    password: params.password
  , true

# submit the query as a POST request and trigger the report
casper.thenOpen queryUrl, {method: 'POST', data: queryData}, ->
  this.click 'input'

# helper to read one cell of the highlighted results row
casper.then ->
  get = (number) =>
    value = this.fetchText("tr[bgcolor='#AFC5E4'] > td:nth-of-type(#{number})").trim()
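For reference, here is a minimal sketch of the same steps in plain JavaScript (CasperJS also accepts plain JS). The URL, params, queryUrl and queryData are the same placeholders used in the answer above, not real values, and the column index 2 is just an illustrative choice; note that casper.run() is needed to actually start the script:

// Plain-JavaScript sketch of the CoffeeScript steps above.
// params, queryUrl and queryData are placeholders from the answer, not real values.
var casper = require('casper').create();

casper.start('https://reports.something.com/login', function () {
    // fill the login form and submit it (third argument `true` submits the form)
    this.fill('form', {
        username: params.username,
        password: params.password
    }, true);
});

casper.thenOpen(queryUrl, { method: 'POST', data: queryData }, function () {
    this.click('input');
});

casper.then(function () {
    // read one cell of the highlighted results row (column 2 chosen for illustration)
    var value = this.fetchText("tr[bgcolor='#AFC5E4'] > td:nth-of-type(2)").trim();
    this.echo(value);
});

// nothing runs until casper.run() is called
casper.run();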
answered Oct 18 '22 by Stan