I would like to know: is there something like Scrapy for Node.js? If not, what do you think of simply downloading the page and parsing it with cheerio? Is there a better way?
Web scraping is the process of extracting data from a website in an automated way, and Node.js can be used for it. Even though other languages and frameworks are more popular for web scraping, Node.js can do the job well too.
Scrapy is a Python framework for large scale web scraping. It gives you all the tools you need to efficiently extract data from websites, process them as you want, and store them in your preferred structure and format.
Scrapy is essentially a library that adds asynchronous IO to Python. The reason we don't have something quite like it for Node is that all IO in Node is already asynchronous (unless you explicitly make it otherwise).
Here's what a Scrapy-style script might look like in Node; notice that the URLs are processed concurrently:
const cheerio = require('cheerio');
const axios = require('axios');
const startUrls = ['http://www.google.com/', 'http://www.amazon.com/', 'http://www.wikipedia.com/']
// this might be called a "middleware" in scrapy.
const get = async url => {
  const response = await axios.get(url)
  return cheerio.load(response.data)
}
// and this is roughly Scrapy's item pipeline.
const output = item => {
  console.log(item)
}
// here is parse which is the initial scrapy callback
const parse = async url => {
  const $ = await get(url)
  output({url, title: $('title').text()})
}
// and here is the main execution
startUrls.map(url => parse(url))
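That final `.map` fires all requests at once but never waits for them, so errors go unhandled and you never learn when the run is done. If you want to wait for every page and see which URLs failed, you can wrap it in `Promise.allSettled`. A minimal sketch reusing `startUrls` and `parse` from above (the `main` helper is just an illustrative name, not part of any library):

// Sketch: await every parse() and report per-URL failures.
const main = async () => {
  const results = await Promise.allSettled(startUrls.map(url => parse(url)))
  results.forEach((result, i) => {
    if (result.status === 'rejected') {
      console.error('Failed to scrape', startUrls[i], result.reason)
    }
  })
}

main()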
Exactly the same thing? No. But both powerful and simple? Yes: crawler. Quick example:
var Crawler = require("crawler");
var c = new Crawler({
    maxConnections : 10,
    // This will be called for each crawled page
    callback : function (error, res, done) {
        if (error) {
            console.log(error);
        } else {
            var $ = res.$;
            // $ is Cheerio by default
            // a lean implementation of core jQuery designed specifically for the server
            console.log($("title").text());
        }
        done();
    }
});
// Queue just one URL, with default callback
c.queue('http://www.amazon.com');
// Queue a list of URLs
c.queue(['http://www.google.com/','http://www.yahoo.com']);
// Queue URLs with custom callbacks & parameters
c.queue([{
    uri: 'http://parishackers.org/',
    jQuery: false,
    // The global callback won't be called
    callback: function (error, res, done) {
        if (error) {
            console.log(error);
        } else {
            console.log('Grabbed', res.body.length, 'bytes');
        }
        done();
    }
}]);
// Queue some HTML code directly without grabbing (mostly for tests)
c.queue([{
    html: '<p>This is a <strong>test</strong></p>'
}]);
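The snippets above only fetch the URLs you queue by hand. A Scrapy-like crawl of a whole site also follows the links it discovers, and since the callback hands you Cheerio and you can keep calling `queue()`, that is easy to bolt on. A rough sketch, assuming `res.options.uri` holds the URL that was just fetched; the `seen` set and `baseUrl` are my own names for keeping the crawl on one site, not part of the crawler API:

var Crawler = require("crawler");
var URL = require('url');

var baseUrl = 'http://www.example.com/';
var seen = new Set();

var spider = new Crawler({
    maxConnections : 10,
    callback : function (error, res, done) {
        if (error) {
            console.log(error);
        } else {
            var $ = res.$;
            console.log(res.options.uri, '->', $('title').text());
            // Queue every same-site link we have not visited yet.
            $('a[href]').each(function () {
                var link = URL.resolve(res.options.uri, $(this).attr('href'));
                if (link.indexOf(baseUrl) === 0 && !seen.has(link)) {
                    seen.add(link);
                    spider.queue(link);
                }
            });
        }
        done();
    }
});

seen.add(baseUrl);
spider.queue(baseUrl);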
I haven't seen a solution for crawling/indexing whole websites in Node that is as strong as Scrapy in Python, so I personally still use Scrapy for crawling whole sites.
But for scraping data from individual pages there is CasperJS, and it is a very nice solution. It also works on AJAX-heavy websites, e.g. AngularJS pages, which Scrapy cannot handle on its own because it does not execute JavaScript. So for scraping data from one or a few pages I prefer CasperJS.
Cheerio is much faster than CasperJS, but it does not work with AJAX pages and does not give you the same well-structured code. So I prefer CasperJS even where the cheerio package would do.
CoffeeScript example:
casper.start 'https://reports.something.com/login', ->
    this.fill 'form',
        username: params.username
        password: params.password
    , true
casper.thenOpen queryUrl, {method: 'POST', data: queryData}, ->
    this.click 'input'
casper.then ->
    get = (number) =>
        value = this.fetchText("tr[bgcolor='#AFC5E4'] > td:nth-of-type(#{number})").trim()
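CasperJS scripts can also be written in plain JavaScript, which some people find easier to read than CoffeeScript. The same flow would look roughly like this (a sketch only: `params`, `queryUrl` and `queryData` are placeholders carried over from the snippet above, not real values):

var casper = require('casper').create();

casper.start('https://reports.something.com/login', function () {
    // Fill and submit the login form (placeholder credentials).
    this.fill('form', {
        username: params.username,
        password: params.password
    }, true);
});

casper.thenOpen(queryUrl, { method: 'POST', data: queryData }, function () {
    this.click('input');
});

casper.then(function () {
    // Read the text of the n-th cell in the highlighted row.
    var get = function (number) {
        return this.fetchText("tr[bgcolor='#AFC5E4'] > td:nth-of-type(" + number + ")").trim();
    }.bind(this);
});

casper.run();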