
ScrapyRT vs Scrapyd

We've been using the Scrapyd service for a while now. It provides a nice wrapper around a Scrapy project and its spiders, letting you control the spiders via an HTTP API:

Scrapyd is a service for running Scrapy spiders.

It allows you to deploy your Scrapy projects and control their spiders using a HTTP JSON API.

But recently I've noticed another "fresh" package, ScrapyRT, that, according to the project description, sounds very promising and similar to Scrapyd:

HTTP server which provides API for scheduling Scrapy spiders and making requests with spiders.

Is this package an alternative to Scrapyd? If yes, what is the difference between the two?

alecxe asked May 17 '16 18:05



1 Answer

They don't have that much in common. As you have already seen, you have to deploy your spiders to Scrapyd and then schedule crawls. Scrapyd is a standalone service running on a server, where you can deploy and run any project or spider you like.
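To make the "deploy, then schedule" flow concrete, here is a minimal sketch of a scheduling call against Scrapyd's schedule.json endpoint. It assumes Scrapyd is listening on its default port 6800 on localhost; the project and spider names are placeholders, and the request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_schedule_request(project, spider, base_url="http://localhost:6800", **spider_args):
    """Build (but do not send) a POST request for Scrapyd's schedule.json."""
    # Scrapyd expects form-encoded fields; extra keyword arguments are
    # passed along as spider arguments.
    params = {"project": project, "spider": spider, **spider_args}
    data = urlencode(params).encode("utf-8")
    # Supplying a body makes urllib issue a POST.
    return Request(f"{base_url}/schedule.json", data=data)


# "myproject" and "myspider" are hypothetical names.
req = build_schedule_request("myproject", "myspider")
print(req.get_method(), req.full_url)
# To actually send it: urllib.request.urlopen(req) against a running Scrapyd.
```

The call returns right away with a job id; the crawl itself runs later on the server, which is the main contrast with ScrapyRT.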

With ScrapyRT, you choose one of your projects and cd to its directory. Then you run e.g. scrapyrt and start crawls for spiders in that project through a simple REST API (very similar to Scrapyd's). You get the crawled items back as part of the JSON response.
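As a rough sketch of what that looks like from the client side, the snippet below builds a crawl.json URL and pulls the items out of a response body. It assumes ScrapyRT is running on its default port 9080; the spider name, target URL and sample response are made up for illustration.

```python
import json
from urllib.parse import urlencode


def build_crawl_url(spider_name, url, base_url="http://localhost:9080"):
    """Build the GET URL that asks ScrapyRT to crawl `url` with `spider_name`."""
    query = urlencode({"spider_name": spider_name, "url": url})
    return f"{base_url}/crawl.json?{query}"


def extract_items(response_body):
    """Return the scraped items from a ScrapyRT JSON response body."""
    payload = json.loads(response_body)
    return payload.get("items", [])


# Hypothetical spider name and target page.
print(build_crawl_url("quotes", "http://example.com/page"))

# Hand-written sample body, trimmed to the fields used here.
sample_body = '{"status": "ok", "items": [{"title": "a"}]}'
print(extract_items(sample_body))
```

Unlike Scrapyd, the HTTP response itself carries the items, so the client simply waits for the crawl to finish.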

It's a very nice idea and it looks fast, lean and well defined. Scrapyd on the other hand is more mature and more generic.

Here are some key differences:

  • Scrapyd supports multiple versions of spiders and multiple projects. As far as I can see, if you want to run two different projects (or versions) with ScrapyRT, you will have to use a different port for each.
  • Scrapyd provides infrastructure for keeping items on the server, while ScrapyRT sends them back to you in the response, which, for me, means that they should be on the order of a few MBs (instead of potentially GBs). Similarly, the way logging is handled in Scrapyd is more generic than in ScrapyRT.
  • Scrapyd (potentially persistently) queues jobs and gives you control over the number of Scrapy processes that run in parallel. ScrapyRT does something simpler which, as far as I can tell, is to start a crawl for every request as soon as the request arrives. Blocking code in one of the spiders will block the others as well.
  • ScrapyRT requires a url argument which, as far as I can tell, overrides any start_urls-related logic.
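The first point can be sketched like this: with Scrapyd the project (and optionally a version) is just a parameter of one shared endpoint, while with ScrapyRT the project is implied by which instance, and therefore which port, you talk to. The port mapping and all names below are hypothetical; it assumes each ScrapyRT instance was started in its own project directory with something like scrapyrt -p <port>.

```python
from urllib.parse import urlencode

# Hypothetical setup: one ScrapyRT process per project, each on its own port.
SCRAPYRT_PORTS = {"shop": 9080, "news": 9081}


def scrapyd_target(project, spider, version=None, host="localhost"):
    """One Scrapyd endpoint serves every project; the project is a form field."""
    params = {"project": project, "spider": spider}
    if version is not None:
        # Scrapyd's schedule.json accepts an optional _version field.
        params["_version"] = version
    return f"http://{host}:6800/schedule.json", params


def scrapyrt_target(project, spider_name, url, host="localhost"):
    """With ScrapyRT the project is selected by the port of its instance."""
    port = SCRAPYRT_PORTS[project]
    query = urlencode({"spider_name": spider_name, "url": url})
    return f"http://{host}:{port}/crawl.json?{query}"


print(scrapyd_target("news", "headlines", version="r23"))
print(scrapyrt_target("news", "headlines", "http://example.com/"))
```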

I would say that ScrapyRT and Scrapyd very cleverly don't overlap at this point in time. Of course, you never know what the future holds.

neverlastn answered Nov 15 '22 01:11
