I've been using the Scrapy web-scraping framework pretty extensively, but recently I've discovered another framework/system called pyspider, which, according to its GitHub page, is fresh, actively developed and popular.

pyspider's home page lists several things as supported out of the box:
Powerful WebUI with script editor, task monitor, project manager and result viewer
Javascript pages supported!
Task priority, retry, periodical and recrawl by age or marks in index page (like update time)
Distributed architecture
These are things that Scrapy itself doesn't provide, but they are possible with the help of portia (for the web UI), scrapyjs (for JS pages) and scrapyd (for deploying and distributing through an API).
Is it true that pyspider alone can replace all of these tools? In other words, is pyspider a direct alternative to Scrapy? If not, which use cases does it cover?
I hope I'm not crossing the "too broad" or "opinion-based" line.
pyspider's own README describes it as:

"A Powerful Spider (Web Crawler) System in Python. TRY IT NOW! Write script in Python. Powerful WebUI with script editor, task monitor, project manager and result viewer. MySQL, MongoDB, Redis, SQLite, Elasticsearch; PostgreSQL with SQLAlchemy as database backend."
Scrapy is a Python framework for large-scale web scraping. It gives you all the tools you need to efficiently extract data from websites, process it however you want, and store it in your preferred structure and format.

Scrapy provides item pipelines that let you write functions to process your scraped data, such as validating items, dropping bad ones, and saving them to a database. It also provides spider contracts for testing your spiders, and it lets you build both generic and deep crawlers.
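To make the pipeline idea concrete, here is a minimal sketch of one; the PriceValidationPipeline name and the price field are hypothetical, chosen only for illustration:

```python
# Enabled (hypothetically) in settings.py via:
#     ITEM_PIPELINES = {"myproject.pipelines.PriceValidationPipeline": 300}
from scrapy.exceptions import DropItem

class PriceValidationPipeline:
    def process_item(self, item, spider):
        # Drop items missing the (hypothetical) "price" field
        if not item.get("price"):
            raise DropItem(f"Missing price in {item!r}")
        # Normalize the value before it is stored
        item["price"] = round(float(item["price"]), 2)
        return item
```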
pyspider and Scrapy have the same purpose, web scraping, but they take a different view of how to do it.
A spider should never stop until the WWW is dead. (Information changes and data is updated on websites all the time, so a spider should have the ability, and the responsibility, to scrape the latest data. That's why pyspider has a URL database, a powerful scheduler, and mechanisms like @every and age; see the handler sketch after this list.)
pyspider is a service more than a framework. (Components run in isolated processes; the lite all-in-one version runs as a service too; you don't need a Python environment, just a browser; everything about fetching or scheduling is controlled by the script via an API rather than startup parameters or global configs; resources and projects are managed by pyspider; and so on.)
pyspider is a spider system. (Any component can be replaced, even reimplemented in C/C++/Java or any other language, for better performance or larger capacity.)
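As a concrete illustration of @every, age and the script-driven API, here is a minimal handler in the style of pyspider's own sample script (the URL is a placeholder):

```python
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    @every(minutes=24 * 60)  # re-run on_start once a day
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)  # results younger than 10 days are fresh and not re-crawled
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # pyspider results are plain dicts returned from the callback
        return {'url': response.url, 'title': response.doc('title').text()}
```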
And, comparing the two feature by feature (pyspider first in each pair):

on_start vs start_url
token bucket traffic control vs download_delay
return json vs class Item
message queue vs Pipeline
built-in URL database vs set
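For instance, the "return json vs class Item" pair: a pyspider callback just returns a dict, while Scrapy typically declares the shape of the data up front (field names here are illustrative):

```python
import scrapy

# Scrapy: the result shape is declared as an Item class
class PageItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()

# pyspider: the equivalent is simply the dict returned from a callback, e.g.
#     return {'url': response.url, 'title': response.doc('title').text()}
```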
In fact, I have not borrowed much from Scrapy; pyspider is really different from Scrapy.
But why not try it yourself? pyspider is also fast, has an easy-to-use API, and you can try it without installing anything.
Since I use both scrapy and pyspider, I would like to suggest the following:
If the website is really small or simple, try pyspider first, since it has almost everything you need.
However, if you try pyspider and find it can't fit your needs, it's time to use scrapy:

- migrate on_start to start_requests
- migrate index_page to parse
- migrate detail_page to a second parse callback (e.g. parse_detail)
- change self.crawl to response.follow
Then you are almost done. Now you can play with scrapy's advanced features like middlewares, items, pipelines, etc.
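A sketch of what the migrated spider might look like in Scrapy; the spider name, URL and the parse_detail callback are illustrative:

```python
import scrapy

class MigratedSpider(scrapy.Spider):
    name = 'migrated'

    # pyspider's on_start becomes start_requests
    def start_requests(self):
        yield scrapy.Request('http://example.com/', callback=self.parse)

    # pyspider's index_page becomes parse; self.crawl becomes response.follow
    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_detail)

    # pyspider's detail_page becomes a second callback
    def parse_detail(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}
```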