I've been using the Scrapy web-scraping framework pretty extensively, but recently I've discovered another framework/system called pyspider, which, according to its GitHub page, is fresh, actively developed and popular.

pyspider's home page lists several things as supported out of the box:
Powerful WebUI with script editor, task monitor, project manager and result viewer
Javascript pages supported!
Task priority, retry, periodical and recrawl by age or marks in index page (like update time)
Distributed architecture
These are things that Scrapy itself doesn't provide, but they are possible with the help of portia (for the web UI), scrapyjs (for JS pages) and scrapyd (for deploying and distributing through an API).
Is it true that pyspider alone can replace all of these tools? In other words, is pyspider a direct alternative to Scrapy? If not, which use cases does it cover?
I hope I'm not crossing the "too broad" or "opinion-based" line.
pyspider's own README describes it as:

"A Powerful Spider (Web Crawler) System in Python. TRY IT NOW! Write script in Python. Powerful WebUI with script editor, task monitor, project manager and result viewer. MySQL, MongoDB, Redis, SQLite, Elasticsearch; PostgreSQL with SQLAlchemy as database backend."
Scrapy is a Python framework for large-scale web scraping. It gives you all the tools you need to efficiently extract data from websites, process it however you want, and store it in your preferred structure and format.

Scrapy provides item pipelines that let you write functions to process your scraped data, such as validating items, dropping bad ones, and saving them to a database. It also provides spider contracts for testing your spiders, and it lets you build both generic and deep crawlers.
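To make the pipeline idea concrete, here is a minimal sketch of one; the PriceValidationPipeline name and the price field are hypothetical, chosen only for illustration:

```python
# Enabled (hypothetically) in settings.py via:
#     ITEM_PIPELINES = {"myproject.pipelines.PriceValidationPipeline": 300}
from scrapy.exceptions import DropItem

class PriceValidationPipeline:
    def process_item(self, item, spider):
        # Drop items missing the (hypothetical) "price" field
        if not item.get("price"):
            raise DropItem(f"Missing price in {item!r}")
        # Normalize the value before it is stored
        item["price"] = round(float(item["price"]), 2)
        return item
```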
pyspider and Scrapy have the same purpose, web scraping, but they take a different view of how to do it.
A spider should never stop until the WWW is dead. (Information changes and data is updated on websites all the time, so a spider should have the ability, and the responsibility, to scrape the latest data. That's why pyspider has a URL database, a powerful scheduler, and mechanisms like @every and age; see the handler sketch after this list.)
pyspider is a service more than a framework. (Components run in isolated processes; the lite all-in-one version runs as a service too; you don't need a Python environment, just a browser; everything about fetching or scheduling is controlled by the script via an API rather than startup parameters or global configs; resources and projects are managed by pyspider; and so on.)
pyspider is a spider system. (Any component can be replaced, even reimplemented in C/C++/Java or any other language, for better performance or larger capacity.)
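As a concrete illustration of @every, age and the script-driven API, here is a minimal handler in the style of pyspider's own sample script (the URL is a placeholder):

```python
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    @every(minutes=24 * 60)  # re-run on_start once a day
    def on_start(self):
        self.crawl('http://example.com/', callback=self.index_page)

    @config(age=10 * 24 * 60 * 60)  # results younger than 10 days are fresh and not re-crawled
    def index_page(self, response):
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        # pyspider results are plain dicts returned from the callback
        return {'url': response.url, 'title': response.doc('title').text()}
```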
And, comparing the two feature by feature (pyspider first in each pair):

on_start vs start_url
token bucket traffic control vs download_delay
return json vs class Item
message queue vs Pipeline
built-in URL database vs set
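For instance, the "return json vs class Item" pair: a pyspider callback just returns a dict, while Scrapy typically declares the shape of the data up front (field names here are illustrative):

```python
import scrapy

# Scrapy: the result shape is declared as an Item class
class PageItem(scrapy.Item):
    url = scrapy.Field()
    title = scrapy.Field()

# pyspider: the equivalent is simply the dict returned from a callback, e.g.
#     return {'url': response.url, 'title': response.doc('title').text()}
```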
In fact, I have not borrowed much from Scrapy; pyspider is really different from Scrapy.
But why not try it yourself? pyspider is also fast, has an easy-to-use API, and you can try it without installing anything.
Since I use both scrapy and pyspider, I would like to suggest the following:
If the website is really small or simple, try pyspider first, since it has almost everything you need.
However, if you try pyspider and find it can't fit your needs, it's time to use scrapy:

- migrate on_start to start_requests
- migrate index_page to parse
- migrate detail_page to a second parse callback (e.g. parse_detail)
- change self.crawl to response.follow
Then you are almost done. Now you can play with scrapy's advanced features like middlewares, items, pipelines, etc.
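A sketch of what the migrated spider might look like in Scrapy; the spider name, URL and the parse_detail callback are illustrative:

```python
import scrapy

class MigratedSpider(scrapy.Spider):
    name = 'migrated'

    # pyspider's on_start becomes start_requests
    def start_requests(self):
        yield scrapy.Request('http://example.com/', callback=self.parse)

    # pyspider's index_page becomes parse; self.crawl becomes response.follow
    def parse(self, response):
        for href in response.css('a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_detail)

    # pyspider's detail_page becomes a second callback
    def parse_detail(self, response):
        yield {'url': response.url, 'title': response.css('title::text').get()}
```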