I don't want to send simultaneous requests and get blocked. I would like to send one request per second.
Scrapy is built on Twisted, a popular event-driven networking framework for Python. Thus, it's implemented using non-blocking (aka asynchronous) code for concurrency.
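Because requests are issued asynchronously, Scrapy will run many requests in parallel unless you tell it otherwise. A minimal sketch of the settings that cap that parallelism, assuming you put them in your project's settings.py (the values shown are illustrative, not necessarily the defaults):

# settings.py -- illustrative values, tune them for the site you crawl
CONCURRENT_REQUESTS = 16            # global cap on requests in flight at once
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # at most one in-flight request per domain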
Scrapyd, by contrast, is basically a daemon that listens for requests to run spiders. It runs spiders in multiple processes, and you can control that behavior with the max_proc and max_proc_per_cpu settings: max_proc is the maximum number of concurrent Scrapy processes that will be started.
There is also AutoThrottle, an extension for automatically throttling crawling speed based on the load of both the Scrapy server and the website you are crawling.
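If you would rather let Scrapy adapt the delay instead of fixing it yourself, here is a minimal sketch of turning the AutoThrottle extension on in settings.py; the numeric values are just illustrative starting points:

# settings.py -- enable AutoThrottle (illustrative values)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 10.0          # ceiling for the delay under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # aim for about one request in flight per server
AUTOTHROTTLE_DEBUG = True              # log every throttling adjustment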
Scrapy, being one of the most popular web scraping frameworks, is a great choice if you want to learn how to scrape data from the web. In this tutorial, you'll learn how to get started with Scrapy and you'll also implement an example project to scrape an e-commerce website.
There is a setting for that:
DOWNLOAD_DELAY
Default: 0
The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.
DOWNLOAD_DELAY = 0.25 # 250 ms of delay
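For the one-request-per-second case asked about here, a minimal sketch that sets the delay per spider through custom_settings (the spider name and URL are placeholders). Note that RANDOMIZE_DOWNLOAD_DELAY is on by default and spreads the actual wait between 0.5x and 1.5x of DOWNLOAD_DELAY, so switch it off if you want exactly one second:

import scrapy

class OnePerSecondSpider(scrapy.Spider):
    # hypothetical spider, only here to illustrate the settings
    name = "one_per_second"
    start_urls = ["https://example.com"]

    custom_settings = {
        "DOWNLOAD_DELAY": 1,                  # one second between requests to the same site
        "RANDOMIZE_DOWNLOAD_DELAY": False,    # keep the delay fixed instead of 0.5x-1.5x jitter
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # never two simultaneous requests to the domain
    }

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)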
Read the docs: https://doc.scrapy.org/en/latest/index.html