What is the better method of scaling Scrapy?

1. One scrapy process and increasing CONCURRENT_REQUESTS (Scrapy's internal setting).
2. Multiple scrapy processes, but still focusing on increasing the internal setting.
3. Multiple scrapy processes with some constant value of the internal setting.

If 3, then what software is better to use for launching multiple scrapy processes? And what is the best way to distribute scrapy across multiple servers?
Scrapy does not support multi-threading because it is built on Twisted, an asynchronous networking framework.
In the context of Scrapy, this means sending out "concurrent" requests instead of sending them one by one: the Scrapy spider keeps some number of simultaneous requests to the web server in flight at the same time.
Event-driven networking: Scrapy is written with Twisted, a popular event-driven networking framework for Python, so it is implemented using non-blocking (asynchronous) code for concurrency.
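As a minimal illustration of that model (the spider name and URLs are hypothetical), a spider that yields many start requests has them downloaded concurrently rather than one after another:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only.
    name = "example"

    # All of these requests are queued up front; the downloader keeps up to
    # CONCURRENT_REQUESTS of them in flight at the same time.
    start_urls = [f"https://example.com/page/{i}" for i in range(1, 51)]

    def parse(self, response):
        # Callbacks run in the same single-threaded process as the downloader,
        # so heavy parsing here competes with network I/O for the one core.
        yield {"url": response.url, "title": response.css("title::text").get()}
```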
DOWNLOAD_DELAY is the amount of time (in seconds) that the downloader should wait before downloading consecutive pages from the same website. It can be used to throttle the crawling speed to avoid hitting servers too hard, e.g. DOWNLOAD_DELAY = 0.25 for 250 ms of delay.
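Taken together, the concurrency and throttling knobs live in the project's settings.py; the values below are illustrative, not recommendations:

```python
# settings.py (illustrative values, tune against your own bandwidth and targets)

# Global cap on simultaneous requests handled by the downloader.
CONCURRENT_REQUESTS = 32
# Per-domain cap, for when a single site can't take the full load.
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# 250 ms pause between consecutive requests to the same website.
DOWNLOAD_DELAY = 0.25

# Optionally let Scrapy adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
```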
Scrapyd is a great tool for managing Scrapy processes. But the best answer I can give is that it depends. First you need to figure out where your bottleneck is.
If it is CPU-intensive parsing, you should use multiple processes. Scrapy can handle thousands of requests in parallel through Twisted's implementation of the reactor pattern, but it runs in a single process with no multi-threading, so it will only ever utilize one core.
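Scrapyd, mentioned above, is one way to launch those extra processes: each scheduled job runs as its own scrapy process, so parsing work spreads across cores (and across machines, if you run a Scrapyd instance per server). A hedged sketch against its schedule.json endpoint, where the project name, spider name, and shard argument are placeholders and the default port 6800 is assumed:

```python
import requests

SCRAPYD = "http://localhost:6800"  # Scrapyd's default bind address/port

# Schedule four jobs; Scrapyd runs each one as a separate scrapy process.
# Any extra POST field (like "shard") is passed to the spider as an argument,
# which is one way to split the URL space between the processes.
for shard in range(4):
    resp = requests.post(
        f"{SCRAPYD}/schedule.json",
        data={"project": "myproject", "spider": "example", "shard": shard},
    )
    print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}
```

Without Scrapyd, the same effect can be had by starting several scrapy crawl commands from a shell or a process supervisor.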
If it is just the number of requests that is limiting speed, tweak concurrent requests. First test your internet speed to see how much bandwidth you have; then open the network panel of your system monitor, run your spider, and compare the bandwidth you are using against your maximum. Increase your concurrent requests until you stop seeing performance gains. The stopping point could be determined by the site's capacity (though only for small sites), the site's anti-scraping/DDoS measures (assuming you don't have proxies or VPNs), your bandwidth, or another chokepoint in the system.

The last thing to know is that, while requests are handled asynchronously, items are not. If you have a lot of text and write everything locally, item processing will block requests while it writes, and you will see lulls in the system monitor's network panel. You can tweak the concurrent items setting and maybe get smoother network usage, but it will still take the same amount of time. If you are using db writes, consider an insert delayed, or a queue that flushes with an executemany after a threshold, or both; a minimal sketch of that batching idea is below.

The last chokepoint could be memory. I have run into this on an AWS micro instance, though on a laptop it probably isn't an issue. If you don't need them, consider disabling the cache, cookies, and the dupefilter; of course they can be very helpful. Concurrent items and requests also take up memory.
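The batching pipeline mentioned above could look roughly like this sketch (the sqlite3 backend, table, and column names are assumptions for illustration; the point is buffering items and flushing with executemany after a threshold instead of one write per item):

```python
import sqlite3


class BatchedWritePipeline:
    """Buffer items and flush them with executemany once a threshold is hit,
    so the spider is not blocked by one database round-trip per item."""

    def __init__(self, batch_size=500):
        self.batch_size = batch_size
        self.buffer = []

    def open_spider(self, spider):
        self.conn = sqlite3.connect("items.db")
        self.conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)")

    def process_item(self, item, spider):
        self.buffer.append((item.get("url"), item.get("title")))
        if len(self.buffer) >= self.batch_size:
            self._flush()
        return item

    def close_spider(self, spider):
        self._flush()  # write whatever is left when the crawl ends
        self.conn.close()

    def _flush(self):
        if self.buffer:
            self.conn.executemany("INSERT INTO pages VALUES (?, ?)", self.buffer)
            self.conn.commit()
            self.buffer.clear()
```

Enable it like any other pipeline via the ITEM_PIPELINES setting; for a fully non-blocking variant, the writes can instead be pushed onto a thread or Twisted's adbapi connection pool.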