What is the better method of scaling Scrapy?

1. One scrapy process and increasing CONCURRENT_REQUESTS (Scrapy's internal setting).
2. Multiple scrapy processes, but still focusing on increasing the internal setting.
3. Multiple scrapy processes with some constant value of the internal setting.

If 3, then what software is better to use for launching multiple scrapy processes? And what is the best way to distribute scrapy across multiple servers?
Scrapy does not support multi-threading because it is built on Twisted, an asynchronous networking framework.
In the context of Scrapy, this means sending out "concurrent" requests instead of sending them one by one: the Scrapy spider keeps some number of simultaneous requests to the web server in flight at the same time.
Event-driven networking: Scrapy is written with Twisted, a popular event-driven networking framework for Python, so it is implemented using non-blocking (asynchronous) code for concurrency.
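As a minimal illustration of that model (the spider name and URLs are hypothetical), a spider that yields many start requests has them downloaded concurrently rather than one after another:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    # Hypothetical spider, for illustration only.
    name = "example"

    # All of these requests are queued up front; the downloader keeps up to
    # CONCURRENT_REQUESTS of them in flight at the same time.
    start_urls = [f"https://example.com/page/{i}" for i in range(1, 51)]

    def parse(self, response):
        # Callbacks run in the same single-threaded process as the downloader,
        # so heavy parsing here competes with network I/O for the one core.
        yield {"url": response.url, "title": response.css("title::text").get()}
```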
DOWNLOAD_DELAY is the amount of time (in seconds) that the downloader should wait before downloading consecutive pages from the same website. It can be used to throttle the crawling speed to avoid hitting servers too hard, e.g. DOWNLOAD_DELAY = 0.25 for 250 ms of delay.
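Taken together, the concurrency and throttling knobs live in the project's settings.py; the values below are illustrative, not recommendations:

```python
# settings.py (illustrative values, tune against your own bandwidth and targets)

# Global cap on simultaneous requests handled by the downloader.
CONCURRENT_REQUESTS = 32
# Per-domain cap, for when a single site can't take the full load.
CONCURRENT_REQUESTS_PER_DOMAIN = 8

# 250 ms pause between consecutive requests to the same website.
DOWNLOAD_DELAY = 0.25

# Optionally let Scrapy adapt the delay to the server's response times.
AUTOTHROTTLE_ENABLED = True
```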
Scrapyd is a great tool for managing Scrapy processes. But the best answer I can give is that it depends. First you need to figure out where your bottleneck is.
If it is CPU-intensive parsing, you should use multiple processes. Scrapy can handle thousands of requests in parallel through Twisted's implementation of the reactor pattern, but it runs in a single process with no multi-threading, so it will only ever utilize one core.
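Scrapyd, mentioned above, is one way to launch those extra processes: each scheduled job runs as its own scrapy process, so parsing work spreads across cores (and across machines, if you run a Scrapyd instance per server). A hedged sketch against its schedule.json endpoint, where the project name, spider name, and shard argument are placeholders and the default port 6800 is assumed:

```python
import requests

SCRAPYD = "http://localhost:6800"  # Scrapyd's default bind address/port

# Schedule four jobs; Scrapyd runs each one as a separate scrapy process.
# Any extra POST field (like "shard") is passed to the spider as an argument,
# which is one way to split the URL space between the processes.
for shard in range(4):
    resp = requests.post(
        f"{SCRAPYD}/schedule.json",
        data={"project": "myproject", "spider": "example", "shard": shard},
    )
    print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}
```

Without Scrapyd, the same effect can be had by starting several scrapy crawl commands from a shell or a process supervisor.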
If it is just the number of requests that is limiting speed, tweak concurrent requests. First test your internet speed to see how much bandwidth you have; then open the network panel of your system monitor, run your spider, and compare the bandwidth you are using against your maximum. Increase your concurrent requests until you stop seeing performance gains. The stopping point could be determined by the site's capacity (though only for small sites), the site's anti-scraping/DDoS measures (assuming you don't have proxies or VPNs), your bandwidth, or another chokepoint in the system.

The last thing to know is that, while requests are handled asynchronously, items are not. If you have a lot of text and write everything locally, item processing will block requests while it writes, and you will see lulls in the system monitor's network panel. You can tweak the concurrent items setting and maybe get smoother network usage, but it will still take the same amount of time. If you are using db writes, consider an insert delayed, or a queue that flushes with an executemany after a threshold, or both; a minimal sketch of that batching idea is below.

The last chokepoint could be memory. I have run into this on an AWS micro instance, though on a laptop it probably isn't an issue. If you don't need them, consider disabling the cache, cookies, and the dupefilter; of course they can be very helpful. Concurrent items and requests also take up memory.
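The batching pipeline mentioned above could look roughly like this sketch (the sqlite3 backend, table, and column names are assumptions for illustration; the point is buffering items and flushing with executemany after a threshold instead of one write per item):

```python
import sqlite3


class BatchedWritePipeline:
    """Buffer items and flush them with executemany once a threshold is hit,
    so the spider is not blocked by one database round-trip per item."""

    def __init__(self, batch_size=500):
        self.batch_size = batch_size
        self.buffer = []

    def open_spider(self, spider):
        self.conn = sqlite3.connect("items.db")
        self.conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT, title TEXT)")

    def process_item(self, item, spider):
        self.buffer.append((item.get("url"), item.get("title")))
        if len(self.buffer) >= self.batch_size:
            self._flush()
        return item

    def close_spider(self, spider):
        self._flush()  # write whatever is left when the crawl ends
        self.conn.close()

    def _flush(self):
        if self.buffer:
            self.conn.executemany("INSERT INTO pages VALUES (?, ?)", self.buffer)
            self.conn.commit()
            self.buffer.clear()
```

Enable it like any other pipeline via the ITEM_PIPELINES setting; for a fully non-blocking variant, the writes can instead be pushed onto a thread or Twisted's adbapi connection pool.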