I am experiencing slow crawl speeds with Scrapy (around 1 page/sec). I'm crawling a major website from AWS servers, so I don't think it's a network issue. CPU utilization is nowhere near 100%, and if I start multiple Scrapy processes, the overall crawl speed is much faster.
Scrapy seems to crawl a bunch of pages, then hangs for several seconds, and then repeats.
I've tried playing with: CONCURRENT_REQUESTS = CONCURRENT_REQUESTS_PER_DOMAIN = 500
but this doesn't really seem to move the needle past about 20.
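For reference, here is a minimal settings.py sketch of the concurrency-related options in play; the values are illustrative, not recommendations:

    # settings.py -- illustrative values only
    CONCURRENT_REQUESTS = 500              # global cap on in-flight requests
    CONCURRENT_REQUESTS_PER_DOMAIN = 500   # per-domain cap; the lower of the two applies
    DOWNLOAD_DELAY = 0                     # no artificial per-request delay
    REACTOR_THREADPOOL_MAXSIZE = 20        # thread pool used for DNS lookups and other blocking calls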
One workaround to speed up your Scrapy crawl is to configure your start_urls appropriately. If the pages you need follow a numeric pattern, splitting that range across several start URLs lets each range be requested simultaneously, so you get roughly a 4x speedup compared to a single entry such as start_urls = ["http://apps.webofknowledge.com/doc=1"].
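A sketch of that idea, assuming a sequential doc=N URL pattern; the chunk boundaries below are purely illustrative:

    # spider sketch -- the doc=N pattern and range boundaries are assumptions
    import scrapy

    class DocsSpider(scrapy.Spider):
        name = "docs"
        # four starting points let Scrapy work on four ranges of documents concurrently
        start_urls = [
            "http://apps.webofknowledge.com/doc=1",
            "http://apps.webofknowledge.com/doc=250001",
            "http://apps.webofknowledge.com/doc=500001",
            "http://apps.webofknowledge.com/doc=750001",
        ]

        def parse(self, response):
            # extract data here, then follow the next document in the current range
            yield {"url": response.url}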
That tells you that Scrapy is able to crawl about 3000 pages per minute on the hardware where you run it. Note that this is a very simple spider intended only to follow links; any custom spider you write will probably do more work, which results in slower crawl rates.
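If that figure comes from Scrapy's built-in benchmark (an assumption here), you can reproduce it on your own machine with:

    scrapy bench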
If you want to keep a download delay of exactly one second, setting DOWNLOAD_DELAY = 1 is the way to do it. But Scrapy also has a feature for setting download delays automatically, called AutoThrottle. It adjusts delays based on the load of both the Scrapy server and the website you are crawling.
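A minimal AutoThrottle setup might look like the following; the numbers are illustrative starting points, not tuned values:

    # settings.py -- minimal AutoThrottle sketch (values are illustrative)
    AUTOTHROTTLE_ENABLED = True
    AUTOTHROTTLE_START_DELAY = 1.0         # initial download delay in seconds
    AUTOTHROTTLE_MAX_DELAY = 10.0          # upper bound if the site responds slowly
    AUTOTHROTTLE_TARGET_CONCURRENCY = 8.0  # average requests Scrapy should keep in flight
    AUTOTHROTTLE_DEBUG = True              # log throttling stats for every response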
One of the biggest advantages of Scrapy is speed. Since it's asynchronous, Scrapy spiders don't have to wait to make requests one at a time; they can make many requests in parallel.
Are you sure you are allowed to crawl the destination site at high speed? Many sites implement request-rate thresholds and, after a while, start responding slowly or throttling you.