I am trying to make a Scrapy broad crawl. The goal is to run many concurrent crawls across different domains while crawling gently on each one, so that the overall crawling speed stays high but the request frequency on each individual domain stays low.
Here is the spider I use:
import re
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from myproject.items import MyprojectItem


class testSpider(CrawlSpider):
    name = "testCrawler16"
    start_urls = [
        "http://example.com",
    ]

    extractor = SgmlLinkExtractor(deny=('.com', '.nl', '.org'),
                                  allow=('.se'))
    rules = (
        Rule(extractor, callback='parse_links', follow=True),
    )

    def parse_links(self, response):
        item = MyprojectItem()
        item['url'] = response.url
        item['depth'] = response.meta['depth']
        yield item
And here are the settings I use:
BOT_NAME = 'myproject'
SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'
REACTOR_THREADPOOL_MAXSIZE = 20
RETRY_ENABLED = False
REDIRECT_ENABLED = False
DOWNLOAD_TIMEOUT = 15
LOG_LEVEL = 'INFO'
COOKIES_ENABLED = False
DEPTH_LIMIT = 10
AUTOTHROTTLE_ENABLED = True
CONCURRENT_REQUESTS = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 1
AUTOTHROTTLE_TARGET_CONCURRENCY = 1
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60
The problem is that after a while the crawler crawls less and less concurrently and only alternates between a few domains (sometimes only one), so AutoThrottle slows the crawl down. How can I make the spider keep up the concurrency, maintain many separate connections to many domains, and use that concurrency to sustain speed while keeping the request rate low on each domain?
AUTOTHROTTLE_ENABLED is not recommended for fast crawling; I would recommend setting it to False and just crawling gently on your own.
The only settings you need here are CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN, and DOWNLOAD_DELAY.
Set DOWNLOAD_DELAY to the interval you want between requests to the same domain, for example 10 if you want 6 requests per minute (one every 10 seconds).
Set CONCURRENT_REQUESTS_PER_DOMAIN to 1 so the previous DOWNLOAD_DELAY interval is respected per domain.
Set CONCURRENT_REQUESTS to a high value; it could be the number of domains you are planning to crawl (or higher). This is just so it doesn't interfere with the previous settings.
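Putting those together, a minimal settings.py sketch might look like the following. The value 100 for CONCURRENT_REQUESTS is an assumption; pick it based on how many domains you actually plan to crawl:

```python
# settings.py -- sketch, assuming roughly 100 domains in the crawl
AUTOTHROTTLE_ENABLED = False        # throttle manually instead of auto-throttling

DOWNLOAD_DELAY = 10                 # seconds between requests to one domain
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # enforce the delay strictly per domain
CONCURRENT_REQUESTS = 100           # high global cap so per-domain limits dominate
```

With this, each domain gets at most 6 requests per minute, while the global concurrency keeps many domains in flight at once.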