Performing a Scrapy broad crawl with high concurrency and low request rate on each domain.

I am trying to run a Scrapy broad crawl. The goal is to crawl many domains concurrently while crawling gently on each one, so the crawl maintains good overall speed but keeps the request frequency low on each individual domain.

Here is the spider I use:

import scrapy
from scrapy.spiders import CrawlSpider, Rule     # the scrapy.contrib paths are deprecated
from scrapy.linkextractors import LinkExtractor  # SgmlLinkExtractor has been removed
from myproject.items import MyprojectItem

class testSpider(CrawlSpider):
    name = "testCrawler16"
    start_urls = [
        "http://example.com",
    ]

    # allow/deny take regular expressions, so the dots should be escaped
    extractor = LinkExtractor(deny=(r'\.com', r'\.nl', r'\.org'),
                              allow=(r'\.se',))

    rules = (
        Rule(extractor, callback='parse_links', follow=True),
    )

    def parse_links(self, response):
        item = MyprojectItem()
        item['url'] = response.url
        item['depth'] = response.meta['depth']
        yield item

And here are the settings I use:

BOT_NAME = 'myproject'

SPIDER_MODULES = ['myproject.spiders']
NEWSPIDER_MODULE = 'myproject.spiders'

REACTOR_THREADPOOL_MAXSIZE = 20
RETRY_ENABLED = False
REDIRECT_ENABLED = False
DOWNLOAD_TIMEOUT = 15
LOG_LEVEL = 'INFO'
COOKIES_ENABLED = False
DEPTH_LIMIT = 10


AUTOTHROTTLE_ENABLED = True
CONCURRENT_REQUESTS = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 1
AUTOTHROTTLE_TARGET_CONCURRENCY = 1
AUTOTHROTTLE_START_DELAY = 5
AUTOTHROTTLE_MAX_DELAY = 60

The problem is that after a while the crawler runs less and less concurrently and only alternates between a few domains, sometimes only one, so AutoThrottle slows the crawl down. How can I make the spider keep up its concurrency, maintain many separate connections to many domains, and use that concurrency to sustain speed while keeping the request rate low on each domain?

codeer asked Oct 18 '22 09:10

1 Answer

AUTOTHROTTLE_ENABLED is not recommended for fast crawling; I would set it to False and just throttle gently on your own.

The only settings you need here are CONCURRENT_REQUESTS, CONCURRENT_REQUESTS_PER_DOMAIN and DOWNLOAD_DELAY.

Set DOWNLOAD_DELAY to the interval you want between requests to the same domain: 10, for example, if you want 6 requests per minute (one every 10 seconds).

Set CONCURRENT_REQUESTS_PER_DOMAIN to 1 so the DOWNLOAD_DELAY interval above is actually respected for each domain.

Set CONCURRENT_REQUESTS to a high value; it could be the number of domains you plan to crawl (or higher). This is just so it doesn't interfere with the previous settings.
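Putting these recommendations together, a minimal settings sketch might look like the following (the concrete numbers, such as the 200-domain figure, are illustrative assumptions, not values taken from the answer):

```python
# settings.py: throttle manually instead of relying on AutoThrottle
AUTOTHROTTLE_ENABLED = False

# One request every 10 seconds to any given domain,
# i.e. at most 6 requests per minute per domain.
DOWNLOAD_DELAY = 10

# Only one request in flight per domain, so DOWNLOAD_DELAY
# is the real interval between requests to that domain.
CONCURRENT_REQUESTS_PER_DOMAIN = 1

# Global cap: roughly the number of domains you plan to crawl
# (or higher), so it never starves the per-domain slots.
CONCURRENT_REQUESTS = 200
```

With these example values the crawler can keep up to 200 connections open at once, while each domain still sees at most one request every 10 seconds.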

eLRuLL answered Nov 09 '22 14:11