 

Dynamically add to allowed_domains in a Scrapy spider

I have a spider that starts with a small list of allowed_domains. I need to add more domains to this whitelist dynamically as the crawl continues, from within a parse callback, but the following piece of code does not accomplish that: subsequent requests are still being filtered. Is there another way of updating allowed_domains from within the parser?

import urlparse

from bs4 import BeautifulSoup
from scrapy.http import Request
from scrapy.spider import BaseSpider


class APSpider(BaseSpider):
    name = "APSpider"

    allowed_domains = ["www.somedomain.com"]

    start_urls = [
        "http://www.somedomain.com/list-of-websites",
    ]

    ...

    def parse(self, response):
        soup = BeautifulSoup(response.body)

        for link_tag in soup.findAll('td', {'class': 'half-width'}):
            _website = link_tag.find('a')['href']
            u = urlparse.urlparse(_website)
            self.allowed_domains.append(u.netloc)

            yield Request(url=_website, callback=self.parse_secondary_site)

...
asked Dec 29 '22 by Penang
2 Answers

(At the time of writing, the latest version of Scrapy is 1.0.3. This answer should work for all recent versions of Scrapy.)

The OffsiteMiddleware reads allowed_domains only once: while handling the spider_opened signal, it compiles the list into a precompiled regex object, and the list is never accessed again afterwards.
Simply updating the contents of allowed_domains therefore does not solve the problem.
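
For reference, the relevant part of the middleware looks roughly like this (a paraphrased sketch based on the Scrapy 1.0.x source linked below, not a verbatim copy):

import re

class OffsiteMiddleware(object):
    ...

    def spider_opened(self, spider):
        # The regex is compiled exactly once, right here; later changes
        # to spider.allowed_domains are never seen by the middleware.
        self.host_regex = self.get_host_regex(spider)
        self.domains_seen = set()

    def get_host_regex(self, spider):
        allowed_domains = getattr(spider, 'allowed_domains', None)
        if not allowed_domains:
            # An empty pattern matches every host: nothing is filtered.
            return re.compile('')
        domains = [re.escape(d) for d in allowed_domains if d is not None]
        return re.compile(r'^(.*\.)?(%s)$' % '|'.join(domains))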

Basically, two steps are required:

  1. Update the contents of allowed_domains as needed.
  2. Refresh the regex cache in OffsiteMiddleware.

Here is the code I use for step #2:

# Refresh the regex cache for `allowed_domains`
# (needs `import scrapy.spidermiddlewares.offsite` at module level)
for mw in self.crawler.engine.scraper.spidermw.middlewares:
    if isinstance(mw, scrapy.spidermiddlewares.offsite.OffsiteMiddleware):
        mw.spider_opened(self)

The code above is meant to be invoked inside a response callback, so self here is an instance of the spider class.
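
Putting both steps together, a parse callback could look like the minimal sketch below. The selectors and the parse_secondary_site callback are carried over from the question; buffering the URLs in new_urls before yielding is not required, but it ensures the regex cache is refreshed before any of the new requests are filtered:

import urlparse

from bs4 import BeautifulSoup
from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.spidermiddlewares.offsite import OffsiteMiddleware


class APSpider(BaseSpider):
    ...

    def parse(self, response):
        soup = BeautifulSoup(response.body)

        new_urls = []
        for link_tag in soup.findAll('td', {'class': 'half-width'}):
            _website = link_tag.find('a')['href']
            # Step 1: update the whitelist itself
            self.allowed_domains.append(urlparse.urlparse(_website).netloc)
            new_urls.append(_website)

        # Step 2: have OffsiteMiddleware recompile its domain regex
        # before the new requests are scheduled
        for mw in self.crawler.engine.scraper.spidermw.middlewares:
            if isinstance(mw, OffsiteMiddleware):
                mw.spider_opened(self)

        for url in new_urls:
            yield Request(url=url, callback=self.parse_secondary_site)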

See also:

  • Source code of scrapy.spidermiddlewares.offsite.OffsiteMiddleware on GitHub
answered Jan 21 '23 by starrify


You could try something like the following. Note that because allowed_domains is empty when the spider opens, the OffsiteMiddleware compiles a match-everything regex and filters nothing, so the whitelist has to be enforced manually inside the callback:

import urlparse

from bs4 import BeautifulSoup
from scrapy.http import Request
from scrapy.spider import BaseSpider


class APSpider(BaseSpider):
    name = "APSpider"

    start_urls = [
        "http://www.somedomain.com/list-of-websites",
    ]

    def __init__(self, *args, **kwargs):
        super(APSpider, self).__init__(*args, **kwargs)
        # Use an empty list, not None: `None` has no `append` method.
        self.allowed_domains = []

    def parse(self, response):
        soup = BeautifulSoup(response.body)

        if not self.allowed_domains:
            for link_tag in soup.findAll('td', {'class': 'half-width'}):
                _website = link_tag.find('a')['href']
                u = urlparse.urlparse(_website)
                self.allowed_domains.append(u.netloc)

                yield Request(url=_website, callback=self.parse_secondary_site)

        # Compare the request's host, not its full URL, to the whitelist
        if urlparse.urlparse(response.url).netloc in self.allowed_domains:
            yield Request(...)

...
answered Jan 21 '23 by pjob