
Restrict scrapy from crawling subdomains

I have about 200 domains that I need to crawl, but I am certain that no information valuable to me is contained in their subdomains, so I would like to exclude them from crawling.

For a single domain such as example.com I could write a deny rule, but with this approach I would have to write a separate deny rule for each of the 200 domains. My question is whether it is possible to create one deny rule that covers all subdomains of every domain.

Snippet from the spider:

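from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class Spider(CrawlSpider):
    name = "courses"
    start_urls = [
        # start URLs not shown
    ]

    allowed_domains = ['eb-zuerich.ch']

    rules = [
        # Skip links whose URL matches any of these patterns.
        Rule(LinkExtractor(
                 deny=(r'.+[sS]itemap', r'.+[uU]eber', r'.+[kK]ontakt', r'.+[iI]mpressum',
                       r'.+[lL]ogin', r'.+[dD]ownload[s]?', r'.+[dD]isclaimer',
                       r'.+[nN]ews', r'.+[tT]erm', r'.+[aA]nmeldung.+',
                       r'.+[Aa][Gg][Bb]', r'/en/*', r'\.pdf$')),
             callback='parse_item', follow=True),
    ]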

    def parse_item(self, response):

        # get soup of the current page
        soup = bs(response.body, 'html.parser')
        page_soup = bs(response.body, 'html.parser')

        # check if it is a course description page
        ex = Extractor(response.url, soup, page_soup)
        is_course = ex.is_course_page()
        if is_course:
            ...  # rest of the method not shown

I am using Scrapy 1.4.0 and Python 3.6.1.

Pedro Loureiro asked Nov 07 '22 19:11


1 Answer

My question is whether it is possible to create a deny rule for all subdomains of every domain?

With a simplistic approach (ignoring top-level domain names like .co.uk):
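
A minimal sketch of that idea, not necessarily the exact pattern the answer intended: treat any URL whose host has three or more dot-separated labels as a subdomain. The names SUBDOMAIN_URL and is_subdomain_url are hypothetical, chosen for illustration. Scrapy's LinkExtractor applies its deny patterns to absolute URLs with a regex search, so the anchored pattern below should behave the same way there as it does with re.search here.

```python
import re

# Hypothetical deny pattern (an assumption, not taken from the answer):
# matches absolute http(s) URLs whose host has three or more labels,
# e.g. "blog.example.com", but not a bare "example.com".
# A label is any run of characters other than . / ? # :
SUBDOMAIN_URL = r'^https?://(?:[^./?#:]+\.){2,}[^./?#:]+(?:[/?#:]|$)'


def is_subdomain_url(url):
    """Return True if the URL's host looks like a subdomain."""
    return re.search(SUBDOMAIN_URL, url) is not None


if __name__ == "__main__":
    for url in (
        "https://example.com/courses",
        "https://blog.example.com/courses",
        "https://www.example.com/",      # "www" counts as a subdomain here
        "https://example.co.uk/",        # known false positive for .co.uk
    ):
        print(url, is_subdomain_url(url))
```

To use it, the SUBDOMAIN_URL string could be added to the deny tuple of the spider's LinkExtractor. Two caveats follow from the simplification: hosts under multi-part suffixes such as .co.uk are denied even without a subdomain, and www hosts are denied as well, so sites that are only reachable under www would need an exception.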


Gallaecio answered Dec 18 '22 10:12
