scrapy allow all subdomains

Tags: python, scrapy

I want to use Scrapy to crawl a website whose pages are divided across a lot of subdomains. I know I need a CrawlSpider with a Rule, but I need the Rule to simply be "allow all subdomains and let the parsers handle themselves according to the data" (meaning: in the example, the item_links are in different subdomains).

Example of the code:

def parse_page(self, response):
    sel = Selector(response)
    item_links = sel.xpath("XXXXXXXXX").extract()
    for item_link in item_links:
        item_request = Request(url=item_link,
                               callback=self.parse_item)
        yield item_request

def parse_item(self, response):
    sel = Selector(response)

** EDIT ** Just to make the question clear: I want the ability to crawl all of *.example.com, i.e. not to get "Filtered offsite request to 'foo.example.com'".

** ANOTHER EDIT ** Following @agstudy's answer, make sure you don't forget to delete allowed_domains = ["www.example.com"]
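(For reference: the offsite filtering is a suffix match on the request's host. Scrapy's url_is_from_any_domain helper performs the same kind of check and can be used to verify which hosts a given allowed_domains entry covers; a minimal sketch, with made-up URLs:)

from scrapy.utils.url import url_is_from_any_domain

# A bare registrable domain matches itself and every subdomain of it ...
url_is_from_any_domain('http://foo.example.com/page', ['example.com'])      # True
# ... while a www-prefixed entry matches only that exact host.
url_is_from_any_domain('http://foo.example.com/page', ['www.example.com'])  # False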

asked Jun 01 '14 by Boaz

2 Answers

If you are not using rules, but are making use of the allowed_domains class attribute of the Spider, you can also set allowed_domains = ['example.com']. That will allow all subdomains of example.com such as foo.example.com.
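A minimal sketch of that approach (the spider name, start URL, and XPath are illustrative, not from the answer; it assumes Scrapy 1.x or later for response.urljoin and the top-level imports):

import scrapy

class ItemsSpider(scrapy.Spider):
    name = 'items'
    # The bare domain also covers www.example.com, foo.example.com, etc.
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/']

    def parse(self, response):
        for href in response.xpath('//a/@href').extract():
            # Not filtered as offsite, whichever subdomain the link points to.
            yield scrapy.Request(response.urljoin(href), callback=self.parse)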

answered by bartaelterman


You can set an allow_domains list on the rule's link extractor:

rules = (
    Rule(SgmlLinkExtractor(allow_domains=('domain1', 'domain2'))),
)

For example:

rules = (
    Rule(SgmlLinkExtractor(allow_domains=('example.com', 'example1.com'))),
)

This will allow URLs like:

www.example.com/blaa/bla/
www.example1.com/blaa/bla/
www.something.example.com/blaa/bla/
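
Put together with the callbacks from the question, the whole spider might look roughly like this (a sketch: the spider name and start URL are illustrative, the XPath placeholder is the question's, and the scrapy.contrib import paths match the SgmlLinkExtractor used above; newer Scrapy versions use scrapy.linkextractors.LinkExtractor instead):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.http import Request
from scrapy.selector import Selector

class ExampleSpider(CrawlSpider):
    name = 'example'
    # No allowed_domains = ["www.example.com"] here -- see the question's edit.
    start_urls = ['http://www.example.com/']

    rules = (
        # Follow links on example.com and any of its subdomains,
        # handing each fetched page to parse_page.
        Rule(SgmlLinkExtractor(allow_domains=('example.com',)),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        sel = Selector(response)
        for item_link in sel.xpath("XXXXXXXXX").extract():
            yield Request(url=item_link, callback=self.parse_item)

    def parse_item(self, response):
        sel = Selector(response)
        # ... extract item fields here ...
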
answered by agstudy