I want to use Scrapy to crawl a website whose pages are spread across a lot of subdomains. I know I need a CrawlSpider with a Rule, but I need the Rule to simply allow all subdomains and let the parsers handle themselves according to the data (meaning, in the example below, the item_links are on different subdomains).
Example code:
# imports used by these callbacks
from scrapy.selector import Selector
from scrapy.http import Request

def parse_page(self, response):
    sel = Selector(response)
    item_links = sel.xpath("XXXXXXXXX").extract()
    for item_link in item_links:
        item_request = Request(url=item_link,
                               callback=self.parse_item)
        yield item_request

def parse_item(self, response):
    sel = Selector(response)
** EDIT **
Just to make the question clear: I want the ability to crawl all of *.example.com, meaning requests should not be rejected with Filtered offsite request to 'foo.example.com'.
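For reference, the spider that hits this problem looks roughly like the sketch below (the class name, start URL, and rule are placeholders, not the real ones):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class ExampleSpider(CrawlSpider):
    name = "example"
    # Restricting this to the 'www' host is what makes the offsite middleware
    # drop requests to foo.example.com as "Filtered offsite request".
    allowed_domains = ["www.example.com"]
    start_urls = ["http://www.example.com/"]

    rules = (
        Rule(SgmlLinkExtractor(), callback="parse_page", follow=True),
    )

    # parse_page / parse_item as shown above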
** ANOTHER EDIT **
Following @agstudy's answer, make sure you don't forget to delete allowed_domains = ["www.example.com"] from your spider.
If you are not using rules, but are making use of the allowed_domains class attribute of the Spider, you can also set allowed_domains = ['example.com']. That will allow all subdomains of example.com, such as foo.example.com.
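For instance, a minimal sketch (the spider name, start URL, and link-following logic are made up for illustration, using current Scrapy imports):

import scrapy


class MySpider(scrapy.Spider):
    name = "example"
    # No 'www.' prefix: requests to foo.example.com, bar.example.com, etc.
    # are not dropped by the offsite middleware.
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # Follow every link; any *.example.com URL passes the offsite filter.
        for href in response.xpath("//a/@href").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)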
You can set an allow_domains list for the rule:

rules = (
    Rule(SgmlLinkExtractor(allow_domains=('domain1', 'domain2'))),
)
For example:

rules = (
    Rule(SgmlLinkExtractor(allow_domains=('example.com', 'example1.com'))),
)
This will allow URLs like:
www.example.com/blaa/bla/
www.example1.com/blaa/bla/
www.something.example.com/blaa/bla/
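Putting it together, a complete spider using this rule could look like the sketch below (the start URL and names are placeholders; in newer Scrapy versions SgmlLinkExtractor is replaced by LinkExtractor, which accepts the same allow_domains argument):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class SubdomainSpider(CrawlSpider):
    name = "subdomains"
    # Keep allowed_domains at the registered-domain level (or remove it) so the
    # offsite middleware does not undo what allow_domains permits.
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    rules = (
        # allow_domains keeps the crawl on example.com and all of its subdomains.
        Rule(SgmlLinkExtractor(allow_domains=("example.com",)),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # Same body as in the question: extract the item links and yield
        # Requests with callback=self.parse_item.
        pass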