I am new to Scrapy, and what I am trying to do is make a crawler which will only follow the links inside an HTML element on the given start_urls.
Just as an example, let's say I want a crawler to go through the AirBnB listings, with start_urls
set to https://www.airbnb.com/s?location=New+York%2C+NY&checkin=&checkout=&guests=1
Instead of crawling all the links on the page, I just want to crawl the links inside the XPath //*[@id="results"].
Currently I am using the following code to crawl all the links. How can I adapt it to crawl only //*[@id="results"]?
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem

class BSpider(CrawlSpider):
    name = "bt"
    #follow = True
    allowed_domains = ["mydomain.com"]
    start_urls = ["http://myurl.com/path"]

    rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),)

    def parse_item(self, response):
        {parse code}
Any tip pointing me in the right direction would be much appreciated. Thanks!
You can pass a restrict_xpaths keyword argument to SgmlLinkExtractor. From the docs:

restrict_xpaths (str or list) – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links.
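Applied to your spider, a minimal sketch (keeping your class, domain, and callback names; the XPath is the one from your question):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class BSpider(CrawlSpider):
    name = "bt"
    allowed_domains = ["mydomain.com"]
    start_urls = ["http://myurl.com/path"]

    # Only links found inside //*[@id="results"] are extracted and followed
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//*[@id="results"]',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # your parsing code here
        pass

Everything else in the spider stays the same; the link extractor simply ignores any anchors outside the region matched by restrict_xpaths.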