How can I tell Scrapy to only crawl links inside an XPath?

I am new to Scrapy, and I am trying to make a crawler that will only follow the links inside an HTML element on the given start_urls.

Just as an example, let's say I want a crawler to go through the Airbnb listings, with start_urls set to https://www.airbnb.com/s?location=New+York%2C+NY&checkin=&checkout=&guests=1

Instead of crawling all the links on the page, I just want to crawl the links inside the XPath //*[@id="results"].

Currently I am using the following code to crawl all the links. How can I adapt it to crawl only //*[@id="results"]?

    from scrapy.selector import HtmlXPathSelector
    from tutorial.items import DmozItem
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


    class BSpider(CrawlSpider):
        name = "bt"
        allowed_domains = ["mydomain.com"]
        start_urls = ["http://myurl.com/path"]
        rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),)

        def parse_item(self, response):
            {parse code}

Any tip in the right direction would be much appreciated. Thanks!

asked Dec 25 '12 by JordanBelf

1 Answer

You can pass a restrict_xpaths keyword argument to SgmlLinkExtractor. From the docs:

  • restrict_xpaths (str or list) – is a XPath (or list of XPath’s) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPath will be scanned for links.
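
Applied to the spider from the question, that would look something like the sketch below (a minimal example: the start URL is the Airbnb search page from the question, and parse_item is assumed to contain your own parsing logic):

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


    class BSpider(CrawlSpider):
        name = "bt"
        allowed_domains = ["airbnb.com"]
        start_urls = ["https://www.airbnb.com/s?location=New+York%2C+NY&checkin=&checkout=&guests=1"]

        # Only links found inside //*[@id="results"] are extracted and
        # followed; links anywhere else on the page are ignored.
        rules = (
            Rule(SgmlLinkExtractor(restrict_xpaths='//*[@id="results"]'),
                 callback='parse_item',
                 follow=True),
        )

        def parse_item(self, response):
            # your parsing logic goes here
            pass

Since restrict_xpaths also accepts a list, you can pass several regions at once, e.g. restrict_xpaths=['//*[@id="results"]', '//div[@class="pagination"]'] (the second XPath is just a hypothetical example to illustrate the list form).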
answered Nov 01 '22 by Shane Evans