I am new to Scrapy, and what I am trying to do is make a crawler which will only follow the links inside an HTML element on the given start_urls.
Just as an example, let's say I want a crawler to go through the AirBnB listings, with start_urls
set to https://www.airbnb.com/s?location=New+York%2C+NY&checkin=&checkout=&guests=1
Instead of crawling all the links on the page, I just want to crawl the links inside the XPath //*[@id="results"].
Currently I am using the following code to crawl all the links. How can I adapt it to crawl only //*[@id="results"]?
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from tutorial.items import DmozItem

class BSpider(CrawlSpider):
    name = "bt"
    #follow = True
    allowed_domains = ["mydomain.com"]
    start_urls = ["http://myurl.com/path"]

    rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),)

    def parse_item(self, response):
        {parse code}
Any tip pointing me in the right direction would be much appreciated. Thanks!
You can pass a restrict_xpaths keyword argument to SgmlLinkExtractor. From the docs:

restrict_xpaths (str or list) – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links.
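Applied to your spider, a minimal sketch (keeping your class, domain, and callback names; the XPath is the one from your question):

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class BSpider(CrawlSpider):
    name = "bt"
    allowed_domains = ["mydomain.com"]
    start_urls = ["http://myurl.com/path"]

    # Only links found inside //*[@id="results"] are extracted and followed
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//*[@id="results"]',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # your parsing code here
        pass

Everything else in the spider stays the same; the link extractor simply ignores any anchors outside the region matched by restrict_xpaths.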