I am trying to write a program in Scrapy that opens links and collects data from this tag: <p class="attrgroup"></p>. I've managed to make Scrapy collect all the links from a given URL, but not to follow them. Any help is much appreciated.
response.follow() accepts relative URLs and, when given an <a> selector, uses its href attribute automatically. Scrapy can also create several requests at once with response.follow_all(), and the nice part is that follow_all() accepts css and xpath arguments directly.
You need to yield Request instances for the links to follow, assign a callback, and extract the text of the desired p element in the callback:
# -*- coding: utf-8 -*-
import scrapy


# item class included here
class DmozItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
    attr = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["craigslist.org"]
    start_urls = [
        "http://chicago.craigslist.org/search/emd?"
    ]

    def parse(self, response):
        # collect the href of every listing link on the search page
        links = response.xpath('//a[@class="hdrlnk"]/@href').extract()
        for link in links:
            # resolve relative hrefs against the page URL
            # (avoids the doubled slash that BASE_URL + link can produce)
            absolute_url = response.urljoin(link)
            yield scrapy.Request(absolute_url, callback=self.parse_attr)

    def parse_attr(self, response):
        # called once per followed link; build the item from the detail page
        item = DmozItem()
        item["link"] = response.url
        item["attr"] = "".join(response.xpath("//p[@class='attrgroup']//text()").extract())
        return item
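When turning extracted hrefs into absolute URLs, joining against the page URL tends to be more robust than concatenating strings. The sketch below uses the standard library's urllib.parse.urljoin to show the difference; the sample hrefs and the helper names concat and join are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "http://chicago.craigslist.org/"
PAGE_URL = "http://chicago.craigslist.org/search/emd?"


def concat(href):
    # Naive approach: glue the href onto a fixed base URL.
    return BASE_URL + href


def join(href):
    # Resolve the href relative to the page it was found on.
    return urljoin(PAGE_URL, href)


# Hypothetical hrefs: root-relative, and already absolute.
for href in ["/emd/123.html", "http://chicago.craigslist.org/emd/456.html"]:
    print(concat(href), "|", join(href))
```

With a root-relative href, concatenation yields a doubled slash after the host, and with an already absolute href it yields an invalid URL, while urljoin handles both cases.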