Make Scrapy follow links and collect data

I am trying to write a program in Scrapy that opens links and collects data from this tag: <p class="attrgroup"></p>.

I've managed to make Scrapy collect all the links from a given URL, but not to follow them. Any help is very much appreciated.

Asked by Arkan Kalu on May 10 '15.

People also ask

What does Response follow do in Scrapy?

Response.follow() uses the href attribute automatically, so you can pass a link selector to it directly. Scrapy can also issue multiple requests at once with the follow_all() method; the beauty of this is that follow_all() accepts CSS and XPath expressions directly.


1 Answer

You need to yield Request instances for the links you want to follow, assign a callback, and extract the text of the desired p element in that callback:

# -*- coding: utf-8 -*-
import scrapy


# item class included here 
class DmozItem(scrapy.Item):
    # define the fields for your item here like:
    link = scrapy.Field()
    attr = scrapy.Field()


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["craigslist.org"]
    start_urls = [
    "http://chicago.craigslist.org/search/emd?"
    ]

    BASE_URL = 'http://chicago.craigslist.org/'

    def parse(self, response):
        links = response.xpath('//a[@class="hdrlnk"]/@href').extract()
        for link in links:
            absolute_url = self.BASE_URL + link
            yield scrapy.Request(absolute_url, callback=self.parse_attr)

    def parse_attr(self, response):
        item = DmozItem()
        item["link"] = response.url
        item["attr"] = "".join(response.xpath("//p[@class='attrgroup']//text()").extract())
        return item
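To see what the XPath `//p[@class='attrgroup']//text()` followed by `"".join(...)` produces, here is a stdlib stand-in using html.parser (the HTML snippet is hypothetical; a real Craigslist page would have more markup):

```python
from html.parser import HTMLParser

# Collect every text node nested inside <p class="attrgroup">,
# mirroring the XPath //p[@class='attrgroup']//text() in the callback.
class AttrGroupParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while inside the target <p>
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1     # nested tag inside the target <p>
        elif tag == "p" and dict(attrs).get("class") == "attrgroup":
            self.depth = 1      # entered the target <p>

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.texts.append(data)

html = '<p class="attrgroup"><span>condition: new</span> <span>size: 40in</span></p>'
parser = AttrGroupParser()
parser.feed(html)
print("".join(parser.texts))  # condition: new size: 40in
```

Note that `//text()` also picks up the whitespace between the spans, which is why the joined string keeps the space separating the two attributes.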
Answered by alecxe on Oct 15 '22.