
Can scrapy yield both request and items?

Tags:

python

scrapy

When I write the parse() function, can I yield both requests and items for a single page?

I want to extract some data from page A, store that data in the database, and also extract the links to be followed (this can be done with a Rule in CrawlSpider).

Let's call the pages linked from the A pages the B pages. I can write another parse_item() to extract data from the B pages, but I also want to extract links from the B pages. Is a Rule the only way to extract links? And how do I deal with duplicate URLs in Scrapy?

kuafu asked Dec 30 '12 18:12


People also ask

What does Scrapy request return?

Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.

How do you get a response from Scrapy request?

You can use the FormRequest.from_response() method for this job. Here's the start of an example spider which uses it:

    import scrapy

    def authentication_failed(response):
        # TODO: Check the contents of the response and return True if it failed
        # or False if it succeeded.
        pass

How do I make a Scrapy request?

Making a request is a straightforward process in Scrapy. To generate a request, you need the URL of the webpage from which you want to extract useful data. You also need a callback function. The callback function is invoked when there is a response to the request.
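The two ingredients named above, a URL and a callback, can be sketched without Scrapy itself. The MiniRequest class, parse_page callback, and fake response dict below are hypothetical stand-ins for illustration, not Scrapy APIs:

```python
class MiniRequest:
    """Hypothetical stand-in for scrapy.Request: a URL plus the
    callback to invoke once the response for that URL arrives."""
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback

def parse_page(response):
    # A callback receives the response and yields extracted data
    yield {"url": response["url"], "title": response["body"]}

# Build a request the way a spider would
request = MiniRequest("http://example.com/a", callback=parse_page)

# The engine would download the URL, then hand the response to the callback
fake_response = {"url": request.url, "body": "Page A"}
items = list(request.callback(fake_response))
print(items)  # [{'url': 'http://example.com/a', 'title': 'Page A'}]
```

In real Scrapy the download happens asynchronously, but the contract is the same: the callback is only invoked once a response for that request exists.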


2 Answers

Yes, you can yield both requests and items. From what I've seen:

from urlparse import urljoin

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    base_url = response.url
    links = hxs.select(self.toc_xpath)

    for link in links:
        href = link.select('@href').extract()
        # Schedule each extracted link to be parsed by parse2
        yield Request(urljoin(base_url, href[0]), callback=self.parse2)

    # Also yield the items parse2 extracts from the current page
    for item in self.parse2(response):
        yield item
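The mechanism here is just that parse() is a generator: the engine inspects each yielded object and routes Request objects to the scheduler and everything else to the item pipelines. A minimal pure-Python sketch of that routing (the route() helper and stand-in Request class are illustrative simplifications, not Scrapy internals):

```python
class Request:
    # Stand-in for scrapy.Request: a URL plus an optional callback
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback

def parse(response):
    # One callback can yield both follow-up requests and items
    yield Request(response["url"] + "/next")
    yield {"title": response["title"]}

def route(outputs):
    # Sketch of the engine's routing: requests are scheduled,
    # anything else goes to the item pipelines
    scheduled, items = [], []
    for out in outputs:
        (scheduled if isinstance(out, Request) else items).append(out)
    return scheduled, items

scheduled, items = route(parse({"url": "http://example.com/a", "title": "Page A"}))
print(len(scheduled), items)  # 1 [{'title': 'Page A'}]
```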
Cacovsky answered Nov 15 '22 16:11


I'm not 100% sure I understand your question, but the code below requests pages from a starting URL using BaseSpider, scans the start page for hrefs, then loops over each link calling parse_url. Everything matched in parse_url is sent to your item pipeline.

import urlparse

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    # Only grab URLs with "content" in the href
    urls = hxs.select('//a[contains(@href, "content")]/@href').extract()
    for i in urls:
        yield Request(urlparse.urljoin(response.url, i[1:]), callback=self.parse_url)

def parse_url(self, response):
    hxs = HtmlXPathSelector(response)
    item = ZipgrabberItem()
    # Grab the text of every matching div into the item
    item['zip'] = hxs.select("//div[contains(@class,'odd')]/text()").extract()
    return item
Chris Hawkes answered Nov 15 '22 14:11
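As for the duplicate URLs in the original question: by default Scrapy's scheduler drops requests it has already seen (the duplicates filter, configurable via the DUPEFILter_CLASS setting; pass dont_filter=True to a Request to bypass it). The idea can be sketched with a simple seen-set; this is a simplification for illustration, not Scrapy's actual fingerprint-based implementation:

```python
class SimpleDupeFilter:
    """Simplified sketch of a duplicates filter: remember every URL seen
    so far and reject repeats. (Scrapy actually fingerprints the whole
    request, method and body included, not just the URL string.)"""
    def __init__(self):
        self.seen = set()

    def request_seen(self, url):
        if url in self.seen:
            return True
        self.seen.add(url)
        return False

f = SimpleDupeFilter()
print(f.request_seen("http://example.com/b"))  # False: first time, gets scheduled
print(f.request_seen("http://example.com/b"))  # True: duplicate, dropped
```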