When I write a parse() function, can I yield both a request and items for one single page?
I want to extract some data from page A and store it in a database, and also extract the links to be followed (this can be done with a rule in CrawlSpider).
Let me call the pages that page A links to the B pages. I can write another parse_item() to extract data from the B pages, but I also want to extract some links from the B pages, so is a rule the only way to extract links? And how do I deal with duplicate URLs in Scrapy?
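In other words, the structure being asked about looks roughly like the following sketch, where the spider name, URLs, and XPath expressions are invented placeholders:

import scrapy

class ABSpider(scrapy.Spider):
    name = 'ab_spider'                          # hypothetical spider name
    start_urls = ['http://example.com/a']       # placeholder URL for an A page

    def parse(self, response):
        # A page: yield the items for this page's data ...
        for value in response.xpath('//div[@class="data"]/text()').extract():   # hypothetical XPath
            yield {'value': value}
        # ... and, from the same callback, yield requests for the B pages
        for href in response.xpath('//a[@class="to-b"]/@href').extract():       # hypothetical XPath
            yield scrapy.Request(response.urljoin(href), callback=self.parse_item)

    def parse_item(self, response):
        # B page: again yield both an item and further requests
        yield {'b_value': response.xpath('//h1/text()').extract_first()}        # hypothetical XPath
        for href in response.xpath('//a[@class="next"]/@href').extract():       # hypothetical XPath
            yield scrapy.Request(response.urljoin(href), callback=self.parse_item)
        # Scrapy's built-in duplicate filter drops requests whose URL has
        # already been seen, so re-discovered links are silently ignored.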
Scrapy uses Request and Response objects for crawling web sites. Typically, Request objects are generated in the spiders and pass across the system until they reach the Downloader, which executes the request and returns a Response object which travels back to the spider that issued the request.
You can use the FormRequest.from_response() method for this job. Here's the start of an example spider which uses it:

import scrapy

def authentication_failed(response):
    # TODO: Check the contents of the response and return True if it failed
    # or False if it succeeded.
    pass
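The snippet above is cut off; the rest of that pattern, sketched after the login example in the Scrapy documentation (the URLs, username, and password are placeholders):

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        # Fill in and submit the login form found in the response
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error('Login failed')
            return
        # Logged in successfully; continue crawling from here.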
Making a request is a straightforward process in Scrapy. To generate a request, you need the URL of the webpage from which you want to extract useful data. You also need a callback function. The callback function is invoked when there is a response to the request.
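A minimal illustration of that, with a placeholder URL and a hypothetical parse_page callback:

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://example.com/']

    def parse(self, response):
        # The callback (parse_page) is invoked once the downloader returns a response
        yield scrapy.Request('http://example.com/some-page', callback=self.parse_page)

    def parse_page(self, response):
        yield {'title': response.xpath('//title/text()').extract_first()}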
Yes, you can yield both requests and items. From what I've seen:
from urlparse import urljoin
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    base_url = response.url
    links = hxs.select(self.toc_xpath)
    for index, link in enumerate(links):
        href, text = link.select('@href').extract(), link.select('text()').extract()
        # Follow each link; parse2 will be called with its response
        yield Request(urljoin(base_url, href[0]), callback=self.parse2)

    # Also extract items from the current page by reusing parse2
    for item in self.parse2(response):
        yield item
I'm not 100% sure I understand your question, but the code below requests sites from a starting URL using BaseSpider, then scans the starting URL for hrefs and loops over each link, calling parse_url. Everything matched in parse_url is sent to your item pipeline.
import urlparse
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from zipgrabber.items import ZipgrabberItem  ## hypothetical items module for this project

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    urls = hxs.select('//a[contains(@href, "content")]/@href').extract()  ## only grab urls with "content" in the url name
    for i in urls:
        yield Request(urlparse.urljoin(response.url, i[1:]), callback=self.parse_url)

def parse_url(self, response):
    hxs = HtmlXPathSelector(response)
    item = ZipgrabberItem()
    item['zip'] = hxs.select("//div[contains(@class,'odd')]/text()").extract()  ## this grabs the zip text
    return item