 

How can I use multiple requests and pass items between them in Scrapy (Python)?

Tags: python, scrapy

I have an item object and I need to pass it along many pages so that the data ends up in a single item.

My item looks like this:

    class DmozItem(Item):
        title = Field()
        description1 = Field()
        description2 = Field()
        description3 = Field()

Those three descriptions are on three separate pages, so I want to do something like the following.

This works fine for parseDescription1:

    def page_parser(self, response):
        sites = hxs.select('//div[@class="row"]')
        items = []
        item = DmozItem()
        request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
        request.meta['item'] = item
        return request

    def parseDescription1(self, response):
        item = response.meta['item']
        item['desc1'] = "test"
        return item

But I want something like this:

    def page_parser(self, response):
        sites = hxs.select('//div[@class="row"]')
        items = []
        item = DmozItem()
        request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
        request.meta['item'] = item

        request = Request("http://www.example.com/lin2.cpp", callback=self.parseDescription2)
        request.meta['item'] = item

        request = Request("http://www.example.com/lin3.cpp", callback=self.parseDescription3)
        request.meta['item'] = item

        return request

    def parseDescription1(self, response):
        item = response.meta['item']
        item['desc1'] = "test"
        return item

    def parseDescription2(self, response):
        item = response.meta['item']
        item['desc2'] = "test2"
        return item

    def parseDescription3(self, response):
        item = response.meta['item']
        item['desc3'] = "test3"
        return item
asked Dec 17 '12 by user1858027

People also ask

How do I pass parameters in Scrapy request?

This is an old topic, but for anyone who needs it: to pass an extra parameter to a callback, use cb_kwargs, then accept that parameter in the callback method. You can refer to this part of the Scrapy documentation.
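A minimal sketch, assuming Scrapy 1.7+ (where cb_kwargs was introduced); the spider name, URL, and field names are placeholders:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        # Hypothetical spider; names and URLs are illustrative only.
        name = "example"
        start_urls = ["http://www.example.com/"]

        def parse(self, response):
            # Anything placed in cb_kwargs is passed to the callback
            # as a regular keyword argument.
            yield scrapy.Request(
                "http://www.example.com/lin1.cpp",
                callback=self.parse_description,
                cb_kwargs={"item": {"title": "test"}},
            )

        def parse_description(self, response, item):
            item["desc1"] = "test"
            yield item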

How do you use Scrapy request in Python?

You construct a scrapy.Request with a URL and, optionally, a callback, then yield it from a spider method such as start_requests or parse. Scrapy schedules and downloads the request and passes the resulting Response to the callback. The Request constructor also accepts arguments such as method, headers, body, meta, cb_kwargs, and dont_filter.
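A minimal sketch of that pattern; the spider name and URL are placeholders:

    import scrapy

    class MinimalSpider(scrapy.Spider):
        # Hypothetical spider; the URL is illustrative only.
        name = "minimal"

        def start_requests(self):
            # Build the request explicitly and hand it to Scrapy by yielding it.
            yield scrapy.Request("http://www.example.com/", callback=self.parse)

        def parse(self, response):
            # The callback receives the downloaded Response.
            yield {"url": response.url, "status": response.status}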

What is callback in Scrapy?

In the callback function, you parse the response (web page) and return either Item objects, Request objects, or an iterable of both. Those requests may also carry a callback (possibly the same one); Scrapy downloads them in turn and hands each response to its specified callback.
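A short sketch of a callback that yields both kinds of objects; the spider name, URL, and selectors are placeholders:

    import scrapy

    class CrawlSketchSpider(scrapy.Spider):
        # Hypothetical spider; URL and selectors are illustrative only.
        name = "crawl_sketch"
        start_urls = ["http://www.example.com/"]

        def parse(self, response):
            # A callback may yield items...
            yield {"title": response.css("title::text").get()}
            # ...and further requests, each handled by its own callback.
            for href in response.css("a::attr(href)").getall():
                yield response.follow(href, callback=self.parse)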


2 Answers

No problem. The following is a corrected version of your code:

    def page_parser(self, response):
        sites = hxs.select('//div[@class="row"]')
        items = []
        item = DmozItem()

        request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
        request.meta['item'] = item
        yield request

        request = Request("http://www.example.com/lin2.cpp", callback=self.parseDescription2, meta={'item': item})
        yield request

        yield Request("http://www.example.com/lin3.cpp", callback=self.parseDescription3, meta={'item': item})

    def parseDescription1(self, response):
        item = response.meta['item']
        item['desc1'] = "test"
        return item

    def parseDescription2(self, response):
        item = response.meta['item']
        item['desc2'] = "test2"
        return item

    def parseDescription3(self, response):
        item = response.meta['item']
        item['desc3'] = "test3"
        return item
answered Sep 21 '22 by warvariuc

In order to guarantee an ordering of the requests/callbacks, and that only one item is ultimately returned, you need to chain your requests using a form like this:

    def page_parser(self, response):
        sites = hxs.select('//div[@class="row"]')
        items = []

        request = Request("http://www.example.com/lin1.cpp", callback=self.parseDescription1)
        request.meta['item'] = Item()
        return [request]

    def parseDescription1(self, response):
        item = response.meta['item']
        item['desc1'] = "test"
        return [Request("http://www.example.com/lin2.cpp", callback=self.parseDescription2, meta={'item': item})]

    def parseDescription2(self, response):
        item = response.meta['item']
        item['desc2'] = "test2"
        return [Request("http://www.example.com/lin3.cpp", callback=self.parseDescription3, meta={'item': item})]

    def parseDescription3(self, response):
        item = response.meta['item']
        item['desc3'] = "test3"
        return [item]

Each callback function returns an iterable of items or requests; requests are scheduled and items are run through your item pipeline.

If you return an item from each of the callbacks, you'll end up with four items in various states of completeness in your pipeline, but if you return the next request instead, then you can guarantee the order of the requests and that you will have exactly one item at the end of execution.
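On current Scrapy versions the same chain is often written with cb_kwargs instead of meta; a sketch under the question's placeholder URLs, assuming Scrapy 1.7+ and a plain dict in place of DmozItem:

    import scrapy

    class ChainedSpider(scrapy.Spider):
        # Hypothetical spider illustrating the chained pattern with cb_kwargs.
        name = "chained"
        start_urls = ["http://www.example.com/"]

        def parse(self, response):
            item = {}
            yield scrapy.Request("http://www.example.com/lin1.cpp",
                                 callback=self.parse_description1,
                                 cb_kwargs={"item": item})

        def parse_description1(self, response, item):
            item["desc1"] = "test"
            yield scrapy.Request("http://www.example.com/lin2.cpp",
                                 callback=self.parse_description2,
                                 cb_kwargs={"item": item})

        def parse_description2(self, response, item):
            item["desc2"] = "test2"
            yield scrapy.Request("http://www.example.com/lin3.cpp",
                                 callback=self.parse_description3,
                                 cb_kwargs={"item": item})

        def parse_description3(self, response, item):
            item["desc3"] = "test3"
            # Exactly one fully populated item reaches the pipeline.
            yield item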

answered Sep 20 '22 by Dave McLain