
Scrapy: Passing item between methods

Tags: python, scrapy

Suppose I have a BookItem and need to add information to it in both the parse phase and the detail phase:

def parse(self, response):
    data = json.loads(response.text)
    for book in data['result']:
        item = BookItem()
        item['id'] = book['id']
        url = book['url']
        yield Request(url, callback=self.detail)

def detail(self, response):
    hxs = HtmlXPathSelector(response)
    item['price'] = ......
    # I want to continue with the same book item from the for loop above

Using the code as is leads to an undefined item in the detail phase. How can I pass the item to detail? Changing the signature to detail(self, response, item) doesn't seem to work.

Dionysian asked Dec 18 '13 16:12


3 Answers

There is an argument named meta for Request:

yield Request(url, callback=self.detail, meta={'item': item})

Then, in the detail callback, access it this way:

item = response.meta['item']

See the Scrapy documentation on Request.meta for more details.

iMom0 answered Oct 22 '22 09:10


You can define a variable in the __init__ method:

class MySpider(BaseSpider):
    ...

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Caveat: this single attribute is shared by every request the spider makes.
        self.item = None

    def parse(self, response):
        data = json.loads(response.text)
        for book in data['result']:
            self.item = BookItem()
            self.item['id'] = book['id']
            url = book['url']
            yield Request(url, callback=self.detail)

    def detail(self, response):
        hxs = HtmlXPathSelector(response)
        self.item['price'] = ....
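
One caveat with keeping the item on the spider: Scrapy downloads pages asynchronously, so detail usually runs after the loop in parse has already moved on, and self.item may by then belong to a different book. The following is a minimal pure-Python simulation of that effect, not Scrapy code. FakeRequest, crawl, BOOKS and PRICES are hypothetical stand-ins, and crawl deliberately queues every request before running any callback so the effect is easy to see.

```python
# Hypothetical sample data standing in for the JSON listing and the detail pages.
BOOKS = [{'id': 1, 'url': 'a'}, {'id': 2, 'url': 'b'}]
PRICES = {'a': 10, 'b': 20}


class FakeRequest:
    """Stand-in for a scheduled request; this is NOT Scrapy's Request class."""

    def __init__(self, url, callback, cb_kwargs=None):
        self.url = url
        self.callback = callback
        self.cb_kwargs = cb_kwargs or {}


class SharedStateSpider:
    """Keeps the current item on the spider instance."""

    def __init__(self):
        self.item = None

    def parse(self, books):
        for book in books:
            self.item = {'id': book['id']}  # overwritten on every iteration
            yield FakeRequest(book['url'], callback=self.detail)

    def detail(self, url):
        self.item['price'] = PRICES[url]
        return dict(self.item)


class PerRequestSpider:
    """Attaches each item to the request that will complete it."""

    def parse(self, books):
        for book in books:
            item = {'id': book['id']}
            yield FakeRequest(book['url'], callback=self.detail,
                              cb_kwargs={'item': item})

    def detail(self, url, item):
        item['price'] = PRICES[url]
        return item


def crawl(spider, books):
    # Queue every request first, then run the callbacks, to mimic responses
    # arriving after parse() has finished looping.
    pending = list(spider.parse(books))
    return [req.callback(req.url, **req.cb_kwargs) for req in pending]


shared = crawl(SharedStateSpider(), BOOKS)
per_request = crawl(PerRequestSpider(), BOOKS)
print([i['id'] for i in shared])       # both prices land on the last book's item
print([i['id'] for i in per_request])  # each price lands on its own book
```

In this simulation the shared-state version attaches both prices to the item for book 2, while the per-request version keeps each id paired with its own price.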
greg answered Oct 22 '22 08:10


iMom0's approach still works, but as of Scrapy 1.7 the recommended approach is to pass user-defined information through cb_kwargs and leave meta for middlewares, extensions, etc.:

def parse(self, response):
    ....
    yield Request(url, callback=self.detail, cb_kwargs={'item': item})

def detail(self, response, item):
    item['price'] = ......

You could also pass the individual key-values into the cb_kwargs argument and then only instantiate the BookItem instance in the final callback (detail in this case):

def parse(self, response):
    data = json.loads(response.text)
    for book in data['result']:
        yield Request(book['url'],
                      callback=self.detail,
                      cb_kwargs=dict(id_=book['id'],
                                     url=book['url']))

def detail(self, response, id_, url):
    # Build the item only here, once all the data is available.
    item = BookItem()
    item['id'] = id_
    item['url'] = url
    item['price'] = ......

tbrk answered Oct 22 '22 09:10