What's the best approach to write contracts for Scrapy spiders that have more than one method to parse the response? I saw this answer but it didn't sound very clear to me.
My current example: I have a method called parse_product that extracts the information on a page, but I need more data for the same product from another page, so at the end of this method I yield a new request and let the new callback extract these fields and return the item.
The problem is that if I write a contract for the second method, it will fail because it doesn't have the meta attribute (which contains the item with most of the fields). If I write a contract for the first method, I can't check that it returns the fields, because it returns a new request instead of the item.
def parse_product(self, response):
    il = ItemLoader(item=ProductItem(), response=response)
    # populate the item in here
    # yield the new request sending the ItemLoader to another callback
    yield scrapy.Request(new_url, callback=self.parse_images, meta={'item': il})

def parse_images(self, response):
    """
    @url http://foo.bar
    @returns items 1 1
    @scrapes field1 field2 field3
    """
    il = response.request.meta['item']
    # extract the new fields and add them to the item in here
    yield il.load_item()
In the example, I put the contract in the second method, but it gave me a KeyError exception on response.request.meta['item']; also, the fields field1 and field2 are populated in the first method.
Hope it's clear enough.
Frankly, I don't use Scrapy contracts, and I don't really recommend that anyone use them either. They have many issues and may someday be removed from Scrapy.
In practice, I haven't had much luck using unit tests for spiders.
For testing spiders during development, I'd enable the cache and then re-run the spider as many times as needed to get the scraping right.
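Concretely, Scrapy's built-in HTTP cache is turned on from the project settings. A minimal sketch (these are standard Scrapy setting names; the values are just a reasonable starting point for development):

```python
# settings.py -- enable Scrapy's built-in HTTP cache during development
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0      # 0 means cached responses never expire
HTTPCACHE_DIR = 'httpcache'        # stored under the project's .scrapy directory
HTTPCACHE_IGNORE_HTTP_CODES = [500, 502, 503, 504]  # don't cache server errors
```

With this in place, the first run hits the site and every subsequent run replays responses from disk, so you can iterate on your parsing logic quickly and without hammering the server.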
For catching regression bugs, I've had better luck using item pipelines (or spider middlewares) that validate items on the fly (there is only so much you can catch in early testing anyway). It's also a good idea to have some strategies for recovering from failures at runtime.
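A validation pipeline can be as small as the sketch below. The field names are the placeholders from the question; in a real project you would raise scrapy.exceptions.DropItem rather than ValueError, which is used here only to keep the sketch dependency-free:

```python
# Hypothetical field names, borrowed from the question.
REQUIRED_FIELDS = ('field1', 'field2', 'field3')

class ValidationPipeline:
    """Reject items that are missing required fields.

    In a real Scrapy project, raise scrapy.exceptions.DropItem instead of
    ValueError so the item is dropped and logged rather than crashing.
    """

    def process_item(self, item, spider):
        missing = [f for f in REQUIRED_FIELDS if not item.get(f)]
        if missing:
            raise ValueError(f"missing fields: {missing}")
        return item
```

Wire it up via the ITEM_PIPELINES setting and every scraped item gets checked on every run, which catches site changes that unit tests written against old fixtures would miss.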
And for maintaining a healthy codebase, I'd be constantly moving library-like code out from the spider itself to make it more testable.
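For instance, parsing logic can live in a plain function instead of inside a callback. The helper below is hypothetical (not from the question's code), but it shows the shape: a pure function you can unit-test without constructing a Response object:

```python
import re

def extract_price(text):
    """Pull a numeric price out of raw text, e.g. '$1,299.99' -> 1299.99.

    Pure function: no Response, no spider, trivially unit-testable.
    """
    match = re.search(r'\d+(?:\.\d+)?', text.replace(',', ''))
    return float(match.group()) if match else None
```

The spider callback then just calls this helper on the extracted string, and an ordinary unit test covers the parsing edge cases.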
Sorry if this isn't the answer you're looking for.