
Scrapy - handle exception when one of item fields is not returned

I'm trying to parse Scrapy items, each of which has several fields. Some fields cannot always be captured because the information on the site is incomplete. When just one field cannot be extracted, the whole item extraction fails with an exception (for the code below, re_first() returns None and calling split() on it raises an AttributeError). The parser then moves on to the next request without capturing the other fields that were available.

item['prodcode'] = response.xpath('//head/title').re_first(r'.....').split(" ")[1]
# Raises AttributeError when re_first() returns None; the remaining fields are then not parsed.

How should such exceptions be handled in Scrapy? I would like to retrieve every field that is available and have the unavailable ones come back blank or as N/A. I could wrap each item field in try/except, but that does not seem like the best solution. The docs mention exception handling, but I cannot find anything that covers this case.

Turo asked Oct 25 '15 16:10


1 Answer

The simplest approach is to follow the EAFP principle ("easier to ask forgiveness than permission") and handle the exception directly in the spider. For instance:

try:
    item['prodcode'] = response.xpath('//head/title').re_first(r'.....').split(" ")[1]
except AttributeError:
    item['prodcode'] = 'n/a'
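
If many fields follow the same pattern, the try/except can be factored into one small helper instead of being repeated for each field. The sketch below is only an illustration: safe_extract is a hypothetical name, not part of Scrapy, and both the tuple of caught exceptions and the 'n/a' default are assumptions to adjust to your data.

```python
def safe_extract(extract, default='n/a', exceptions=(AttributeError, IndexError)):
    """Call extract() and return its result, or default if it raises.

    safe_extract is a hypothetical helper, not a Scrapy API.
    """
    try:
        return extract()
    except exceptions:
        return default


# Stand-ins for what re_first() may return: a match, or None when nothing matched.
matched = "Widget AB123 blue"
missing = None

prodcode_ok = safe_extract(lambda: matched.split(" ")[1])
prodcode_missing = safe_extract(lambda: missing.split(" ")[1])  # AttributeError is caught
prodcode_short = safe_extract(lambda: "Widget".split(" ")[1])   # IndexError is caught
```

In the spider, the field assignment would then read item['prodcode'] = safe_extract(lambda: response.xpath('//head/title').re_first(r'.....').split(" ")[1]), and one failing field would no longer abort the rest of the item.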

A better option could be to delegate the field parsing logic to Item Loaders and their input and output processors. The spider is then responsible only for parsing the HTML and extracting the raw data, while all post-processing and prettifying is handled by an Item Loader. In other words, your spider would only contain:

loader = MyItemLoader(response=response)

# ...
loader.add_xpath("prodcode", "//head/title", re=r'.....')
# ...

loader.load_item()

And the Item Loader would have something like:

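from itemloaders.processors import MapCompose, TakeFirst  # scrapy.loader.processors in older Scrapy versions
from scrapy.loader import ItemLoader


def parse_title(title):
    # Return the second space-separated token, or 'n/a' if there is none.
    try:
        return title.split(" ")[1]
    except (AttributeError, IndexError):  # not a string, or fewer than two tokens
        return 'n/a'


class MyItemLoader(ItemLoader):
    # Collapse each field's list of values to its first non-empty value.
    default_output_processor = TakeFirst()

    # Input processor applied to every value extracted for "prodcode".
    prodcode_in = MapCompose(parse_title)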
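
One caveat worth checking: as far as I can tell, an input processor only runs on values that were actually extracted, so when the regex matches nothing the field may simply be absent from the loaded item rather than set to 'n/a'. Below is a minimal sketch of filling such gaps after loading, assuming the item can be converted to a dict. fill_missing and the field names are hypothetical and not part of Scrapy.

```python
def fill_missing(item, fields, default='n/a'):
    """Return a plain-dict copy of item with default set for absent or empty fields.

    fill_missing is a hypothetical helper, not a Scrapy API.
    """
    filled = dict(item)
    for field in fields:
        if filled.get(field) in (None, ''):
            filled[field] = default
    return filled


# Stand-in for a loaded item in which "prodcode" could not be extracted.
loaded = {'name': 'Widget', 'price': ''}
result = fill_missing(loaded, ['name', 'prodcode', 'price'])
```

This could be called on the result of loader.load_item() before the item is yielded, so every expected field is present even when its source data was missing.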
alecxe answered Nov 08 '22 21:11