With items.py defined:
import scrapy

class CraigslistSampleItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
and populating each item via the spider thus:
item = CraigslistSampleItem()
item["title"] = $someXpath.extract()
item["link"] = $someOtherXpath.extract()
When I append these items to a list (returned by parse()) and store the result as e.g. a CSV, I get two columns of data, title and link, as expected. If I comment out the XPath for link and store as a CSV, I still get two columns of data, with the values in the link column being empty strings. This seems reasonable, as both title and link are fields declared on the CraigslistSampleItem class. I would think, then, that I could do something like this (with the XPath for link still commented out):
if item["link"] == '':
print "link has not been given a value"
Yet the attempt to read the link key on each item fails with a KeyError:
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scrapy/item.py", line 50, in __getitem__
return self._values[key]
exceptions.KeyError: 'link'
If each item instance does indeed have a value for link (albeit an empty string), why can't I access this key?
The Scrapy Item class provides a dictionary-like interface for storing the extracted data. There are no default values set for item fields.
To check whether the field was set or not, simply check for the field key in the item instance:
if 'link' not in item:
    print "link has not been given a value"
Demo of the membership check:
In [1]: import scrapy
In [2]: class CraigslistSampleItem(scrapy.Item):
   ...:     title = scrapy.Field()
   ...:     link = scrapy.Field()
   ...:
In [3]: item = CraigslistSampleItem()
In [4]: item["title"] = "test"
In [5]: item
Out[5]: {'title': 'test'}
In [6]: "link" in item
Out[6]: False
In [7]: item["link"] = "test link"
In [8]: "link" in item
Out[8]: True
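If the goal is for every exported item to carry a link key even when its XPath is commented out, one option is to assign an explicit default in the spider before yielding the item. A minimal sketch, assuming a callback shaped like the one in the question (someXpath stands in for the real selector):

item = CraigslistSampleItem()
item["title"] = someXpath.extract()
# Fill in "" only if 'link' was never set, so the key always exists:
item.setdefault("link", "")
if item["link"] == "":
    print "link has not been given a value"  # safe now: the key exists

With the default in place, the equality test from the question works as originally expected, and the CSV output is unchanged.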