
Scrapy Spider returns None instead of Item

I've found the answer; see below. In short, wrong indentation in the ItemPipeline caused it to return None.

I've been trying to write a CrawlSpider in Scrapy, having never worked with Python before. The spider crawls, calls the callback function, extracts the data and fills the item, but it always returns None. I've tested it with a print article call, and everything was in order. I have tried this with both yield and return (though I still don't understand the difference). Frankly, I'm out of ideas. Below is the callback function. Edit: added the spider code as well.

class ZeitSpider(CrawlSpider):
    name = xxxx
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/%d/%d' % (JAHR, 39)]
    rules = (Rule(SgmlLinkExtractor(restrict_xpaths=('//ul[@class="teaserlist"]/li[@class="archiveteaser"]/h4[@class="title"]')), callback='parse_url', follow=True),)


    def parse_url(self,response):
        hxs = HtmlXPathSelector(response)

        article = Article()

        article['url']= response.url.encode('UTF-8',errors='strict')

        article['author']= hxs.select('//div[@id="informatives"]/ul[@class="tools"]/li[@class="author first"]/text()').extract().pop().encode('UTF-8',errors='strict')
        article['title']= hxs.select('//div[@class="articleheader"]/h1/span[@class="title"]/text()').extract().pop().encode('UTF-8',errors='strict')

        article['text']= hxs.select('//div[@id="main"]/p/text()').extract().pop().encode('UTF-8',errors='strict')

        article['excerpt'] = hxs.select('//p[@class="excerpt"]/text()').extract().pop().encode('UTF-8',errors='strict')
        yield article

And the item definition:

class Article(Item):
    url=Field()
    author=Field()
    text=Field()
    title=Field()
    excerpt=Field()
thegermanpole asked Oct 22 '22 07:10


1 Answer

Ok, after stepping through the program with pdb I found the error:

Because I have multiple spiders, I wanted to write multiple ItemPipelines. To make each pipeline apply only to its own spider, I added a check like this:

if spider.name == 'SpiderName':
    return item

Notice the indentation. With return item inside the if block, the pipeline returned nothing whenever the spider name did not match, and so the output became None.
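To make the difference concrete, here is a minimal sketch of the two variants. It assumes only that a pipeline is a plain class with a process_item(self, item, spider) method; the class names BuggyPipeline, FixedPipeline and FakeSpider are hypothetical stand-ins and not my actual code, and the item is a plain dict so the snippet runs without Scrapy installed.

```python
class BuggyPipeline(object):
    """Mirrors the mistake: `return item` sits inside the `if` block."""

    def process_item(self, item, spider):
        if spider.name == 'SpiderName':
            # Spider-specific processing would go here.
            return item
        # Falling off the end of the method returns None for other spiders.


class FixedPipeline(object):
    """`return item` is dedented, so every item is passed along."""

    def process_item(self, item, spider):
        if spider.name == 'SpiderName':
            # Spider-specific processing would go here.
            pass
        return item


class FakeSpider(object):
    """Hypothetical stand-in for a spider; only `name` matters here."""

    def __init__(self, name):
        self.name = name


item = {'url': 'http://www.example.com/'}
other = FakeSpider('OtherSpider')

buggy_result = BuggyPipeline().process_item(item, other)
fixed_result = FixedPipeline().process_item(item, other)
print(buggy_result)
print(fixed_result)
```

In the buggy variant, an item coming from any spider other than 'SpiderName' leaves the pipeline as None; in the fixed variant the same item comes out unchanged.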

After changing the indentation, the spider worked flawlessly. Another example of PEBCAC.

thegermanpole answered Nov 02 '22 23:11