(Python 3) Spider must return Request, BaseItem, dict or None, got 'generator'

I am working on a Scrapy script to pull the most recent blog posts from Paul Krugman's NYT blog. The project is proceeding nicely, but when I reach the stage where I actually extract the data, I keep getting the same error:

ERROR: Spider must return Request, BaseItem, dict or None, got 'generator' in <GET https://krugman.blogs.nytimes.com/more_posts_jsons/page/1/?homepage=1&apagenum=1>

The code I am working with is as follows:

import json

from scrapy import http
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider
import scrapy
from tutorial.items import BlogPost


class krugSpider(CrawlSpider):
    name = 'krugbot'
    start_urls = ['https://krugman.blogs.nytimes.com']

    def __init__(self):
        self.url = 'https://krugman.blogs.nytimes.com/more_posts_jsons/page/{0}/?homepage=1&apagenum={0}'

    def start_requests(self):
        yield http.Request(self.url.format('1'), callback = self.parse_page)

    def parse_page(self, response):
        data = json.loads(response.body)
        for block in range(len(data['posts'])):
            yield self.parse_block(data['posts'][block])

        page = data['args']['paged'] + 1
        url = self.url.format(str(page))
        yield http.Request(url, callback = self.parse_page)


    def parse_block(self, block):
        for content in block:
            article = BlogPost(author = 'Paul Krugman', source = 'Blog')

            paragraphs = Selector(text = content['html'])

            article['paragraphs']= paragraphs.xpath('article/p').extract()
            article['datetime'] = content['post_date']
            article['post_id'] = content['post_id']
            article['url'] = content['permalink']
            article['title'] = content['headline']

            yield article

For reference, the items.py file is:

from scrapy import Item, Field

class BlogPost(Item):
    author = Field()
    source = Field()
    datetime = Field()
    url = Field()
    post_id = Field()
    title = Field()
    paragraph = Field()

The program should return Scrapy Item objects, not generators, so I'm unsure why it is returning a generator. Any advice?

Josh Kraushaar asked Sep 11 '17 17:09



2 Answers

Instead of iterating over self.parse_block(data['posts'][block]) and yielding each item, as in the accepted answer, I believe you can also use yield from (available since Python 3.3), as in:

yield from self.parse_block(data['posts'][block])
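
To illustrate that the two forms behave the same, here is a minimal sketch in plain Python with no Scrapy involved. The function names and the shape of posts (a list of lists of dicts) are assumptions made for the demonstration, not the real structure of the NYT JSON:

```python
# Stand-in for the spider's parse_block; the data shape is hypothetical.
def parse_block(block):
    # A generator function: yields one dict per entry in the block.
    for content in block:
        yield {"title": content["headline"]}


def parse_page_loop(posts):
    # Explicit nested loop, as in the accepted answer.
    for block in posts:
        for article in parse_block(block):
            yield article


def parse_page_yield_from(posts):
    # Delegation with "yield from": re-yields every item of the sub-generator.
    for block in posts:
        yield from parse_block(block)


posts = [[{"headline": "a"}, {"headline": "b"}], [{"headline": "c"}]]
loop_items = list(parse_page_loop(posts))
delegated_items = list(parse_page_yield_from(posts))
print(loop_items == delegated_items)
```

Both versions hand the caller the individual items, never the inner generator object.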
Dustin Michels answered Oct 23 '22 09:10



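This is because you are yielding a generator inside parse_page. Look at this line:

yield self.parse_block(data['posts'][block])

It yields the return value of parse_block, and because parse_block itself contains yield, calling it returns a generator (which in turn yields multiple objects). Scrapy therefore receives the generator itself rather than the items it produces.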

It should work if you change it to:

for block in range(len(data['posts'])):
    for article in self.parse_block(data['posts'][block]):
        yield article
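
The difference can be checked outside Scrapy by looking at the type of what each version yields. This is only a sketch: buggy_parse_page, fixed_parse_page, and the sample posts data are hypothetical stand-ins for the spider's methods and the JSON response:

```python
# Stand-in for parse_block; the input shape is assumed for illustration.
def parse_block(block):
    for content in block:
        yield {"post_id": content["post_id"]}


def buggy_parse_page(posts):
    for block in posts:
        # Yields the generator object itself, which Scrapy rejects.
        yield parse_block(block)


def fixed_parse_page(posts):
    for block in posts:
        # Iterates the generator and yields each item it produces.
        for article in parse_block(block):
            yield article


posts = [[{"post_id": 1}], [{"post_id": 2}]]
buggy_types = [type(x).__name__ for x in buggy_parse_page(posts)]
fixed_types = [type(x).__name__ for x in fixed_parse_page(posts)]
print(buggy_types)
print(fixed_types)
```

The first list contains only the type name generator, which matches the type named in the error message; the second contains dict.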
eLRuLL answered Oct 23 '22 10:10
