
Scrapy run from script not working

Tags: python, scrapy

I am trying to run a Scrapy spider that works perfectly with scrapy crawl single, but I cannot get it to run from inside a Python script.

I am aware that the docs explain how to do this (https://scrapy.readthedocs.org/en/0.18/topics/practices.html), and I have also read this already-answered question (How to run Scrapy from within a Python script), but I cannot make it work.

The main problem is that the SingleBlogSpider.parse method is never executed, while start_requests is.

Here is the code and the output from running that script. I also tried moving the execution code to a separate file, but the same thing happens.

from urlparse import urlparse
from scrapy.http import Request
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class SingleBlogSpider(BaseSpider):
    name = 'single'

    def __init__(self, **kwargs):
        super(SingleBlogSpider, self).__init__(**kwargs)

        url = kwargs.get('url') or kwargs.get('domain') or 'seaofshoes.com'
        if not url.startswith('http://') and not url.startswith('https://'):
            url = 'http://%s/' % url

        self.url = url
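        # Drop only a leading 'www.' prefix. lstrip('www.') would strip any
        # leading 'w' or '.' characters, not just the prefix.
        hostname = urlparse(url).hostname
        if hostname.startswith('www.'):
            hostname = hostname[len('www.'):]
        self.allowed_domains = [hostname]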
        self.link_extractor = SgmlLinkExtractor()
        self.cookies_seen = set()

        print 0, self.url

    def start_requests(self):
        print '1', self.url
        return [Request(self.url, callback=self.parse)]

    def parse(self, response):
        print '2'
        # Actual scraper code, that is never executed

if __name__ == '__main__':
    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy.settings import Settings
    from scrapy import log, signals

    spider = SingleBlogSpider(domain='scrapinghub.com')

    crawler = Crawler(Settings())
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

    log.start()
    reactor.run()

Output:

 0 http://scrapinghub.com/
 1 http://scrapinghub.com/
 2013-09-13 14:21:46-0500 [single] INFO: Closing spider (finished)
 2013-09-13 14:21:46-0500 [single] INFO: Dumping Scrapy stats:
     {'downloader/request_bytes': 221,
      'downloader/request_count': 1,
      'downloader/request_method_count/GET': 1,
      'downloader/response_bytes': 9403,
      'downloader/response_count': 1,
      'downloader/response_status_count/200': 1,
      'finish_reason': 'finished',
      'finish_time': datetime.datetime(2013, 9, 13, 19, 21, 46, 563184),
      'response_received_count': 1,
      'scheduler/dequeued': 1,
      'scheduler/dequeued/memory': 1,
      'scheduler/enqueued': 1,
      'scheduler/enqueued/memory': 1,
      'start_time': datetime.datetime(2013, 9, 13, 19, 21, 46, 328961)}
 2013-09-13 14:21:46-0500 [single] INFO: Spider closed (finished)

The program never seems to reach SingleBlogSpider.parse and print '2', so it doesn't crawl anything. But as you can see in the output, it does make a request, so I am not sure what is going on.

Scrapy version == 0.18.2

I really cannot spot the mistake, and any help is appreciated.

Thanks!

danielfrg asked Oct 03 '22 23:10


2 Answers

parse() is actually being executed; the output of print just doesn't show up.

To test this, put a = b in parse():

def parse(self, response):
    a = b

You'll then see exceptions.NameError: global name 'b' is not defined, which proves the method was called.
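The same check can be sketched without Scrapy. The snippet below is only an illustration under assumptions: fake_crawl and parse are hypothetical stand-ins, and contextlib.redirect_stdout simulates a framework that captures stdout; it is not how Scrapy itself is implemented. It shows that a callback can run while its print output never reaches the terminal, and that an exception or a recorded side effect still proves the call happened. It is written for Python 3, unlike the Python 2 code in the question.

```python
import contextlib
import io

calls = []  # side channel that does not depend on stdout


def parse(response):
    print('2')              # swallowed when stdout is captured
    calls.append(response)  # still recorded
    a = b                   # deliberate NameError, as in the test above


def fake_crawl(callback):
    """Hypothetical stand-in for a framework that captures stdout."""
    captured = io.StringIO()
    try:
        with contextlib.redirect_stdout(captured):
            callback('response')
    except NameError as exc:
        # The exception escapes the capture and proves the callback ran.
        return captured.getvalue(), str(exc)
    return captured.getvalue(), None


swallowed, error = fake_crawl(parse)
```

Here swallowed holds the text that print wrote, calls shows that parse was invoked, and error carries the NameError message, so the callback's execution is visible even though nothing appeared on the terminal.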

alecxe answered Oct 11 '22 13:10


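I believe that when you say you "can't get it working from a script", you actually mean that you "can't get the crawler to generate the output files". This was a bug in the documentation's code example: it passes a default Settings() object to Crawler. Use get_project_settings() instead, so that the project's settings are loaded. Change your code to this: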

if __name__ == '__main__':
    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy import log, signals
    from scrapy.utils.project import get_project_settings

    spider = SingleBlogSpider(domain='scrapinghub.com')
    settings = get_project_settings()
    crawler = Crawler(settings)
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()

    log.start()
    reactor.run()

For further reading, take a look at this answer.
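For readers on a newer Scrapy release, the same idea (load the project settings before crawling) is usually written with CrawlerProcess. This is a sketch that assumes Scrapy 1.0 or later, not the 0.18.2 used in the question, and that the script runs inside a Scrapy project so get_project_settings() can find the project's settings. run_spider is a hypothetical helper name.

```python
def run_spider(spider_cls, **spider_kwargs):
    """Run one spider with the project's settings; block until it finishes."""
    # Imported lazily, like the imports under __main__ in the script above,
    # so Scrapy is only required once a crawl is actually started.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    process = CrawlerProcess(get_project_settings())
    # Pass the spider class, not an instance; the keyword arguments are
    # forwarded to the spider's __init__.
    process.crawl(spider_cls, **spider_kwargs)
    process.start()  # starts the Twisted reactor and returns when the crawl ends
```

With the spider from the question, the call under if __name__ == '__main__': would be run_spider(SingleBlogSpider, domain='scrapinghub.com'). CrawlerProcess starts and stops the reactor itself, so the explicit reactor.run() and the spider_closed signal hookup are not needed in this form.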

Medeiros answered Oct 11 '22 13:10