I'm running scrapy from a script but all it does is activate the spider. It doesn't go through my item pipeline. I've read http://scrapy.readthedocs.org/en/latest/topics/practices.html but it doesn't say anything about including pipelines.
My setup:
Scraper/
    scrapy.cfg
    ScrapyScript.py
    Scraper/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            my_spider.py
My script:
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from Scraper.spiders.my_spider import MySpiderSpider
spider = MySpiderSpider(domain='myDomain.com')
settings = get_project_settings
crawler = Crawler(Settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
log.msg('Reactor activated...')
reactor.run()
log.msg('Reactor stopped.')
My pipeline:
from scrapy.exceptions import DropItem
from scrapy import log
import sqlite3


class ImageCheckPipeline(object):

    def process_item(self, item, spider):
        if item['image']:
            log.msg("Item added successfully.")
            return item
        else:
            del item
            raise DropItem("Non-image thumbnail found: ")


class StoreImage(object):

    def __init__(self):
        self.db = sqlite3.connect('images')
        self.cursor = self.db.cursor()
        try:
            self.cursor.execute('''
                CREATE TABLE IMAGES(IMAGE BLOB, TITLE TEXT, URL TEXT)
            ''')
            self.db.commit()
        except sqlite3.OperationalError:
            self.cursor.execute('''
                DELETE FROM IMAGES
            ''')
            self.db.commit()

    def process_item(self, item, spider):
        title = item['title'][0]
        image = item['image'][0]
        url = item['url'][0]
        self.cursor.execute('''
            INSERT INTO IMAGES VALUES (?, ?, ?)
        ''', (image, title, url))
        self.db.commit()
Output of the script:
[name@localhost Scraper]$ python ScrapyScript.py
2014-08-06 17:55:22-0400 [scrapy] INFO: Reactor activated...
2014-08-06 17:55:22-0400 [my_spider] INFO: Closing spider (finished)
2014-08-06 17:55:22-0400 [my_spider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 213,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 18852,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 8, 6, 21, 55, 22, 518492),
'item_scraped_count': 51,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2014, 8, 6, 21, 55, 22, 363898)}
2014-08-06 17:55:22-0400 [my_spider] INFO: Spider closed (finished)
2014-08-06 17:55:22-0400 [scrapy] INFO: Reactor stopped.
[name@localhost Scraper]$
The key to running Scrapy from a Python script is the CrawlerProcess class, which lives in the scrapy.crawler module. It provides the engine that runs Scrapy inside a script and manages Python's Twisted framework internally.
CrawlerProcess can also run multiple Scrapy spiders in the same process simultaneously. You create an instance of CrawlerProcess with the project settings; if a spider needs its own custom settings, you create a Crawler instance for that spider.
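A minimal sketch of that pattern, assuming a more recent Scrapy release (1.x or later) where CrawlerProcess accepts the project settings directly and drives the reactor itself; MySpiderSpider and the domain argument are taken from the question:

    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    from Scraper.spiders.my_spider import MySpiderSpider

    # CrawlerProcess loads the project settings (including ITEM_PIPELINES)
    # and starts/stops the Twisted reactor for you, so no reactor.run() is needed.
    process = CrawlerProcess(get_project_settings())
    process.crawl(MySpiderSpider, domain='myDomain.com')
    process.start()  # blocks here until every scheduled crawl has finished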
In recent versions of Scrapy you can also raise a CloseSpider exception to close a spider manually. This forces the spider to stop, but not immediately: requests that are already in flight may still run to completion.
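For illustration, a hypothetical spider callback that raises the exception (the spider name and stop condition are made up, not part of the question):

    import scrapy
    from scrapy.exceptions import CloseSpider

    class StopEarlySpider(scrapy.Spider):
        name = 'stop_early'
        start_urls = ['http://myDomain.com/']

        def parse(self, response):
            # Hypothetical stop condition: give up if the page carries no images.
            if not response.css('img'):
                raise CloseSpider('no images found on the start page')
            for src in response.css('img::attr(src)').extract():
                yield {'image': [src]}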
Each item pipeline component (often referred to simply as an "item pipeline") is a Python class that implements a single method, process_item(). It receives an item, performs an action on it, and decides whether the item continues through the pipeline or is dropped and no longer processed.
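A minimal sketch of such a component, modeled on the ImageCheckPipeline from the question, together with the ITEM_PIPELINES entry that has to exist in settings.py for any pipeline to run at all (the class paths assume the Scraper project layout shown above):

    from scrapy.exceptions import DropItem

    class ImageCheckPipeline(object):
        def process_item(self, item, spider):
            # Keep items that actually carry an image, drop everything else.
            if item.get('image'):
                return item  # hand the item on to the next pipeline component
            raise DropItem("Non-image thumbnail found")

    # settings.py -- pipelines only run when they are enabled here
    ITEM_PIPELINES = {
        'Scraper.pipelines.ImageCheckPipeline': 100,
        'Scraper.pipelines.StoreImage': 200,
    }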
You need to actually call get_project_settings; the Settings object you are passing to the crawler in your posted code gives you the defaults, not your project settings, so ITEM_PIPELINES is never loaded and your pipelines never run. You need to write something like this:
from scrapy.utils.project import get_project_settings
settings = get_project_settings()
crawler = Crawler(settings)
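Put together, a corrected version of the posted script might look like this (a sketch that keeps the same Scrapy 0.24-era API the question uses; only the settings handling changes):

    from twisted.internet import reactor
    from scrapy.crawler import Crawler
    from scrapy import log, signals
    from scrapy.utils.project import get_project_settings

    from Scraper.spiders.my_spider import MySpiderSpider

    spider = MySpiderSpider(domain='myDomain.com')
    settings = get_project_settings()  # note the call -- this reads Scraper/settings.py
    crawler = Crawler(settings)        # pass the project settings, not a bare Settings()
    crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(spider)
    crawler.start()
    log.start()
    log.msg('Reactor activated...')
    reactor.run()
    log.msg('Reactor stopped.')

With the project settings loaded, ITEM_PIPELINES from settings.py is honored, so ImageCheckPipeline and StoreImage will now process the scraped items.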