Getting Scrapy project settings when the script is outside of the root directory

I have made a Scrapy spider that runs successfully from a script located in the root directory of the project. Because I need to run multiple spiders from different projects from the same script (a Django app will call the script at the user's request), I moved the script from the root of one of the projects to the parent directory. For some reason, the script can no longer load the project's custom settings, so the scraped results are not passed through the pipeline into the database tables. Here is the code from the Scrapy docs that I'm using to run the spider from a script:

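from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def spiderCrawl():
    # Load the project settings, then override the user agent for this run
    settings = get_project_settings()
    settings.set('USER_AGENT', 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')
    process = CrawlerProcess(settings)
    process.crawl(MySpider3)
    process.start()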

Is there some extra module that needs to be imported in order to get the project settings from outside of the project? Or do some additions need to be made to this code? The code for the script running the spiders is below. Thanks.

from ticket_city_scraper.ticket_city_scraper import *
from ticket_city_scraper.ticket_city_scraper.spiders import tc_spider
from vividseats_scraper.vividseats_scraper import *
from vividseats_scraper.vividseats_scraper.spiders import vs_spider 

tc_spider.spiderCrawl()
vs_spider.spiderCrawl()

asked Jul 27 '15 20:07

loremIpsum1771

1 Answer

Thanks to some of the answers already provided here, I realised scrapy wasn't actually importing the settings.py file. This is how I fixed it.

TLDR: Make sure you set the 'SCRAPY_SETTINGS_MODULE' environment variable to the dotted import path of your actual settings module (settings.py). I'm doing this in the __init__() method of Scraper.

Consider a project with the following structure.

my_project/
    main.py                 # Where we are running scrapy from
    scraper/
        run_scraper.py               # Call from main goes here
        scrapy.cfg                   # deploy configuration file
        scraper/                     # project's Python module, you'll import your code from here
            __init__.py
            items.py                 # project items definition file
            pipelines.py             # project pipelines file
            settings.py              # project settings file
            spiders/                 # a directory where you'll later put your spiders
                __init__.py
                quotes_spider.py     # Contains the QuotesSpider class

Basically, the command scrapy startproject scraper was executed in the my_project folder. I then added a run_scraper.py file to the outer scraper folder, a main.py file to my root folder, and quotes_spider.py to the spiders folder.

My main file:

from scraper.run_scraper import Scraper
scraper = Scraper()
scraper.run_spiders()

My run_scraper.py file:

from scraper.scraper.spiders.quotes_spider import QuotesSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
import os


class Scraper:
    def __init__(self):
        # Dotted import path of settings.py as seen from the root, i.e. from main.py
        settings_file_path = 'scraper.scraper.settings'
        # setdefault() keeps any value that is already set in the environment
        os.environ.setdefault('SCRAPY_SETTINGS_MODULE', settings_file_path)
        self.process = CrawlerProcess(get_project_settings())
        self.spider = QuotesSpider  # The spider you want to crawl

    def run_spiders(self):
        self.process.crawl(self.spider)
        self.process.start()  # the script will block here until the crawling is finished
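
If the settings still do not load, it can help to confirm that the dotted path is importable from the directory you launch main.py from before handing it to Scrapy. Below is a minimal sketch using only the standard library. The helper names settings_module_is_importable and point_scrapy_at are my own for illustration and are not part of Scrapy. Note that it assigns the variable directly, because os.environ.setdefault() leaves an already-set value untouched.

```python
import importlib.util
import os


def settings_module_is_importable(dotted_path):
    """Return True if `dotted_path` can be located on the current sys.path."""
    try:
        return importlib.util.find_spec(dotted_path) is not None
    except ModuleNotFoundError:
        # A parent package (e.g. 'scraper') could not be imported at all.
        return False


def point_scrapy_at(dotted_path):
    """Set SCRAPY_SETTINGS_MODULE, refusing paths that cannot be located."""
    if not settings_module_is_importable(dotted_path):
        raise ImportError(f"Cannot locate settings module {dotted_path!r}")
    # Plain assignment overrides any earlier value, unlike setdefault().
    os.environ["SCRAPY_SETTINGS_MODULE"] = dotted_path
```

With the layout above, you would call point_scrapy_at('scraper.scraper.settings') before get_project_settings(). If it raises ImportError, the script is probably being run from a directory other than my_project.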

Also note that the settings file itself may need a look-over, since every module path in it must now be written relative to the root folder (my_project, not scraper). So in my case:

SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'

Repeat this for every setting you have that holds a module path (ITEM_PIPELINES, for example)!
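
As a rough illustration, a settings.py adjusted this way might look like the fragment below. The pipeline class name ScraperPipeline is an assumption on my part, so substitute whatever class your pipelines.py actually defines.

```python
# settings.py, with every module path written relative to my_project
# (the directory main.py is run from), not the inner scraper project.
BOT_NAME = 'scraper'

SPIDER_MODULES = ['scraper.scraper.spiders']
NEWSPIDER_MODULE = 'scraper.scraper.spiders'

ITEM_PIPELINES = {
    # Hypothetical class name; use the one defined in your pipelines.py.
    'scraper.scraper.pipelines.ScraperPipeline': 300,
}
```

The pipeline path matters for the original question: if it still uses the old prefix, Scrapy cannot import the pipeline class and nothing is written to the database.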

answered Sep 22 '22 09:09

malla

Tags: python, django, web-scraping, scrapy
