 

Scrapy - How can I load the project-level settings.py when using a script to start the spider

I am trying to implement a Scrapy spider that is started from a script, as in the code below.

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy_app.scrapy_app.spiders.generic import GenericSpider
....

class MyProcess(object):

    def start_my_process(self, _config, _req_obj, site_urls):
        runner = CrawlerRunner()
        runner.crawl(GenericSpider,
                     config=_config,
                     reqObj=_req_obj,
                     urls=site_urls)
        deferred = runner.join()
        deferred.addBoth(lambda _: reactor.stop())
        reactor.run()

    ....

So, when using a CrawlerRunner like this, the project-level settings.py configuration is not picked up while the spider executes. The GenericSpider accepts three parameters, one of which is the list of start URLs.

How can I load settings.py into the CrawlerRunner process, other than by setting custom_settings inside the spider?

asked Oct 31 '18 by sadiqmc

1 Answer

I am going to try to answer this as best I can. My situation is not 100% identical to yours, but I was having similar issues.

The typical scrapy project structure looks like this:

scrapy.cfg
myproject/
    __init__.py
    items.py
    middlewares.py
    pipelines.py
    settings.py
    spiders/
        __init__.py
        spider1.py
        spider2.py
        ...

The directory containing the scrapy.cfg file is considered the root directory of the project.

In that file you will see something like this:

[settings]
default = your_project.settings

[deploy]
...

When you run a main script that starts a spider with a specific set of settings, that main.py script should sit in the same directory as the scrapy.cfg file.

From main.py, your code needs to create a CrawlerProcess or CrawlerRunner instance to run a spider; either one can be instantiated with a settings object or a dict, like so:

from scrapy.crawler import CrawlerProcess

process = CrawlerProcess(settings={
    'FEED_FORMAT': 'json',
    'FEED_URI': 'items.json'
})

or, pulling in the project-level settings.py:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

The dict approach works but is cumbersome, so the get_project_settings() call is probably of greater interest, and I will expand on it.
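Applied to the CrawlerRunner setup from the question, that second form would look roughly like the sketch below. This is only a sketch under the question's assumptions (the main script sits next to scrapy.cfg, and GenericSpider is importable from the path shown in the question):

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.project import get_project_settings
from scrapy_app.scrapy_app.spiders.generic import GenericSpider

def start_my_process(_config, _req_obj, site_urls):
    # Hand the project-level settings.py to the runner instead of an empty Settings object
    runner = CrawlerRunner(get_project_settings())
    runner.crawl(GenericSpider,
                 config=_config,
                 reqObj=_req_obj,
                 urls=site_urls)
    deferred = runner.join()
    deferred.addBoth(lambda _: reactor.stop())
    reactor.run()

The only change from the question's code is the argument passed to CrawlerRunner(); everything else stays the same.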

I had a large scrapy project that contained multiple spiders that shared a large number of similar settings. So I had a global_settings.py file and then specific settings contained within each spider. Because of the large number of shared settings I liked the idea of keeping everything neat and tidy in one file and not copying and pasting code.

The easiest way I have found after a lot of research is to instantiate the CrawlerProcess/Runner object with the get_project_settings() function. The catch is that get_project_settings uses the default value under [settings] in scrapy.cfg to find project-specific settings.

So it's important to make sure that, for your project, the default value in scrapy.cfg points to your desired settings file when you call get_project_settings().
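As a quick sanity check, you can print the environment variable and a value from the loaded settings to confirm the right module was picked up. This is just a rough sketch; BOT_NAME is only an example of a key you would normally have in settings.py:

import os
from scrapy.utils.project import get_project_settings

settings = get_project_settings()

# Which settings module scrapy.cfg resolved to, e.g. 'your_project.settings'
print(os.environ.get('SCRAPY_SETTINGS_MODULE'))

# A value defined in that settings.py
print(settings.get('BOT_NAME'))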

I'll also add that if you have multiple settings files for multiple Scrapy projects and you want to share the root directory, you can add those to scrapy.cfg as well, like so:

[settings]
default = your_project.settings
project1 = myproject1.settings
project2 = myproject2.settings

Adding all these settings modules to the root directory's config file gives you the option of switching between settings at will in your scripts.
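For example, here is a rough sketch of picking one of those entries up front, before the first get_project_settings() call, by setting the SCRAPY_PROJECT environment variable ('project2' is assumed to be a key in the [settings] section above):

import os
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Choose which scrapy.cfg entry to load before settings are resolved
os.environ['SCRAPY_PROJECT'] = 'project2'

# get_project_settings() now resolves to myproject2.settings
process = CrawlerProcess(get_project_settings())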

As I said before, the out-of-the-box call to get_project_settings() will load the settings file named by the default value in scrapy.cfg (your_project.settings in the example above). However, if you want to change the settings used for the next spider run in the same process, you can modify which settings are loaded for the spider about to be started.

This is slightly tricky and "hacky", but it has worked for me...

After the first call to get_project_settings(), an environment variable called SCRAPY_SETTINGS_MODULE will be set. Its value will be whatever the default value was in the scrapy.cfg file. To alter the settings used for subsequent spiders run in the same process instance (CrawlerRunner/Process --> process.crawl('next_spider_to_start')), this variable needs to be manipulated.

This is what should be done to set a new settings module on a process instance that was previously instantiated with get_project_settings():

import os
from scrapy.utils.project import get_project_settings

# Clear the old settings module
del os.environ['SCRAPY_SETTINGS_MODULE']

# Point at the new set of settings; this should be a key in your scrapy.cfg [settings] section
os.environ['SCRAPY_PROJECT'] = 'project2'

# Call get_project_settings again and assign the result to the process object
process.settings = get_project_settings()

# Run the next crawler with the updated settings module
process.crawl('next_spider_to_start')

get_project_settings() has now updated the settings on the current crawler process instance (still running in the same Twisted reactor) to myproject2.settings.
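To tie it together, here is a rough sketch of a main script that runs two spiders with different settings modules in the same process. The spider names and the 'project2' key are just placeholders matching the scrapy.cfg example above:

import os
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# First crawl uses the 'default' entry from scrapy.cfg
process = CrawlerProcess(get_project_settings())
process.crawl('spider_from_default_project')   # placeholder spider name

# Swap to the 'project2' settings module before scheduling the next spider
del os.environ['SCRAPY_SETTINGS_MODULE']
os.environ['SCRAPY_PROJECT'] = 'project2'
process.settings = get_project_settings()
process.crawl('spider_from_project2')          # placeholder spider name

# Start the reactor; both crawls run in the same process
process.start()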

This can all be done from a main script to manipulate spiders and the settings for them. Like I said previously though, I found it easier to just have a global settings file with all the commonalities, and then spider specific settings set in the spiders themselves. This is usually much clearer.

Scrapy docs are kinda rough, hope this helps someone...

answered Nov 15 '22 by Fury