
How to setup and launch a Scrapy spider programmatically (urls and settings)


I've written a working crawler using scrapy; now I want to control it through a Django webapp. That is to say, I want to:

  • Set 1 or several start_urls
  • Set 1 or several allowed_domains
  • Set settings values
  • Start the spider
  • Stop / pause / resume a spider
  • Retrieve some stats while the spider is running
  • Retrieve some stats after the spider is complete

At first I thought scrapyd was made for this, but after reading the docs, it seems to be more of a daemon able to manage 'packaged spiders', aka 'scrapy eggs', and that all the settings (start_urls, allowed_domains, settings) must still be hardcoded in the 'scrapy egg' itself; so it doesn't look like a solution to my question, unless I missed something.

I also looked at this question: How to give URL to scrapy for crawling? But the best answer for providing multiple urls is qualified by the author himself as an 'ugly hack', involving a python subprocess and complex shell handling, so I don't think the solution is to be found there. Also, it may work for start_urls, but it doesn't seem to allow allowed_domains or settings.

Then I gave a look to scrapy webservices: it seems to be a good solution for retrieving stats. However, it still requires a running spider, and gives no clue about changing settings.

There are several questions on this subject, and none of them seems satisfactory:

  • using-one-scrapy-spider-for-several-websites: this one seems outdated, as scrapy has evolved a lot since 0.7
  • creating-a-generic-scrapy-spider: no accepted answer, still talking about tweaking shell parameters

I know that scrapy is used in production environments; and a tool like scrapyd shows that there are definitely some ways to handle these requirements (I can't imagine that the scrapy eggs scrapyd deals with are generated by hand!)

Thanks a lot for your help.

asked Oct 21 '12 by arno



2 Answers

At first I thought scrapyd was made for this, but after reading the docs, it seems to be more of a daemon able to manage 'packaged spiders', aka 'scrapy eggs', and that all the settings (start_urls, allowed_domains, settings) must still be hardcoded in the 'scrapy egg' itself; so it doesn't look like a solution to my question, unless I missed something.

I don't agree with the above statement: start_urls need not be hard-coded; they can be dynamically passed to the class. You should be able to pass them as an argument like this:

curl http://localhost:6800/schedule.json -d project=myproject -d spider=somespider -d setting=DOWNLOAD_DELAY=2 -d arg1=val1
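
For reference, scrapyd passes any extra -d parameter (like arg1 above) through as a spider argument, and Scrapy in turn hands spider arguments to the spider's __init__ as keyword arguments. A minimal sketch of the receiving side (the spider class and attribute here are just illustrations of the mechanism):

from scrapy.spider import BaseSpider

class SomeSpider(BaseSpider):
    name = 'somespider'

    def __init__(self, arg1=None, *args, **kwargs):
        super(SomeSpider, self).__init__(*args, **kwargs)
        # With the schedule.json call above, arg1 == 'val1'
        self.arg1 = arg1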

Or you could retrieve the URLs from a database or a file. I get them from a database like this:

import urllib

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from myproject.items import MovieItem  # assumed import path; MovieItem is project-specific


class WikipediaSpider(BaseSpider):
    name = 'wikipedia'
    allowed_domains = ['wikipedia.com']
    start_urls = []

    def __init__(self, name=None, url=None, **kwargs):
        item = MovieItem()
        item['spider'] = self.name
        # You can pass a specific url to retrieve
        if url:
            if name is not None:
                self.name = name
            elif not getattr(self, 'name', None):
                raise ValueError("%s must have a name" % type(self).__name__)
            self.__dict__.update(kwargs)
            self.start_urls = [url]
        else:
            # If there is no specific URL, get them from the database
            wikiliks = # < -- CODE TO RETRIEVE THE LINKS FROM DB -->
            if wikiliks is None:
                print "**************************************"
                print "No Links to Query"
                print "**************************************"
                return None

            for link in wikiliks:
                # SOME PROCESSING ON THE LINK GOES HERE
                self.start_urls.append(urllib.unquote_plus(link[0]))

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Remaining parse code goes here
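
A spider written this way can also be launched straight from the command line, e.g. scrapy crawl wikipedia -a url=http://en.wikipedia.org/wiki/Main_Page (the URL is just a placeholder), since the -a flag passes spider arguments through to __init__ exactly like scrapyd's extra -d parameters do.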
answered Sep 28 '22 by Mridul Augustine


For changing settings programmatically and running the scraper from within an app, here's what I got:

import os

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders import MySpider

os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.my_settings_module'
scrapy_settings = get_project_settings()
scrapy_settings.set('CUSTOM_PARAM', custom_value)
scrapy_settings.set('ITEM_PIPELINES', {})  # don't write jsons or anything like that
scrapy_settings.set('DOWNLOADER_MIDDLEWARES', {
    'myproject.middlewares.SomeMiddleware': 100,
})
process = CrawlerProcess(scrapy_settings)
process.crawl(MySpider, start_urls=start_urls)
process.start()
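
To cover the remaining points in the question (stats and pause/resume), here is a minimal sketch along the same lines. create_crawler, crawler.stats.get_stats() and the JOBDIR setting are standard Scrapy API, but the project names are the same illustrative placeholders as above:

import os

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from myproject.spiders import MySpider  # illustrative placeholder, as above

os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.my_settings_module'
settings = get_project_settings()
# JOBDIR persists the scheduler state on disk: killing the process pauses
# the crawl, and starting it again with the same JOBDIR resumes it.
settings.set('JOBDIR', 'crawls/myspider-1')

process = CrawlerProcess(settings)
crawler = process.create_crawler(MySpider)  # keep a handle to read stats later
process.crawl(crawler, start_urls=['http://example.com'])
process.start()  # blocks until the crawl finishes or is interrupted

# After the run, the stats collector holds the final counters
# (pages crawled, items scraped, finish reason, ...)
print(crawler.stats.get_stats())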
answered Sep 28 '22 by Amichai Schreiber