
Dynamic spider generation with Scrapy subclass init error

I am trying to write a generic "master" spider whose "start_urls" and "allowed_domains" are inserted dynamically at execution time. (Eventually, these will live in a database; I will pull them and use them to initialize and crawl a new spider for each DB entry.)

At the moment, I have two files:

  1. MySpider.py -- Establishes my "master" spider class.
  2. RunSpider.py -- Proof of concept for initializing and running my dynamically generated spiders.

For writing these two files, I referenced the following:

  • Passing Arguments into spiders at Scrapy.org
  • Running Scrapy from a script at Scrapy.org
  • General Spider structure within Python at Scrapy.org
  • These two questions here on Stack Overflow were the best help I could find: Creating a generic scrapy spider; Scrapy start_urls

I considered Scrapyd, but I don't think it's what I'm looking for...

Here is what I have written:

MySpider.py --

import scrapy

class BlackSpider(scrapy.Spider):
    name = 'Black1'

    def __init__(self, allowed_domains=[], start_urls=[], *args, **kwargs):
        super(BlackSpider, self).__init__(*args, **kwargs)
        self.start_urls = start_urls
        self.allowed_domains = allowed_domains
        #For Testing: 
        print start_urls
        print self.start_urls
        print allowed_domains
        print self.allowed_domains

    def parse(self, response):
        #############################
        # Insert my parse code here #
        #############################
        return items

RunSpider.py --

import scrapy
from scrapy.crawler import CrawlerProcess
from MySpider import BlackSpider

#Set my allowed domain (this will come from DB later)
ad = ["example.com"]
#Set my start url
sd = ["http://example.com/files/subfile/dir1"]

#Initialize MySpider with the above allowed domain and start url
MySpider = BlackSpider(ad,sd)

#Crawl MySpider
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()

PROBLEM:

Here is my problem: when I execute this, it appears to successfully pass in my arguments for allowed_domains and start_urls; however, once the spider actually runs the crawl, the specified URLs and domains are no longer found and no website is crawled. I added the print statements above to show this:

me@mybox:~/$ python RunSpider.py 
['http://example.com/files/subfile/dir1']
['http://example.com/files/subfile/dir1']
['example.com']
['example.com']
2016-02-26 16:11:41 [scrapy] INFO: Scrapy 1.0.5 started (bot: scrapybot)
...
2016-02-26 16:11:41 [scrapy] INFO: Overridden settings: {'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'}
...
[]
[]
[]
[]
2016-02-26 16:11:41 [scrapy] INFO: Spider opened
...
2016-02-26 16:11:41 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
...
2016-02-26 16:11:41 [scrapy] INFO: Closing spider (finished)
...
2016-02-26 16:11:41 [scrapy] INFO: Spider closed (finished)

Why does my spider initialize correctly, yet the URLs are missing when the spider actually runs? Is this a basic Python (class?) error that I am just missing?



1 Answer

Please refer to the documentation on CrawlerProcess

  • CrawlerProcess.crawl() expects either a Crawler or a scrapy.Spider subclass, not a Spider instance
  • Spider arguments should be passed as additional keyword arguments to .crawl()

So you need to do something like this:

import scrapy
from scrapy.crawler import CrawlerProcess
from MySpider import BlackSpider

#Set my allowed domain (this will come from DB later)
ad = ["example.com"]
#Set my start url
sd = ["http://example.com/files/subfile/dir1"]

#Crawl MySpider
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
# pass the Spider class, and spider arguments as keyword arguments
process.crawl(BlackSpider, allowed_domains=ad, start_urls=sd)
process.start()
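
Since the start URLs and allowed domains will eventually come from your database, you can queue one crawl per DB entry before starting the reactor; all queued spiders run once process.start() is called. Here is a minimal sketch of that pattern, assuming a hypothetical fetch_targets() helper that stands in for your own DB query:

from scrapy.crawler import CrawlerProcess
from MySpider import BlackSpider

def fetch_targets():
    # Hypothetical stand-in for your DB query: each row yields
    # (allowed_domains, start_urls) for one spider run.
    return [
        (["example.com"], ["http://example.com/files/subfile/dir1"]),
        (["example.org"], ["http://example.org/some/other/dir"]),
    ]

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

# Queue one crawl per DB entry; nothing runs until start() is called
for allowed, starts in fetch_targets():
    process.crawl(BlackSpider, allowed_domains=allowed, start_urls=starts)

process.start()  # blocks here until every queued crawl has finished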

You can see this in action with scrapy commands themselves, for example scrapy runspider:

def run(self, args, opts):
    ...
    spidercls = spclasses.pop()

    self.crawler_process.crawl(spidercls, **opts.spargs)
    self.crawler_process.start()
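
As a side note, if I remember the base class correctly, scrapy.Spider.__init__ already copies any keyword arguments onto the instance, so with crawl() supplying allowed_domains and start_urls as keyword arguments, the custom __init__ in MySpider.py may not even be necessary. A minimal sketch, assuming that behavior:

import scrapy

class BlackSpider(scrapy.Spider):
    name = 'Black1'
    # No custom __init__: keyword arguments passed to crawl()
    # (allowed_domains, start_urls) are assumed to be set as
    # instance attributes by the base scrapy.Spider class.

    def parse(self, response):
        #############################
        # Insert my parse code here #
        #############################
        pass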