
Pass argument to scrapy spider within a python script

I can run a crawl from a Python script with the following recipe from the wiki:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings

spider = FollowAllSpider(domain='scrapinghub.com')
settings = get_project_settings()
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run()

As you can see, I can simply pass the domain to FollowAllSpider, but my question is: how can I pass the start_urls (actually a date that will be appended to a fixed URL) to my spider class using the code above?

This is my spider class:

class MySpider(CrawlSpider):
    name = 'tw'
    def __init__(self,date):
        y,m,d=date.split('-') #this is a test , it could split with regex! 
        try:
            y,m,d=int(y),int(m),int(d)

        except ValueError:
            raise 'Enter a valid date'

        self.allowed_domains = ['mydomin.com']
        self.start_urls = ['my_start_urls{}-{}-{}'.format(y,m,d)]

    def parse(self, response):
        questions = Selector(response).xpath('//div[@class="result-link"]/span/a/@href') 
        for question in questions:
            item = PoptopItem()
            item['url'] = question.extract()
            yield item['url']

And this is my script:

from pdfcreator import convertor
from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
#from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings
from poptop.spiders.stackoverflow_spider import MySpider
from poptop.items import PoptopItem

settings = get_project_settings()
crawler = Crawler(settings) 
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()

date=raw_input('Enter the date with this format (d-m-Y) : ')
print date
spider=MySpider(date=date)
crawler.crawl(spider)
crawler.start()
log.start()
item=PoptopItem()

for url in item['url']:
    convertor(url)

reactor.run() # the script will block here until the spider_closed signal was sent

If I just print the item, I get the following error:

2015-02-25 17:13:47+0330 [tw] ERROR: Spider must return Request, BaseItem or None, got 'unicode' in <GET test-link2015-1-17>

items:

import scrapy


class PoptopItem(scrapy.Item):
    titles= scrapy.Field()
    content= scrapy.Field()
    url=scrapy.Field()

Asked by Mazdak on Feb 24 '15 at 20:02




1 Answer

You need to modify your __init__() constructor to accept the date argument. Also, I would use datetime.strptime() to parse the date string:

from datetime import datetime

class MySpider(CrawlSpider):
    name = 'tw'
    allowed_domains = ['test.com']

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs) 

        date = kwargs.get('date')
        if not date:
            raise ValueError('No date given')

        dt = datetime.strptime(date, "%m-%d-%Y")
        self.start_urls = ['http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)]

Then, you would instantiate the spider this way:

spider = MySpider(date='01-01-2015')
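
The date handling can be checked on its own, without Scrapy. The following is a minimal sketch, assuming the same placeholder host (`test.com`) and the same month-day-year input format as above; `build_start_url`, `DATE_FORMAT` and `URL_TEMPLATE` are hypothetical names introduced only for this illustration:

```python
from datetime import datetime

# Assumed values, mirroring the placeholders used in the spider above.
DATE_FORMAT = "%m-%d-%Y"
URL_TEMPLATE = "http://test.com/{dt.year}-{dt.month}-{dt.day}"


def build_start_url(date_string):
    """Parse a month-day-year string and return the start URL for that day."""
    try:
        dt = datetime.strptime(date_string, DATE_FORMAT)
    except ValueError:
        # strptime raises ValueError for malformed or impossible dates.
        raise ValueError("Enter a valid date (MM-DD-YYYY): %r" % (date_string,))
    return URL_TEMPLATE.format(dt=dt)


url = build_start_url("01-17-2015")  # 'http://test.com/2015-1-17'
```

Because `strptime()` already rejects bad input with a `ValueError`, the manual `split('-')` and `int()` conversion from the question is probably not needed.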

Or, you can even avoid parsing the date at all, passing a datetime instance in the first place:

class MySpider(CrawlSpider):
    name = 'tw'
    allowed_domains = ['test.com']

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs) 

        dt = kwargs.get('dt')
        if not dt:
            raise ValueError('No date given')

        self.start_urls = ['http://test.com/{dt.year}-{dt.month}-{dt.day}'.format(dt=dt)]

spider = MySpider(dt=datetime(year=2014, month=1, day=1))
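
One detail worth knowing when building the URL from a date object: `{dt.month}` and `{dt.day}` produce unpadded numbers (`1`, not `01`). If the target site expects zero-padded dates, a format spec can be used instead. This is a sketch under the assumption that the URL has the shape shown above; `unpadded_url`, `padded_url` and `BASE_URL` are hypothetical names:

```python
from datetime import date, datetime

BASE_URL = "http://test.com/"  # placeholder host, as in the answer


def unpadded_url(dt):
    """Attribute access gives plain integers: 2014-1-1."""
    return (BASE_URL + "{dt.year}-{dt.month}-{dt.day}").format(dt=dt)


def padded_url(dt):
    """A strftime-style format spec gives zero padding: 2014-01-01."""
    return (BASE_URL + "{dt:%Y-%m-%d}").format(dt=dt)


a = unpadded_url(datetime(2014, 1, 1))  # 'http://test.com/2014-1-1'
b = padded_url(date(2014, 1, 1))        # 'http://test.com/2014-01-01'
```

Both functions accept a `date` as well as a `datetime`, since only the year, month and day are used.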

And, just FYI, see this answer for a detailed example of how to run Scrapy from a script.
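
Separately, the error quoted in the question ("Spider must return Request, BaseItem or None, got 'unicode'") appears to come from `yield item['url']`, which yields the URL string rather than the item itself. A minimal sketch of the difference follows; `Item` here is a hypothetical dict-based stand-in for `scrapy.Item`, used only so the example runs without Scrapy installed:

```python
class Item(dict):
    """Hypothetical stand-in for scrapy.Item (not the real class)."""


def parse_yielding_strings(hrefs):
    """Mirrors the question: yields item['url'], i.e. a plain string."""
    for href in hrefs:
        item = Item()
        item['url'] = href
        yield item['url']


def parse_yielding_items(hrefs):
    """Yields the item object itself, which is what a spider callback should return."""
    for href in hrefs:
        item = Item()
        item['url'] = href
        yield item


hrefs = ['http://test.com/a', 'http://test.com/b']
strings = list(parse_yielding_strings(hrefs))
items = list(parse_yielding_items(hrefs))
```

In the real spider, the corresponding change would be `yield item` in place of `yield item['url']`.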

Answered by alecxe on Sep 20 '22 at 05:09