I want to use scrapy for crawling web pages. Is there a way to pass the start URL from the terminal itself?
The documentation says that either the spider name or the URL can be given, but when I give the URL it throws an error:
(The name of my spider is example, but I am giving the URL instead of the spider name; it works fine if I give the spider name.)
scrapy crawl example.com
ERROR:
File "/usr/local/lib/python2.7/dist-packages/Scrapy-0.14.1-py2.7.egg/scrapy/spidermanager.py", line 43, in create raise KeyError("Spider not found: %s" % spider_name) KeyError: 'Spider not found: example.com'
How can I make Scrapy use my spider on the URL given in the terminal?
I'm not really sure about the command-line option. However, you could write your spider like this.
from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = 'my_spider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # -a start_url=... from the command line arrives here as a keyword argument
        self.start_urls = [kwargs.get('start_url')]
And start it like:
scrapy crawl my_spider -a start_url="http://some_url"
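Note that if the spider is started without -a start_url=..., kwargs.get('start_url') returns None and start_urls ends up as [None]. A small guard avoids that (a sketch, not part of the original answer):

    def __init__(self, start_url=None, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Fall back to an empty list when no start_url argument was given
        self.start_urls = [start_url] if start_url else []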
An even easier way to allow multiple URL arguments than what Peter suggested is to pass them as a single string with the URLs separated by commas, like this:
-a start_urls="http://example1.com,http://example2.com"
In the spider you then simply split the string on ',' to get a list of URLs:
self.start_urls = kwargs.get('start_urls').split(',')
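Put together, a minimal sketch of the multi-URL variant (the spider name and argument name are assumptions carried over from the example above):

from scrapy.spider import BaseSpider

class MySpider(BaseSpider):
    name = 'my_spider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Split the comma-separated string passed via -a start_urls=...
        self.start_urls = kwargs.get('start_urls').split(',')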
Use the scrapy parse command. You can parse a URL with your spider; the URL is passed on the command line.
$ scrapy parse http://www.example.com/ --spider=spider-name
http://doc.scrapy.org/en/latest/topics/commands.html#parse