I am trying to build an application using Flask and Scrapy. I have to pass a list of URLs to the spider. I tried using the following syntax:
In the spider's __init__:

    self.start_urls = ["http://www.google.com/patents/" + x for x in u]

In the Flask method:

    u = ["US6249832", "US20120095946"]
    os.system("rm static/s.json; scrapy crawl patents -d u=%s -o static/s.json" % u)
I know a similar thing can be done by reading the required URLs from a file, but can I pass a list of URLs for crawling directly?
The spider will receive arguments in its constructor. Scrapy sets all command-line arguments as spider attributes, so you can skip the __init__ method completely. Be sure to use the getattr method for reading those attributes, so your code does not break when an argument is missing. Succinct, robust and flexible!
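For illustration, here is a minimal sketch of that pattern; DummySpider is a hypothetical stand-in, since in real Scrapy the framework itself performs the attribute-setting step:

```python
class DummySpider:
    # Scrapy sets every -a command-line argument as an attribute on the
    # spider instance; this stand-in mimics that behaviour.
    def __init__(self, **kwargs):
        for key, value in kwargs.items():
            setattr(self, key, value)

spider = DummySpider(start_urls="US6249832,US20120095946")

# getattr with a default keeps the code working when the argument is omitted.
raw = getattr(spider, 'start_urls', '')
urls = ["http://www.google.com/patents/" + x for x in raw.split(',') if x]
```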
cb_kwargs. A dictionary that contains arbitrary metadata for this request. Its contents will be passed to the Request's callback as keyword arguments. It is empty for new Requests, which means by default callbacks only get a Response object as argument.
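To make that mechanism concrete, here is a plain-Python sketch; fake_request and parse_patent are hypothetical stand-ins, since in real Scrapy the engine stores cb_kwargs on the Request and applies them when invoking the callback:

```python
def fake_request(url, callback, cb_kwargs=None):
    # Stand-in for scrapy.Request: the engine would fetch the page and
    # then call callback(response, **cb_kwargs).
    response = {"url": url}  # placeholder for a real Response object
    return callback(response, **(cb_kwargs or {}))

def parse_patent(response, patent_id=None):
    # patent_id arrives as a keyword argument via cb_kwargs.
    return (response["url"], patent_id)

result = fake_request("http://www.google.com/patents/US6249832",
                      parse_patent,
                      cb_kwargs={"patent_id": "US6249832"})
```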
Scrapy provides a powerful framework for extracting data, processing it, and saving it. Scrapy uses spiders, which are self-contained crawlers that are given a set of instructions [1]. Scrapy makes it easier to build and scale large crawling projects by allowing developers to reuse their code.
Override the spider's __init__() method:
    class MySpider(Spider):
        name = 'my_spider'

        def __init__(self, *args, **kwargs):
            super(MySpider, self).__init__(*args, **kwargs)
            endpoints = kwargs.get('start_urls').split(',')
            self.start_urls = ["http://www.google.com/patents/" + x for x in endpoints]
And pass the list of endpoints through the -a command-line argument:

    scrapy crawl patents -a start_urls="US6249832,US20120095946" -o static/s.json
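On the Flask side, that exact command can be assembled safely with subprocess instead of os.system; a sketch, with the spider name and output path taken from the question:

```python
import subprocess

u = ["US6249832", "US20120095946"]
# Join the ids into the comma-separated string the spider's __init__ expects.
cmd = ["scrapy", "crawl", "patents",
       "-a", "start_urls=" + ",".join(u),
       "-o", "static/s.json"]
# subprocess.run(cmd) would launch the crawl; passing an argument list
# avoids the shell-quoting problems of os.system with "%s" % u.
```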
Note that you can also run Scrapy from a script instead of the scrapy crawl command.