Passing list as arguments in Scrapy

I am trying to build an application using Flask and Scrapy. I have to pass a list of URLs to the spider. I tried the following:

In the spider's __init__:
self.start_urls = ["http://www.google.com/patents/" + x for x in u]

In the Flask method:
u = ["US6249832", "US20120095946"]
os.system("rm static/s.json; scrapy crawl patents -d u=%s -o static/s.json" % u)

I know a similar thing can be done by reading a file containing the required URLs, but can I pass the list of URLs directly for crawling?

asked Feb 16 '15 by Sumit Gera

People also ask

How are arguments passed in Scrapy?

The spider will receive arguments in its constructor. Scrapy puts all the arguments on the spider as attributes, so you can skip the __init__ method completely. Be careful to use the getattr method when reading those attributes so your code does not break when an argument is missing.
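As a minimal sketch of that attribute-based approach (the spider name and the category argument here are hypothetical, purely for illustration):

from scrapy import Spider

class PatentsSpider(Spider):
    name = 'patents'

    def parse(self, response):
        # Anything passed with -a (e.g. -a category=design) becomes a spider
        # attribute; getattr with a default keeps the code working when the
        # argument is omitted.
        category = getattr(self, 'category', 'all')
        self.logger.info("category=%s url=%s", category, response.url)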

What is cb_ kwargs?

cb_kwargs. A dictionary that contains arbitrary metadata for this request. Its contents will be passed to the Request's callback as keyword arguments. It is empty for new Requests, which means that by default callbacks only get a Response object as an argument.
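A minimal sketch of cb_kwargs in use (assumes Scrapy 1.7+, where cb_kwargs was introduced; the URL and the page_label key are purely illustrative):

import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # The cb_kwargs dict is unpacked into the callback's keyword arguments.
        yield scrapy.Request(
            response.url,
            callback=self.parse_page,
            cb_kwargs={'page_label': 'front page'},
            dont_filter=True,
        )

    def parse_page(self, response, page_label):
        self.logger.info("%s: %s", page_label, response.url)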

What is crawl in Scrapy?

Scrapy provides a powerful framework for extracting data, processing it, and then saving it. Scrapy uses spiders, which are self-contained crawlers that are given a set of instructions [1]. Scrapy makes it easier to build and scale large crawling projects by allowing developers to reuse their code.


1 Answer

Override the spider's __init__() method:

class MySpider(Spider):
    name = 'my_spider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)

        # "start_urls" arrives as a single comma-separated string via -a
        endpoints = kwargs.get('start_urls').split(',')
        self.start_urls = ["http://www.google.com/patents/" + x for x in endpoints]

Then pass the list of endpoints as a comma-separated string through the -a command-line argument:

scrapy crawl patents -a start_urls="US6249832,US20120095946" -o static/s.json
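For illustration, the Flask side of the original question could build that command by joining the IDs with commas (a sketch only, reusing the same paths as in the question):

import os

u = ["US6249832", "US20120095946"]
# Join the patent IDs into one comma-separated value for -a;
# the spider's __init__ splits it back into a list.
os.system(
    "rm static/s.json; scrapy crawl patents -a start_urls=%s -o static/s.json"
    % ",".join(u)
)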

See also:

  • How to give URL to scrapy for crawling?

Note that you can also run Scrapy from a script; a minimal sketch follows the links below:

  • How to run Scrapy from within a Python script
  • Scrapy Very Basic Example
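As a rough sketch of that approach (it assumes the code runs inside the same Scrapy project, so the 'patents' spider can be looked up by name):

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
# Keyword arguments here reach the spider's __init__ just like -a does.
process.crawl('patents', start_urls="US6249832,US20120095946")
process.start()  # blocks until the crawl finishes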
answered Sep 20 '22 by alecxe