 

Passing arguments to process.crawl in Scrapy python

I would like to get the same result as this command line:

scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json

My script is as follows:

import scrapy
from linkedin_anonymous_spider import LinkedInAnonymousSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

spider = LinkedInAnonymousSpider(None, "James", "Bond")
process = CrawlerProcess(get_project_settings())
process.crawl(spider)  ## <-------------- (1)
process.start()

I found out that process.crawl() in (1) creates another LinkedInAnonymousSpider in which first and last are None (printed in (2)). If that is the case, there is no point in creating the spider object, so how can I pass the arguments first and last to process.crawl()?

linkedin_anonymous spider:

from logging import INFO

import scrapy


class LinkedInAnonymousSpider(scrapy.Spider):
    name = "linkedin_anonymous"
    allowed_domains = ["linkedin.com"]
    start_urls = []

    base_url = "https://www.linkedin.com/pub/dir/?first=%s&last=%s&search=Search"

    def __init__(self, input=None, first=None, last=None):
        self.input = input  # source file name
        self.first = first
        self.last = last

    def start_requests(self):
        print(self.first)  ## <------------- (2)
        if self.first and self.last:  # taking input from command-line parameters
            url = self.base_url % (self.first, self.last)
            yield self.make_requests_from_url(url)

    def parse(self, response):
        ...
asked Dec 20 '15 by yusuf


People also ask

How are arguments passed in Scrapy?

The spider receives the arguments in its constructor. Scrapy also sets all of the arguments as spider attributes, so you can skip the __init__ method completely. Be careful to read those attributes with getattr so your code does not break when an argument is omitted.
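A minimal sketch of that pattern (the spider class, name, and URL template below are only illustrative):

import scrapy

class PersonSpider(scrapy.Spider):
    name = "person"

    def start_requests(self):
        # "first" and "last" are set by Scrapy from -a first=... -a last=...;
        # getattr with a default keeps the spider working when they are omitted.
        first = getattr(self, "first", None)
        last = getattr(self, "last", None)
        if first and last:
            url = "https://www.linkedin.com/pub/dir/?first=%s&last=%s&search=Search" % (first, last)
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.info(response.url)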

How do you run a Scrapy Crawler?

The key to running Scrapy from a Python script is the CrawlerProcess class, found in the scrapy.crawler module. It provides the engine that runs Scrapy inside a Python script; internally, CrawlerProcess relies on Python's Twisted framework.

What does Scrapy crawl do?

Scrapy is an application framework for writing web spiders that crawl web sites and extract data from them. Scrapy provides a built-in mechanism for extracting data (called selectors) but you can easily use BeautifulSoup (or lxml) instead, if you feel more comfortable working with them.
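For example, selectors can be used directly on a response (or on raw HTML) to pull out fields; the class names and markup below are made up for illustration:

from scrapy.selector import Selector

html = '<ul><li class="result"><h3>James Bond</h3><a href="/in/jbond">profile</a></li></ul>'
sel = Selector(text=html)

for row in sel.css("li.result"):
    print(row.css("h3::text").get())       # "James Bond"
    print(row.css("a::attr(href)").get())  # "/in/jbond"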

How do you use multiple spiders in Scrapy?

We use the CrawlerProcess class to run multiple Scrapy spiders in the same process simultaneously. We create an instance of CrawlerProcess with the project settings and call crawl() once per spider; we only need to create a Crawler instance for a spider when that spider requires custom settings. A rough sketch follows.
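Sketch only; "spider_one" and "spider_two" are hypothetical spider names registered in the same Scrapy project:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
process.crawl("spider_one")  # each crawl() call schedules one spider
process.crawl("spider_two")
process.start()  # runs both crawls in the same reactor and blocks until they finish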


2 Answers

Pass the spider arguments in the process.crawl() call:

process.crawl(spider, input='inputargument', first='James', last='Bond') 
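Put together with the script from the question, a minimal end-to-end version might look like this; note that the spider class is handed to crawl() rather than a pre-built instance, because CrawlerProcess constructs the spider itself and forwards the keyword arguments to its constructor:

from linkedin_anonymous_spider import LinkedInAnonymousSpider
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())
# The keyword arguments end up in LinkedInAnonymousSpider.__init__,
# so self.first and self.last are no longer None in start_requests.
process.crawl(LinkedInAnonymousSpider, input=None, first="James", last="Bond")
process.start()  # blocks until the crawl is finished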
answered by eLRuLL


You can do it the easy way:

from scrapy import cmdline

cmdline.execute("scrapy crawl linkedin_anonymous -a first=James -a last=Bond -o output.json".split())
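Keep in mind that cmdline.execute hands control to Scrapy's command-line machinery and normally ends the Python process once the crawl finishes, so any code placed after that call is unlikely to run.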
answered by Manualmsdos