Python Scrapy: What is the difference between "runspider" and "crawl" commands?

Can someone explain the difference between runspider and crawl commands? What are the contexts in which they should be used?

asked Jun 03 '16 by RpB


2 Answers

In the command:

scrapy crawl [options] <spider>

<spider> is the name of a spider registered in the project (the name attribute defined on the Spider class).

And in the command:

scrapy runspider [options] <spider_file>

<spider_file> is the path to the file that contains the spider.

Otherwise, the options are the same:

Options
=======
--help, -h              show this help message and exit
-a NAME=VALUE           set spider argument (may be repeated)
--output=FILE, -o FILE  dump scraped items into FILE (use - for stdout)
--output-format=FORMAT, -t FORMAT
                        format to use for dumping items with -o

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--lsprof=FILE           write lsprof profiling stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure

Since runspider doesn't depend on a project's settings, you might find it more flexible, depending on how your scrapers are organised.

answered Sep 23 '22 by Ivan Chaer


The main difference is that runspider does not need a project. That is, you can write a spider in a myspider.py file and call scrapy runspider myspider.py.

The crawl command requires a project in order to find the project's settings, load the available spiders from the SPIDER_MODULES setting, and look up the spider by name.

If you need a quick spider for a short task, runspider requires less boilerplate.

answered Sep 21 '22 by R. Max