I want to be able to start/pause/resume a spider, and I'm trying to use
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
However, it's mostly just copy-and-paste on my part, as there isn't much info about what's actually going on here. Does anyone have more info on the specifics?
I get the first part, but I don't know what's actually happening with the -s JOBDIR=crawls/somespider-1 part. I see people writing the command like this
scrapy crawl somespider -s JOBDIR=crawls/somespider
.. without the -1, and I don't know what difference that makes. I did notice this: I tend to pound Ctrl+C to quit, and that's apparently bad, from what I've read and what I've experienced, because if I re-run the command
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
.. it goes straight to finished, as if the spider were done.
How do I "reset" it after I make that mistake? If I take out the -1 it will work again, but I don't know if I'm losing something there.
As explained in the docs, Scrapy allows pausing and resuming crawls, but you need the JOBDIR setting.
The JOBDIR value is supposed to be the path to a directory on your filesystem where Scrapy persists the various objects it needs in order to resume where it left off.
Note that separate crawls need to point to different directories:
This directory will be for storing all required data to keep the state of a single job (ie. a spider run). It’s important to note that this directory must not be shared by different spiders, or even different jobs/runs of the same spider, as it’s meant to be used for storing the state of a single job.
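To make "storing the state of a single job" concrete: among other things, Scrapy keeps a requests.seen file in the JOBDIR so that, on resume, the duplicate filter can skip requests it has already made. Here is a rough sketch of that idea (not Scrapy's actual implementation; load_seen and mark_seen are made-up names for illustration):

```python
import os

def load_seen(jobdir):
    """Load the set of already-seen request fingerprints from a job directory."""
    path = os.path.join(jobdir, "requests.seen")
    if not os.path.exists(path):
        # Fresh job directory: nothing has been crawled yet.
        return set()
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def mark_seen(jobdir, fingerprint):
    """Append a request fingerprint so a resumed run won't repeat it."""
    os.makedirs(jobdir, exist_ok=True)
    with open(os.path.join(jobdir, "requests.seen"), "a") as f:
        f.write(fingerprint + "\n")
```

This is also why the docs say the directory must not be shared: if two different jobs wrote fingerprints into the same file, the second job would wrongly skip URLs the first one had visited.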
Copying from that docs page:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
             ----------           -------------------
                  |                         |
          name of your spider               |
                                            |
              relative path where to save stuff
Another example scrapy crawl command using JOBDIR could be:
scrapy crawl myspider -s JOBDIR=/home/myuser/crawldata/myspider_run_32
Example timeline:
scrapy crawl myspider -s JOBDIR=/home/myuser/crawldata/myspider_run_001
# pause by pressing Ctrl-C once ...
# (pressing Ctrl-C a second time forces an unclean shutdown)
# ...let's continue where it left off
scrapy crawl myspider -s JOBDIR=/home/myuser/crawldata/myspider_run_001
# crawl finished properly.
# (and /home/myuser/crawldata/myspider_run_001 should not contain anything now)
# now you want to crawl a 2nd time, from the beginning
scrapy crawl myspider -s JOBDIR=/home/myuser/crawldata/myspider_run_002
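As for "resetting" after a job has finished (or been broken by a forced shutdown): since a JOBDIR that doesn't exist yet starts a fresh job, you can simply delete the job directory and reuse the same path. A sketch, using the example path from above:

```shell
# Discard the persisted job state (path is the example one used above;
# substitute your own JOBDIR path).
rm -rf /home/myuser/crawldata/myspider_run_001

# Re-running the same command now starts the crawl from the beginning:
# scrapy crawl myspider -s JOBDIR=/home/myuser/crawldata/myspider_run_001
```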