I want to be able to start/pause/resume a spider, and I'm trying to use
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
However, it's mostly just copy-and-paste on my part, as there isn't much info about what's actually going on here. Does anyone have more info on the specifics?
I get the first part, but I don't know what's actually happening with the -s JOBDIR=crawls/somespider-1 part. I see people writing the command like this
scrapy crawl somespider -s JOBDIR=crawls/somespider
.. without the -1, and I don't know what difference that makes. I did notice this: I tend to pound Ctrl+C to quit, and that's apparently bad, from what I've read and what I've experienced, because if I re-run the command
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
.. it goes straight to finished, as if the spider were done.
How do I "reset" it after I make that mistake? If I take out the -1 it will work again, but I don't know if I'm losing something there.
As explained in the docs, Scrapy allows pausing and resuming crawls, but you need the JOBDIR setting.
The JOBDIR value is supposed to be the path to a directory on your filesystem where Scrapy persists the various objects it needs in order to resume where it left off.
Note that separate crawls need to point to different directories:
This directory will be for storing all required data to keep the state of a single job (ie. a spider run). It’s important to note that this directory must not be shared by different spiders, or even different jobs/runs of the same spider, as it’s meant to be used for storing the state of a single job.
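To make "storing the state of a single job" concrete: among other things, Scrapy keeps a requests.seen file in the JOBDIR so that, on resume, the duplicate filter can skip requests it has already made. Here is a rough sketch of that idea (not Scrapy's actual implementation; load_seen and mark_seen are made-up names for illustration):

```python
import os

def load_seen(jobdir):
    """Load the set of already-seen request fingerprints from a job directory."""
    path = os.path.join(jobdir, "requests.seen")
    if not os.path.exists(path):
        # Fresh job directory: nothing has been crawled yet.
        return set()
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

def mark_seen(jobdir, fingerprint):
    """Append a request fingerprint so a resumed run won't repeat it."""
    os.makedirs(jobdir, exist_ok=True)
    with open(os.path.join(jobdir, "requests.seen"), "a") as f:
        f.write(fingerprint + "\n")
```

This is also why the docs say the directory must not be shared: if two different jobs wrote fingerprints into the same file, the second job would wrongly skip URLs the first one had visited.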
Copying from that docs page:
scrapy crawl somespider -s JOBDIR=crawls/somespider-1
             ----------           -------------------
                  |                         |
          name of your spider               |
                                            |
              relative path where to save stuff
Another example scrapy crawl command using JOBDIR could be:
scrapy crawl myspider -s JOBDIR=/home/myuser/crawldata/myspider_run_32
Example timeline:
scrapy crawl myspider -s JOBDIR=/home/myuser/crawldata/myspider_run_001
# pause by pressing Ctrl-C once ...
# (pressing Ctrl-C a second time forces an unclean shutdown)
# ...let's continue where it left off
scrapy crawl myspider -s JOBDIR=/home/myuser/crawldata/myspider_run_001
# crawl finished properly.
# (and /home/myuser/crawldata/myspider_run_001 should not contain anything now)
# now you want to crawl a 2nd time, from the beginning
scrapy crawl myspider -s JOBDIR=/home/myuser/crawldata/myspider_run_002
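As for "resetting" after a job has finished (or been broken by a forced shutdown): since a JOBDIR that doesn't exist yet starts a fresh job, you can simply delete the job directory and reuse the same path. A sketch, using the example path from above:

```shell
# Discard the persisted job state (path is the example one used above;
# substitute your own JOBDIR path).
rm -rf /home/myuser/crawldata/myspider_run_001

# Re-running the same command now starts the crawl from the beginning:
# scrapy crawl myspider -s JOBDIR=/home/myuser/crawldata/myspider_run_001
```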