Scrapy's JOBDIR setting provides resumable crawls, described here:
http://doc.scrapy.org/en/latest/topics/jobs.html
I try to execute my crawl command like this:
scrapy crawl myspider -o out.csv -t csv -s JOBDIR=./jobs/run-1
While it's still running, I shut it down gracefully by pressing CTRL-C, then fire the same command again to resume it. I can confirm from the terminal output that the crawl is being resumed:
[myspider] INFO: Resuming crawl (74 requests scheduled)
But when I view my output CSV file, I see there are duplicated items like this:
name,email
Alice,[email protected]
Bob,[email protected]
...
name,email <- duplicated header!
Bob,[email protected] <- duplicated row!
...
Is this normal? I wonder whether it's okay to use the -o option and JOBDIR in the same command. If not, how should I export the crawled items?
BTW, I'm using Scrapy 0.22.1.
Thanks!
Yes, this is to be expected. If you have a look at Scrapy's source code, particularly the CsvItemExporter, you'll find that it is stateless with respect to stopping and resuming crawls. The exporter handles the headers with two flags. The first, include_headers_line, tells it whether to include the headers at all. The second, _headers_not_written, prevents the headers from being written again for every scraped item after the first item of the session. These flags, however, are reset every time the crawler is started anew, and the exporter doesn't carry any information about resumed sessions:
class CsvItemExporter(BaseItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
        ....
        self._headers_not_written = True
        ....

    def export_item(self, item):
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)
Also, the -o option does nothing more than instruct the crawler to dump the scraped items into the specified output:
class Command(ScrapyCommand):
    ....
    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE", \
            help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE", \
            help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT", default="jsonlines", \
            help="format to use for dumping items with -o (default: %default)")