 

Scrapy - Getting duplicated items using JOBDIR

Tags:

python

csv

scrapy

Scrapy's JOBDIR setting provides resumable crawls, described here:

http://doc.scrapy.org/en/latest/topics/jobs.html

I try to execute my crawl command like this:

scrapy crawl myspider -o out.csv -t csv -s JOBDIR=./jobs/run-1

While it's still running, I shut it down gracefully by pressing CTRL-C, then fire the same command again to resume it. I can confirm from the terminal output that it resumes the crawl:

[myspider] INFO: Resuming crawl (74 requests scheduled)

But when I view my output CSV file, I see there are duplicated items like this:

name,email
Alice,[email protected]
Bob,[email protected]
...
name,email            <- duplicated header!
Bob,[email protected]   <- duplicated row!
...

Is this normal? I wonder if it's okay to use the -o option and JOBDIR in the same command. If not, how should I export the crawled items?

BTW, I'm using Scrapy 0.22.1.

Thanks!

asked Mar 06 '14 by eliang

1 Answer

Yes, this is to be expected. If you have a look at Scrapy's source code, particularly the CsvItemExporter, you'll find that it is stateless with respect to stopping and resuming crawls. The exporter handles the headers with two flags: include_headers_line instructs it whether to include the headers at all, and _headers_not_written prevents the headers from being dumped every time a new scraped item is written, so they are only written before the first item of the session. These flags are reset every time the crawler is restarted, however, and the exporter doesn't carry any information about resumed sessions:

class CsvItemExporter(BaseItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):

        ....
        self._headers_not_written = True
        ....

    def export_item(self, item):
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)

Also, the -o option does nothing more than instruct the crawler to dump the scraped items into the specified output:

class Command(ScrapyCommand):

    ....

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE", \
            help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE", \
            help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT", default="jsonlines", \
            help="format to use for dumping items with -o (default: %default)")
answered Sep 19 '22 by b2Wc0EKKOvLPn