 

Scrapy - Getting duplicated items using JOBDIR

Tags:

python

csv

scrapy

Scrapy's JOBDIR setting provides resumable crawls, described here:

http://doc.scrapy.org/en/latest/topics/jobs.html

I try to execute my crawl command like this:

scrapy crawl myspider -o out.csv -t csv -s JOBDIR=./jobs/run-1

While it's still running, I shut it down gracefully by pressing CTRL-C, then fire the same command again to resume it. I can confirm from the terminal output that it resumes the crawl:

[myspider] INFO: Resuming crawl (74 requests scheduled)

But when I view my output CSV file, I see there are duplicated items like this:

name,email
Alice,[email protected]
Bob,[email protected]
...
name,email            <- duplicated header!
Bob,[email protected]   <- duplicated row!
...

Is this normal? I wonder if it's okay to use the -o option and JOBDIR in the same command. If not, how should I export the crawled items?

BTW, I'm using Scrapy 0.22.1.

Thanks!

asked Mar 06 '14 by eliang

1 Answer

Yes, this is to be expected. If you have a look at Scrapy's source code, particularly the CsvItemExporter, you'll find that it is stateless with respect to stopping and resuming crawls. The exporter handles the headers with two flags: include_headers_line instructs it whether to include the headers at all, and _headers_not_written prevents the headers from being dumped every time a new scraped item is written, so they are only written before the first item of the session. These flags are reset every time the crawler is restarted, however, and the exporter doesn't carry any information about resumed sessions:

class CsvItemExporter(BaseItemExporter):

    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):

        ....
        self._headers_not_written = True
        ....

    def export_item(self, item):
        if self._headers_not_written:
            self._headers_not_written = False
            self._write_headers_and_set_fields_to_export(item)

Also, the -o option does nothing more than instruct the crawler to dump the scraped items into the specified output:

class Command(ScrapyCommand):

    ....

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE", \
            help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE", \
            help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT", default="jsonlines", \
            help="format to use for dumping items with -o (default: %default)")
answered Sep 19 '22 by b2Wc0EKKOvLPn