I'm scraping a website with Scrapy and would like to split the results into two parts. Usually I call Scrapy like this:
$ scrapy crawl articles -o articles.json
$ scrapy crawl authors -o authors.json
The two spiders are completely independent and don't communicate at all. This setup works for smaller websites, but larger websites have just too many authors for me to crawl like this.
How would I have the articles spider tell the authors spider what pages to crawl, while maintaining this two-file structure? Ideally, I'd rather not write the author URLs to a file and then read it back with the other spider.
I ended up using command-line arguments for the author scraper:
import json

from scrapy.spider import BaseSpider

class AuthorSpider(BaseSpider):
    ...

    def __init__(self, articles, **kwargs):
        super(AuthorSpider, self).__init__(**kwargs)
        # `articles` is the path to the feed written by the articles spider.
        # Each line must be a single JSON object (JSON Lines format).
        self.start_urls = []
        with open(articles) as f:
            for line in f:
                article = json.loads(line)
                self.start_urls.append(article['author_url'])
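The spider above assumes each line of the articles feed is a standalone JSON object containing an author_url field. A quick sketch of the parsing step, with a made-up record (the field values here are purely illustrative):

```python
import json

# Hypothetical single line from the articles feed; the real items
# must contain an 'author_url' field for the spider above to work.
line = '{"title": "Example article", "author_url": "http://example.com/authors/jane"}'

article = json.loads(line)
print(article['author_url'])  # http://example.com/authors/jane
```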
Then, I added the duplicates pipeline outlined in the Scrapy documentation:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item['id'])
        return item
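For the pipeline to actually run, it also has to be enabled in the project's settings.py. A minimal sketch, assuming the class lives in a module named myproject.pipelines (adjust the dotted path to your project layout):

```python
# settings.py -- 'myproject.pipelines' is a hypothetical module path;
# the integer is the pipeline's priority (lower runs earlier).
ITEM_PIPELINES = {
    'myproject.pipelines.DuplicatesPipeline': 300,
}
```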
Finally, I passed the article JSON lines file into the command:
$ scrapy crawl authors -o authors.json -a articles=articles.json
It's not a great solution, but it works.