 

Writing to multiple files with Scrapy

I'm scraping a website with Scrapy and would like to split the results into two parts. Usually I call Scrapy like this:

$ scrapy crawl articles -o articles.json
$ scrapy crawl authors -o authors.json

The two spiders are completely independent and don't communicate at all. This setup works for smaller websites, but larger sites simply have too many authors to crawl this way.

How would I have the articles spider tell the authors spider which pages to crawl, while keeping this two-file structure? Ideally, I'd rather not write the author URLs to a file and then read them back with the other spider.

asked Feb 03 '13 by Blender

1 Answer

I ended up using a command-line argument for the author scraper:

import json

from scrapy.spider import BaseSpider


class AuthorSpider(BaseSpider):
    ...

    def __init__(self, articles, **kwargs):
        super(AuthorSpider, self).__init__(**kwargs)
        self.start_urls = []

        # `articles` is the path to the feed file passed in via -a articles=...
        with open(articles) as f:
            for line in f:
                article = json.loads(line)
                self.start_urls.append(article['author_url'])
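
For this to work, the articles spider has to include the author page URL in every item it exports. The original answer doesn't show that spider, so here is a rough sketch of what it might look like; the item class, spider name, selectors and example URL are all assumptions, and only the author_url field matches what the author spider reads back:

from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

class ArticleItem(Item):
    title = Field()
    author_url = Field()

class ArticleSpider(BaseSpider):
    name = 'articles'
    start_urls = ['http://example.com/articles']  # placeholder URL

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        for article in hxs.select('//div[@class="article"]'):
            item = ArticleItem()
            item['title'] = article.select('.//h2/text()').extract()[0]
            # author_url is the field the author spider reads back from the feed
            item['author_url'] = article.select('.//a[@rel="author"]/@href').extract()[0]
            yield item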

Then, I added the duplicates pipeline outlined in the Scrapy documentation:

from scrapy.exceptions import DropItem


class DuplicatesPipeline(object):
    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        # drop any item whose 'id' was already seen earlier in the crawl
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        else:
            self.ids_seen.add(item['id'])
            return item
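
For the pipeline to actually run, it has to be enabled in the project's settings.py. A minimal sketch, assuming the project package is called myproject (a placeholder, not part of the original answer):

# settings.py -- 'myproject' is a placeholder for your project's package name
ITEM_PIPELINES = {
    'myproject.pipelines.DuplicatesPipeline': 300,
}
# older Scrapy versions expect a plain list instead:
# ITEM_PIPELINES = ['myproject.pipelines.DuplicatesPipeline']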

Finally, I passed the article JSON lines file into the command:

$ scrapy crawl authors -o authors.json -a articles=articles.json
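
One caveat: the spider's __init__ parses the feed one JSON object per line, so the articles feed has to be exported as JSON lines rather than as a single JSON array. For example (same filenames as above, switched to the .jl extension; older Scrapy versions may need -t jsonlines instead):

$ scrapy crawl articles -o articles.jl
$ scrapy crawl authors -o authors.json -a articles=articles.jl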

It's not a great solution, but it works.

answered Oct 03 '22 by Blender