I'm scraping a website with Scrapy and would like to split the results into two parts. Usually I call Scrapy like this:
$ scrapy crawl articles -o articles.json
$ scrapy crawl authors -o authors.json
The two spiders are completely independent and don't communicate at all. This setup works for smaller websites, but larger websites have just too many authors for me to crawl like this.
How would I have the articles spider tell the authors spider what pages to crawl, while maintaining this two-file structure? Ideally, I'd rather not write the author URLs to a file and then read it back with the other spider.
I ended up using command-line arguments for the author scraper:
import json

from scrapy.spider import BaseSpider

class AuthorSpider(BaseSpider):
    ...

    def __init__(self, articles, **kwargs):
        super(AuthorSpider, self).__init__(**kwargs)
        # `articles` is the path to the feed written by the articles spider.
        # Each line must be a single JSON object (JSON Lines format).
        self.start_urls = []
        with open(articles) as f:
            for line in f:
                article = json.loads(line)
                self.start_urls.append(article['author_url'])
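The spider above assumes each line of the articles feed is a standalone JSON object containing an author_url field. A quick sketch of the parsing step, with a made-up record (the field values here are purely illustrative):

```python
import json

# Hypothetical single line from the articles feed; the real items
# must contain an 'author_url' field for the spider above to work.
line = '{"title": "Example article", "author_url": "http://example.com/authors/jane"}'

article = json.loads(line)
print(article['author_url'])  # http://example.com/authors/jane
```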
Then, I added the duplicates pipeline outlined in the Scrapy documentation:
from scrapy.exceptions import DropItem

class DuplicatesPipeline(object):

    def __init__(self):
        self.ids_seen = set()

    def process_item(self, item, spider):
        if item['id'] in self.ids_seen:
            raise DropItem("Duplicate item found: %s" % item)
        self.ids_seen.add(item['id'])
        return item
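For the pipeline to actually run, it also has to be enabled in the project's settings.py. A minimal sketch, assuming the class lives in a module named myproject.pipelines (adjust the dotted path to your project layout):

```python
# settings.py -- 'myproject.pipelines' is a hypothetical module path;
# the integer is the pipeline's priority (lower runs earlier).
ITEM_PIPELINES = {
    'myproject.pipelines.DuplicatesPipeline': 300,
}
```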
Finally, I passed the article JSON lines file into the command:
$ scrapy crawl authors -o authors.json -a articles=articles.json
It's not a great solution, but it works.