
Python Scrapy how to save data in different files

Tags:

python

scrapy

I want to save each quote from http://quotes.toscrape.com/ to a csv file (two fields: author, quote). I also need to split these quotes into different files according to the page they reside on, i.e. page1.csv, page2.csv, and so on. I have tried to achieve this by declaring feed exports in the custom_settings attribute of my spider, as shown below. This, however, doesn't even produce a file called page-1.csv. I am a total beginner with scrapy, so please explain assuming I know little to nothing.

import scrapy
import urllib

class spidey(scrapy.Spider):
    name = "idk"
    start_urls = [
        "http://quotes.toscrape.com/"
    ]

    custom_settings = {
        'FEEDS' : {
            'file://page-1.csv' : { #edit: uri needs to be absolute path
                'format' : 'csv',
                'store_empty' : True
            }
        },
        'FEED_EXPORT_ENCODING' : 'utf-8',
        'FEED_EXPORT_FIELDS' : ['author', 'quote']
    }
    

    def parse(self, response):
        for qts in response.xpath("//*[@class=\"quote\"]"):
            author = qts.xpath("./span[2]/small/text()").get()
            quote = qts.xpath("./*[@class=\"text\"]/text()").get()
            yield {
                'author' : author,
                'quote' : quote
                }

        next_pg = response.xpath('//li[@class="next"]/a/@href').get()      
        if next_pg is not None:
            next_pg = urllib.parse.urljoin(self.start_urls[0], next_pg)
            yield scrapy.Request(next_pg, self.parse)

How I ran the crawler: scrapy crawl idk

As an added question, I need my files to be overwritten rather than appended to, which is what happens when specifying the -o flag. Is it possible to do this without having to manually check for and delete preexisting files from the spider?

asked Nov 07 '22 by Silver Flash


1 Answer

Saving your items into a file named after the page you found them on is (as far as I know) not supported in settings. If you want to achieve this, you could build that functionality yourself with Python's open function and csv.writer in your parse method, as sketched below. An alternative option would be to write an item pipeline which manages different item exporters for different files.
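
A rough sketch of that first approach could look something like the following (the spider name and filename pattern are just placeholders I made up, so treat it as an illustration rather than a drop-in solution):

import csv
import scrapy


class QuotesPerPageSpider(scrapy.Spider):
    # illustrative spider: writes each result page's quotes to its own csv file
    name = "quotes_per_page"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response, page_no=1):
        # open page-<n>.csv in write mode so a rerun overwrites the old file
        with open(f"page-{page_no}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["author", "quote"])  # header row
            for qts in response.xpath('//*[@class="quote"]'):
                writer.writerow([
                    qts.xpath("./span[2]/small/text()").get(),
                    qts.xpath('./*[@class="text"]/text()').get(),
                ])

        next_pg = response.xpath('//li[@class="next"]/a/@href').get()
        if next_pg is not None:
            yield response.follow(next_pg, cb_kwargs={"page_no": page_no + 1})

Note that this writes the files directly in parse and bypasses the feed export machinery entirely.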

What you can do with settings, however, is limit the number of items per file with the FEED_EXPORT_BATCH_ITEM_COUNT setting, which is supported since Scrapy 2.3.
Overwriting instead of appending to a file has also been possible since Scrapy 2.4: in FEEDS you can set overwrite to True, as shown below.

If you were to replace your custom_settings with the following, it would produce files with 10 items each, named page- followed by the batch_id, which starts at 1. So your first 3 files would be named page-1.csv, page-2.csv and page-3.csv.

    custom_settings = {
        'FEED_EXPORT_BATCH_ITEM_COUNT': 10,
        'FEEDS' : {
            'page-%(batch_id)d.csv' : {
                'format' : 'csv',
                'store_empty' : True,
                'overwrite': True
            }
        }
    }

Implementing it as a pipeline

If you wanted to implement this using an item pipeline, you could save the page number you are on in the dictionary you return, which then gets processed and removed by the item pipeline.

The pipeline in your pipelines.py (based on this example) could then look like this:

from scrapy.exporters import CsvItemExporter


class PerFilenameExportPipeline:
    """Distribute items across multiple CSV files according to their 'page' field"""

    def open_spider(self, spider):
        self.filename_to_exporter = {}

    def close_spider(self, spider):
        for exporter in self.filename_to_exporter.values():
            exporter.finish_exporting()

    def _exporter_for_item(self, item):
        filename = 'page-' + str(item['page_no'])
        del item['page_no']
        if filename not in self.filename_to_exporter:
            f = open(f'{filename}.csv', 'wb')
            exporter = CsvItemExporter(f)
            exporter.start_exporting()
            self.filename_to_exporter[filename] = exporter
        return self.filename_to_exporter[filename]

    def process_item(self, item, spider):
        exporter = self._exporter_for_item(item)
        exporter.export_item(item)
        return item

In your spider you would then need a way to keep track of the page you are on, as well as to set the pipeline in its custom_settings, which you could do like the following:

import scrapy
from ..pipelines import PerFilenameExportPipeline


class spidey(scrapy.Spider):
    name = "idk"
    custom_settings = {
        'ITEM_PIPELINES': {
            PerFilenameExportPipeline: 100
        }
    }
    
    def start_requests(self):
        yield scrapy.Request("http://quotes.toscrape.com/", cb_kwargs={'page_no': 1})

    def parse(self, response, page_no):
        for qts in response.xpath("//*[@class=\"quote\"]"):
            yield {
                'page_no': page_no,
                'author' : qts.xpath("./span[2]/small/text()").get(),
                'quote' : qts.xpath("./*[@class=\"text\"]/text()").get()
            }

        next_pg = response.xpath('//li[@class="next"]/a/@href').get()      
        if next_pg is not None:
            yield response.follow(next_pg, cb_kwargs={'page_no': page_no + 1})

However, there is one issue with this: the last file (page-10.csv) stays empty, for reasons beyond my comprehension. I have asked why that could be here.
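
I have not verified that this is actually the cause, but since CsvItemExporter's finish_exporting() does not close the file it writes to, one thing worth trying is to keep the opened file handles around and close them explicitly in close_spider so that any buffered rows get flushed. A sketch of that variant (same pipeline, only the bookkeeping changes):

from scrapy.exporters import CsvItemExporter


class PerFilenameExportPipeline:
    """Variant that also closes the files it opened"""

    def open_spider(self, spider):
        # filename -> (exporter, file handle), so files can be closed explicitly
        self.filename_to_exporter = {}

    def close_spider(self, spider):
        for exporter, f in self.filename_to_exporter.values():
            exporter.finish_exporting()
            f.close()  # flush any buffered rows to disk

    def _exporter_for_item(self, item):
        filename = 'page-' + str(item['page_no'])
        del item['page_no']
        if filename not in self.filename_to_exporter:
            f = open(f'{filename}.csv', 'wb')
            exporter = CsvItemExporter(f)
            exporter.start_exporting()
            self.filename_to_exporter[filename] = (exporter, f)
        return self.filename_to_exporter[filename][0]

    def process_item(self, item, spider):
        exporter = self._exporter_for_item(item)
        exporter.export_item(item)
        return item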

answered Nov 14 '22 by Patrick Klein