I want to save each quote from http://quotes.toscrape.com/ into a CSV file with two fields: author and quote. I also need to save the quotes into separate files, split by the page they reside on, i.e. page-1.csv, page-2.csv and so on. I have tried to achieve this by declaring feed exports in the custom_settings attribute of my spider, as shown below. This, however, doesn't even produce a file called page-1.csv. I am a total beginner with Scrapy, so please explain assuming I know little to nothing.
import scrapy
import urllib.parse

class spidey(scrapy.Spider):
    name = "idk"
    start_urls = [
        "http://quotes.toscrape.com/"
    ]
    custom_settings = {
        'FEEDS' : {
            'file://page-1.csv' : {  # edit: uri needs to be absolute path
                'format' : 'csv',
                'store_empty' : True
            }
        },
        'FEED_EXPORT_ENCODING' : 'utf-8',
        'FEED_EXPORT_FIELDS' : ['author', 'quote']
    }

    def parse(self, response):
        for qts in response.xpath("//*[@class=\"quote\"]"):
            author = qts.xpath("./span[2]/small/text()").get()
            quote = qts.xpath("./*[@class=\"text\"]/text()").get()
            yield {
                'author' : author,
                'quote' : quote
            }

        next_pg = response.xpath('//li[@class="next"]/a/@href').get()
        if next_pg is not None:
            next_pg = urllib.parse.urljoin(self.start_urls[0], next_pg)
            yield scrapy.Request(next_pg, self.parse)
How I ran the crawler: scrapy crawl idk
As an added question: I need my files to be overwritten rather than appended to, as happens when specifying the -o flag. Is it possible to do this without having to manually check for and delete preexisting files from the spider?
Saving your items into a file named after the page you found them on is (afaik) not supported in the settings. If you wanted to achieve this, you could build that functionality yourself with Python's open function and csv.writer in your parse method, as sketched below. An alternative option would be to write an item pipeline which manages different item exporters for different files.
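To illustrate the first option, here is a minimal sketch; the spider name, the page_no default and the header row are my own choices, not anything from your code. It opens one CSV file per page directly in parse and writes the rows with csv.writer, bypassing feed exports entirely:

import csv
import scrapy

class PerPageCsvSpider(scrapy.Spider):
    # hypothetical spider, only to sketch the open() + csv.writer approach
    name = "per_page_csv"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response, page_no=1):
        # write this page's quotes straight to page-<n>.csv;
        # nothing is yielded as an item, so feed exports are not involved
        with open(f"page-{page_no}.csv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(["author", "quote"])  # header row
            for qts in response.xpath('//*[@class="quote"]'):
                writer.writerow([
                    qts.xpath("./span[2]/small/text()").get(),
                    qts.xpath('./*[@class="text"]/text()').get(),
                ])

        next_pg = response.xpath('//li[@class="next"]/a/@href').get()
        if next_pg is not None:
            yield response.follow(next_pg, cb_kwargs={"page_no": page_no + 1})

Because the files are opened in "w" mode, re-running the spider overwrites them instead of appending, which also covers your second question for this approach.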
What you can do with settings, however, is limit the number of items per file with the FEED_EXPORT_BATCH_ITEM_COUNT setting, which is supported since Scrapy 2.3. Overwriting instead of appending to a file is also possible since Scrapy 2.4: in FEEDS you can set overwrite to True, as demonstrated below.
If you were to replace your custom_settings with the following, it would produce files of 10 items each, named page- followed by the batch_id, which starts at 1. So your first three files would be named page-1.csv, page-2.csv and page-3.csv.
custom_settings = {
    'FEED_EXPORT_BATCH_ITEM_COUNT': 10,
    'FEEDS' : {
        'page-%(batch_id)d.csv' : {
            'format' : 'csv',
            'store_empty' : True,
            'overwrite': True
        }
    }
}
If you wanted to implement this using an item pipeline, you could save the page number you are on in the dictionary you yield, which then gets processed and removed by the item pipeline. The pipeline in your pipelines.py (based on this example) could then look like this:
from scrapy.exporters import CsvItemExporter

class PerFilenameExportPipeline:
    """Distribute items across multiple CSV files according to their 'page_no' field"""

    def open_spider(self, spider):
        # one CsvItemExporter per output filename
        self.filename_to_exporter = {}

    def close_spider(self, spider):
        # let every exporter finalise its output when the spider is done
        for exporter in self.filename_to_exporter.values():
            exporter.finish_exporting()

    def _exporter_for_item(self, item):
        # derive the target filename from the item's page number and drop
        # the helper field so it does not end up as a column in the CSV
        filename = 'page-' + str(item['page_no'])
        del item['page_no']
        if filename not in self.filename_to_exporter:
            f = open(f'{filename}.csv', 'wb')
            exporter = CsvItemExporter(f)
            exporter.start_exporting()
            self.filename_to_exporter[filename] = exporter
        return self.filename_to_exporter[filename]

    def process_item(self, item, spider):
        exporter = self._exporter_for_item(item)
        exporter.export_item(item)
        return item
To your spider you would then need to add a way of tracking the page you are on, as well as enabling the pipeline in your custom_settings, which you could do like this:
import scrapy
from ..pipelines import PerFilenameExportPipeline

class spidey(scrapy.Spider):
    name = "idk"
    custom_settings = {
        'ITEM_PIPELINES': {
            PerFilenameExportPipeline: 100
        }
    }

    def start_requests(self):
        yield scrapy.Request("http://quotes.toscrape.com/", cb_kwargs={'page_no': 1})

    def parse(self, response, page_no):
        for qts in response.xpath("//*[@class=\"quote\"]"):
            yield {
                'page_no': page_no,
                'author' : qts.xpath("./span[2]/small/text()").get(),
                'quote' : qts.xpath("./*[@class=\"text\"]/text()").get()
            }

        next_pg = response.xpath('//li[@class="next"]/a/@href').get()
        if next_pg is not None:
            yield response.follow(next_pg, cb_kwargs={'page_no': page_no + 1})
However, there is one issue with this. The last file (page-10.csv) stays empty for reasons beyond my comprehension. I have asked why that could be here.
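One hedged guess, which I have not verified: the pipeline above never closes the underlying files, so the last file's write buffer might simply not get flushed before the process exits. A variant that keeps the file handles and closes them in close_spider, mirroring the exporter examples in the Scrapy docs, would look like this:

from scrapy.exporters import CsvItemExporter

class PerFilenameExportPipeline:
    """Same idea as above, but the file handles are kept and closed explicitly"""

    def open_spider(self, spider):
        # map each filename to an (exporter, file handle) pair
        self.filename_to_exporter = {}

    def close_spider(self, spider):
        for exporter, f in self.filename_to_exporter.values():
            exporter.finish_exporting()
            f.close()  # flush any buffered rows and release the file

    def _exporter_for_item(self, item):
        filename = 'page-' + str(item['page_no'])
        del item['page_no']
        if filename not in self.filename_to_exporter:
            f = open(f'{filename}.csv', 'wb')
            exporter = CsvItemExporter(f)
            exporter.start_exporting()
            self.filename_to_exporter[filename] = (exporter, f)
        return self.filename_to_exporter[filename][0]

    def process_item(self, item, spider):
        exporter = self._exporter_for_item(item)
        exporter.export_item(item)
        return item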