I'm trying to create a custom Scrapy Item Exporter based off JsonLinesItemExporter so I can slightly alter the structure it produces.
I have read the documentation here http://doc.scrapy.org/en/latest/topics/exporters.html but it doesn't state how to create a custom exporter, where to store it or how to link it to your Pipeline.
I have identified how to go custom with the Feed Exporters but this is not going to suit my requirements, as I want to call this exporter from my Pipeline.
Here is the code I've come up with, stored in a file called exporters.py in the root of the project:
from scrapy.contrib.exporter import JsonLinesItemExporter

class FanItemExporter(JsonLinesItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        self.encoder = ScrapyJSONEncoder(**kwargs)
        self.first_item = True

    def start_exporting(self):
        self.file.write("""{
'product': [""")

    def finish_exporting(self):
        self.file.write("]}")

    def export_item(self, item):
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(',\n')
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(self.encoder.encode(itemdict))
I have simply tried calling this from my pipeline by using FanItemExporter and trying variations of the import but it's not resulting in anything.
It is true that the Scrapy documentation does not clearly state where to place an Item Exporter. To use an Item Exporter, follow these steps:
1. Choose an Item Exporter class and import it into pipelines.py in the project directory. It can be a pre-defined Item Exporter (e.g. XmlItemExporter) or a user-defined one (like the FanItemExporter defined in the question).
2. Create an Item Pipeline class in pipelines.py and instantiate the imported Item Exporter in this class. Details are explained later in the answer.
3. Register this pipeline class in the settings.py file.
What follows is a detailed explanation of each step. The solution to the question is included in each step.
If using a pre-defined Item Exporter class, import it from the scrapy.exporters module.
Ex:
from scrapy.exporters import XmlItemExporter
If you need a custom exporter, define a custom class in a file. I suggest placing the class in an exporters.py file. Place this file in the project folder (where settings.py and items.py reside).
When creating a new subclass, BaseItemExporter is always a reasonable starting point, and it is the apt choice if we intend to change the functionality entirely. In this question, however, most of the required functionality is close to what Scrapy's built-in JSON exporters already provide.
Hence, I am attaching two versions of the same Item Exporter. One version extends the BaseItemExporter class and the other extends the JsonItemExporter class.
Version 1: Extending BaseItemExporter
Since BaseItemExporter is the parent class, start_exporting(), finish_exporting() and export_item() must be overridden to suit our needs.
from scrapy.exporters import BaseItemExporter
from scrapy.utils.serialize import ScrapyJSONEncoder
from scrapy.utils.python import to_bytes


class FanItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        # Pop the exporter options from kwargs; the rest go to the encoder
        self._configure(kwargs, dont_fail=True)
        self.file = file
        self.encoder = ScrapyJSONEncoder(**kwargs)
        self.first_item = True

    def start_exporting(self):
        # Double quotes around the key keep the output valid JSON
        self.file.write(b'{"product": [')

    def finish_exporting(self):
        self.file.write(b'\n]}')

    def export_item(self, item):
        # Separate items with a comma, except before the first one
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(b',\n')
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(to_bytes(self.encoder.encode(itemdict)))
Version 2: Extending JsonItemExporter
JsonItemExporter provides the exact same implementation of the export_item() method. Therefore, only the start_exporting() and finish_exporting() methods are overridden.
The implementation of JsonItemExporter can be seen in python_dir\pkgs\scrapy-1.1.0-py35_0\Lib\site-packages\scrapy\exporters.py
from scrapy.exporters import JsonItemExporter


class FanItemExporter(JsonItemExporter):

    def __init__(self, file, **kwargs):
        # Initialize the object using JsonItemExporter's constructor,
        # forwarding any exporter options that were passed in
        super().__init__(file, **kwargs)

    def start_exporting(self):
        # Double quotes around the key keep the output valid JSON
        self.file.write(b'{"product": [')

    def finish_exporting(self):
        self.file.write(b'\n]}')
Note: The standard Item Exporter classes expect binary files, so the file must be opened in binary mode (b). For the same reason, the write() calls in both versions write bytes to the file.
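The note above can be checked without Scrapy. The following is a minimal sketch using in-memory streams as stand-ins for files opened with 'wb' and 'w'; the helper name try_write is hypothetical and only exists for this illustration.

```python
import io


def try_write(stream, payload):
    """Return True if the write succeeds, False if the stream rejects the type."""
    try:
        stream.write(payload)
        return True
    except TypeError:
        return False


header = '{"product": ['

# Bytes into a binary stream: what the exporters above do.
binary_ok = try_write(io.BytesIO(), header.encode('utf-8'))

# Bytes into a text stream: what happens if the file is opened without 'b'.
text_fails = try_write(io.StringIO(), header.encode('utf-8'))

# A plain str into a binary stream is rejected as well.
str_into_binary = try_write(io.BytesIO(), header)

print(binary_ok, text_fails, str_into_binary)
```

Only the first write succeeds, which is why the pipeline should open its output file in binary mode and the exporter should write bytes.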
Creating an Item Pipeline class.
from project_name.exporters import FanItemExporter


class FanExportPipeline(object):

    def __init__(self, file_name):
        # Storing output filename
        self.file_name = file_name
        # Creating a file handle and setting it to None
        self.file_handle = None

    @classmethod
    def from_crawler(cls, crawler):
        # Getting the value of the FILE_NAME field from settings.py
        output_file_name = crawler.settings.get('FILE_NAME')

        # cls() calls FanExportPipeline's constructor
        # Returning a FanExportPipeline object
        return cls(output_file_name)

    def open_spider(self, spider):
        print('Custom export opened')

        # Opening file in binary-write mode
        file = open(self.file_name, 'wb')
        self.file_handle = file

        # Creating a FanItemExporter object and initiating export
        self.exporter = FanItemExporter(file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        print('Custom Exporter closed')

        # Ending the export to file from the FanItemExporter object
        self.exporter.finish_exporting()

        # Closing the opened output file
        self.file_handle.close()

    def process_item(self, item, spider):
        # Passing the item to the FanItemExporter object for exporting to file
        self.exporter.export_item(item)
        return item
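To see the order in which these hooks fire, here is a rough, Scrapy-free sketch. StubExporter and StubPipeline are hypothetical stand-ins that mirror the shape of the exporter and pipeline in this answer (json.dumps replaces ScrapyJSONEncoder), and run_lifecycle is an assumed helper that calls the hooks by hand in the order Scrapy calls them: open_spider once, process_item per item, close_spider once.

```python
import json
import os
import tempfile


class StubExporter:
    """Hypothetical stand-in for FanItemExporter (no Scrapy required)."""

    def __init__(self, file):
        self.file = file
        self.first_item = True

    def start_exporting(self):
        self.file.write(b'{"product": [')

    def export_item(self, item):
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(b',\n')
        self.file.write(json.dumps(dict(item)).encode('utf-8'))

    def finish_exporting(self):
        self.file.write(b'\n]}')


class StubPipeline:
    """Same open/process/close shape as the pipeline in this answer."""

    def __init__(self, file_name):
        self.file_name = file_name
        self.file_handle = None

    def open_spider(self, spider):
        self.file_handle = open(self.file_name, 'wb')
        self.exporter = StubExporter(self.file_handle)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file_handle.close()


def run_lifecycle(items):
    """Drive the hooks by hand and return the parsed output file."""
    fd, path = tempfile.mkstemp(suffix='.json')
    os.close(fd)
    try:
        pipeline = StubPipeline(path)
        pipeline.open_spider(spider=None)
        for item in items:
            pipeline.process_item(item, spider=None)
        pipeline.close_spider(spider=None)
        with open(path, 'rb') as fh:
            return json.loads(fh.read().decode('utf-8'))
    finally:
        os.remove(path)


result = run_lifecycle([{'name': 'fan A'}, {'name': 'fan B'}])
print(result)
```

Because finish_exporting() runs in close_spider, the closing bracket is written only once the crawl ends; reading the file mid-crawl would yield incomplete JSON.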
Now that the Item Export Pipeline is defined, register this pipeline in the settings.py file. Also add the field FILE_NAME to settings.py. This field contains the filename of the output file.
Add the following lines to the settings.py file.
FILE_NAME = 'path/outputfile.ext'
ITEM_PIPELINES = {
    'project_name.pipelines.FanExportPipeline': 600,
}
If ITEM_PIPELINES is already uncommented, then add the following line to the ITEM_PIPELINES dictionary.
'project_name.pipelines.FanExportPipeline': 600,
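The number 600 is the pipeline's order value: Scrapy passes items through pipelines from the lowest value to the highest, with values customarily in the 0-1000 range. A small sketch of that ordering follows; CleanupPipeline is a hypothetical second pipeline added only for illustration, and run_order is an assumed helper, not a Scrapy function.

```python
# Hypothetical settings: only FanExportPipeline comes from this answer.
ITEM_PIPELINES = {
    'project_name.pipelines.FanExportPipeline': 600,
    'project_name.pipelines.CleanupPipeline': 300,  # hypothetical extra pipeline
}


def run_order(pipelines):
    """Return pipeline paths sorted the way items flow: lowest value first."""
    return sorted(pipelines, key=pipelines.get)


order = run_order(ITEM_PIPELINES)
print(order)
```

Giving the export pipeline a higher value than pipelines that clean or validate items means it writes items only after those earlier stages have processed them.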
This is one way to create a custom Item Export pipeline.