How to create custom Scrapy Item Exporter?

Tags: python, json, scrapy

I'm trying to create a custom Scrapy Item Exporter based off JsonLinesItemExporter so I can slightly alter the structure it produces.

I have read the documentation here http://doc.scrapy.org/en/latest/topics/exporters.html but it doesn't state how to create a custom exporter, where to store it or how to link it to your Pipeline.

I have identified how to go custom with the Feed Exporters but this is not going to suit my requirements, as I want to call this exporter from my Pipeline.

Here is the code I've come up with, which is stored in a file called exporters.py in the root of the project:


from scrapy.contrib.exporter import JsonLinesItemExporter

class FanItemExporter(JsonLinesItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        self.encoder = ScrapyJSONEncoder(**kwargs)
        self.first_item = True

    def start_exporting(self):
        self.file.write("""{
            'product': [""")

    def finish_exporting(self):
        self.file.write("]}")

    def export_item(self, item):
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(',\n')
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(self.encoder.encode(itemdict))

I have simply tried calling this from my pipeline by using FanItemExporter and trying variations of the import but it's not resulting in anything.


asked Oct 22 '15 21:10

bnussey

1 Answer

It is true that the Scrapy documentation does not clearly state where to place an Item Exporter. To use an Item Exporter, these are the steps to follow.

1. Choose an Item Exporter class and import it into pipelines.py in the project directory. It can be a pre-defined Item Exporter (e.g. XmlItemExporter) or a user-defined one (like the FanItemExporter defined in the question).
2. Create an Item Pipeline class in pipelines.py and instantiate the imported Item Exporter in it. The details are explained later in this answer.
3. Register this pipeline class in settings.py.

Each step is explained in detail below, together with the part of the solution it contributes.

Step 1

If using a pre-defined Item Exporter class, import it from the scrapy.exporters module. Ex: from scrapy.exporters import XmlItemExporter

If you need a custom exporter, define a custom class in its own file. I suggest placing the class in exporters.py, inside the project folder (where settings.py and items.py reside).

When creating a new subclass, BaseItemExporter is the natural parent if we intend to change the functionality entirely. In this question, however, most of the required functionality already exists in JsonItemExporter. (The question subclasses JsonLinesItemExporter, but that class writes one item per line with no separating commas, whereas the desired output is a comma-separated list.)

Hence, I am attaching two versions of the same Item Exporter: one extends the BaseItemExporter class and the other extends the JsonItemExporter class.

Version 1: Extending BaseItemExporter

Since BaseItemExporter is the parent class, start_exporting(), finish_exporting() and export_item() must all be overridden to suit our needs.

from scrapy.exporters import BaseItemExporter
from scrapy.utils.serialize import ScrapyJSONEncoder
from scrapy.utils.python import to_bytes

class FanItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        self.encoder = ScrapyJSONEncoder(**kwargs)
        self.first_item = True

    def start_exporting(self):
        # JSON requires double-quoted keys, so write "product" rather than 'product'
        self.file.write(b'{"product": [')

    def finish_exporting(self):
        self.file.write(b'\n]}')

    def export_item(self, item):
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(b',\n')
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(to_bytes(self.encoder.encode(itemdict)))

Version 2: Extending JsonItemExporter

JsonItemExporter already provides the exact same implementation of the export_item() method. Therefore only the start_exporting() and finish_exporting() methods are overridden.

The implementation of JsonItemExporter can be seen in the file python_dir\pkgs\scrapy-1.1.0-py35_0\Lib\site-packages\scrapy\exporters.py

from scrapy.exporters import JsonItemExporter

class FanItemExporter(JsonItemExporter):

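    def __init__(self, file, **kwargs):
        # Initialize the object using JsonItemExporter's constructor,
        # forwarding any keyword arguments to it
        super().__init__(file, **kwargs)

    def start_exporting(self):
        # JSON requires double-quoted keys, so write "product" rather than 'product'
        self.file.write(b'{"product": [')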

    def finish_exporting(self):
        self.file.write(b'\n]}')

Note: The standard Item Exporter classes expect binary files, so the output file must be opened in binary mode (b). For the same reason, the write() calls in both versions write bytes to the file.
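
To make the byte-level output concrete, here is a minimal sketch that needs no Scrapy install. WrappedJsonWriter and export_all are hypothetical stand-ins, not Scrapy classes: they mimic the three exporter methods using the standard library's json module in place of ScrapyJSONEncoder, and they assume a double-quoted "product" key so that the result parses as JSON.

```python
import io
import json


class WrappedJsonWriter:
    """Hypothetical stand-in for FanItemExporter (not part of Scrapy)."""

    def __init__(self, file):
        self.file = file          # must be a binary file-like object
        self.first_item = True

    def start_exporting(self):
        self.file.write(b'{"product": [')

    def finish_exporting(self):
        self.file.write(b'\n]}')

    def export_item(self, item):
        # Write a separator before every item except the first one.
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(b',\n')
        # Assumption: json.dumps replaces ScrapyJSONEncoder in this sketch.
        self.file.write(json.dumps(dict(item)).encode('utf-8'))


def export_all(items):
    """Run the full start/export/finish cycle and return the raw bytes."""
    buffer = io.BytesIO()
    writer = WrappedJsonWriter(buffer)
    writer.start_exporting()
    for item in items:
        writer.export_item(item)
    writer.finish_exporting()
    return buffer.getvalue()


raw = export_all([{'name': 'fan A'}, {'name': 'fan B'}])
parsed = json.loads(raw)  # succeeds only if the wrapper is valid JSON
```

Passing a text-mode file (or io.StringIO) instead of io.BytesIO would raise a TypeError on the first write, which is the reason for the binary-mode requirement.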

Step 2

Create an Item Pipeline class in pipelines.py.

from project_name.exporters import FanItemExporter

class FanExportPipeline(object):
    def __init__(self, file_name):
        # Storing the output filename
        self.file_name = file_name
        # File handle for the output file; it is opened in open_spider()
        self.file_handle = None

    @classmethod
    def from_crawler(cls, crawler):
        # getting the value of FILE_NAME field from settings.py
        output_file_name = crawler.settings.get('FILE_NAME')

        # cls() calls FanExportPipeline's constructor
        # Returning a FanExportPipeline object
        return cls(output_file_name)

    def open_spider(self, spider):
        print('Custom export opened')

        # Opening file in binary-write mode
        file = open(self.file_name, 'wb')
        self.file_handle = file

        # Creating a FanItemExporter object and initiating export
        self.exporter = FanItemExporter(file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        print('Custom Exporter closed')

        # Ending the export to file from the FanItemExporter object
        self.exporter.finish_exporting()

        # Closing the opened output file
        self.file_handle.close()

    def process_item(self, item, spider):
        # Passing the item to the FanItemExporter object for exporting to file
        self.exporter.export_item(item)
        return item
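
The order in which the pipeline hooks fire can be checked without running a crawl. The sketch below is only an illustration under stated assumptions: MiniExportPipeline and run_fake_crawl are hypothetical names, the JSON is written inline rather than through FanItemExporter, and a SimpleNamespace holding a plain dict stands in for the real crawler and its settings object.

```python
import json
import os
import tempfile
from types import SimpleNamespace


class MiniExportPipeline:
    """Hypothetical, Scrapy-free pipeline exposing the same four hooks."""

    def __init__(self, file_name):
        self.file_name = file_name
        self.file_handle = None
        self.first_item = True

    @classmethod
    def from_crawler(cls, crawler):
        # In this sketch crawler.settings is a plain dict, not Scrapy's Settings
        return cls(crawler.settings.get('FILE_NAME'))

    def open_spider(self, spider):
        self.file_handle = open(self.file_name, 'wb')
        self.file_handle.write(b'{"product": [')

    def process_item(self, item, spider):
        if not self.first_item:
            self.file_handle.write(b',\n')
        self.first_item = False
        self.file_handle.write(json.dumps(dict(item)).encode('utf-8'))
        return item

    def close_spider(self, spider):
        self.file_handle.write(b'\n]}')
        self.file_handle.close()


def run_fake_crawl(items):
    """Call the hooks in crawl order and return the parsed output file."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, 'products.json')
        fake_crawler = SimpleNamespace(settings={'FILE_NAME': path})
        pipeline = MiniExportPipeline.from_crawler(fake_crawler)
        pipeline.open_spider(spider=None)          # once, at spider start
        for item in items:
            pipeline.process_item(item, spider=None)  # once per scraped item
        pipeline.close_spider(spider=None)         # once, at spider end
        with open(path, 'rb') as handle:
            return json.load(handle)


result = run_fake_crawl([{'name': 'fan A'}, {'name': 'fan B'}])
```

The closing bracket is written only in close_spider, so the file is valid JSON only after the spider has finished.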

Step 3

Now that the Item Export Pipeline is defined, register it in settings.py. Also add a FILE_NAME setting to settings.py; it holds the path of the output file.

Add the following lines to settings.py.

FILE_NAME = 'path/outputfile.ext'
ITEM_PIPELINES = {
    'project_name.pipelines.FanExportPipeline': 600,
}

If ITEM_PIPELINES is already defined (uncommented) in settings.py, add the following line to the existing ITEM_PIPELINES dictionary instead.

'project_name.pipelines.FanExportPipeline': 600,
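
The integer assigned to each pipeline controls the order in which pipelines run: items pass through lower-valued pipelines first, and the values are customarily kept in the 0-1000 range. A small sketch of that ordering, in which CleanItemPipeline is a hypothetical second pipeline that is not part of the original project:

```python
# settings.py-style dictionary; only FanExportPipeline comes from this answer
ITEM_PIPELINES = {
    'project_name.pipelines.FanExportPipeline': 600,
    'project_name.pipelines.CleanItemPipeline': 300,  # hypothetical earlier stage
}

# Sorting by value reproduces the order in which items visit the pipelines
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
for position, path in enumerate(run_order, start=1):
    print(position, path)
```

Giving the export pipeline the larger number means it sees each item after any earlier clean-up stage has processed it.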

This is one way to create a custom Item Export pipeline.

answered Sep 19 '22 18:09

pbskumar
