Scrapy: How to output items in a specific json format

I export the scraped data as JSON. The default Scrapy exporter writes the items as a list of dicts, so the output looks like this:

[{"Product Name":"Product1", "Categories":["Clothing","Top"], "Price":"20.5", "Currency":"USD"},
{"Product Name":"Product2", "Categories":["Clothing","Top"], "Price":"21.5", "Currency":"USD"},
{"Product Name":"Product3", "Categories":["Clothing","Top"], "Price":"22.5", "Currency":"USD"},
{"Product Name":"Product4", "Categories":["Clothing","Top"], "Price":"23.5", "Currency":"USD"}, ...]

But I want to export the data in a specific format like this:

{
"Shop Name":"Shop 1",
"Location":"XXXXXXXXX",
"Contact":"XXXX-XXXXX",
"Products":
[{"Product Name":"Product1", "Categories":["Clothing","Top"], "Price":"20.5", "Currency":"USD"},
{"Product Name":"Product2", "Categories":["Clothing","Top"], "Price":"21.5", "Currency":"USD"},
{"Product Name":"Product3", "Categories":["Clothing","Top"], "Price":"22.5", "Currency":"USD"},
{"Product Name":"Product4", "Categories":["Clothing","Top"], "Price":"23.5", "Currency":"USD"}, ...]
}

Please advise on any solution. Thank you.

bbanzzakji asked Mar 26 '17 00:03


2 Answers

This is well documented in the Scrapy documentation. You can write an item pipeline that uses JsonItemExporter:

from scrapy.exporters import JsonItemExporter


class ItemPipeline(object):

    file = None

    def open_spider(self, spider):
        # JsonItemExporter writes bytes, so open the file in binary mode.
        self.file = open('item.json', 'wb')
        self.exporter = JsonItemExporter(self.file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        # Close the JSON array before closing the file.
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

This will create a file named item.json containing all of your items as a single JSON array.
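
If you need the wrapped layout from the question (shop fields at the top level and the items under "Products"), one option is to buffer the items and write a single object when the spider closes. The following is only a minimal sketch: the class name ShopJsonPipeline, the shop_info values and the output_path file name are hypothetical placeholders, and it assumes the item list is small enough to hold in memory.

```python
import json


class ShopJsonPipeline(object):
    """Buffers items and writes one wrapping JSON object at the end."""

    # Hypothetical shop metadata; replace with real values or take
    # them from the spider.
    shop_info = {
        "Shop Name": "Shop 1",
        "Location": "XXXXXXXXX",
        "Contact": "XXXX-XXXXX",
    }
    output_path = "shop.json"  # hypothetical output file name

    def open_spider(self, spider):
        # Start with an empty buffer for each crawl.
        self.products = []

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item objects and for plain dicts.
        self.products.append(dict(item))
        return item

    def close_spider(self, spider):
        # Copy the metadata and attach the buffered items last, so
        # "Products" is the final key in the output.
        data = dict(self.shop_info)
        data["Products"] = self.products
        with open(self.output_path, "w") as f:
            json.dump(data, f, indent=4)
```

As with any pipeline, it has to be enabled through the ITEM_PIPELINES setting of your project. Because nothing is written until close_spider runs, the file is only produced when the crawl finishes normally.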

Gocht answered Oct 31 '22 12:10


I was trying to export pretty printed JSON and this is what worked for me.

I created a pipeline that looked like this:

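import json


class JsonPipeline(object):

    def open_spider(self, spider):
        # Text mode, because json.dumps returns str.
        self.file = open('your_file_name.json', 'w')
        self.file.write("[")
        self.first_item = True

    def close_spider(self, spider):
        self.file.write("\n]")
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(
            dict(item),
            sort_keys=True,
            indent=4,
            separators=(',', ': ')
        )
        # Write the comma before every item except the first, so the
        # array has no trailing comma and stays valid JSON.
        if not self.first_item:
            self.file.write(",\n")
        self.first_item = False
        self.file.write(line)
        return item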

It's similar to the example in the Scrapy docs (https://doc.scrapy.org/en/latest/topics/item-pipeline.html), except that it writes the items as one JSON array and prints each JSON property indented on its own line.

See the part about pretty printing here https://docs.python.org/2/library/json.html
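
To see what those json.dumps arguments do on their own, here is a small standalone sketch. The product dict is made-up sample data in the shape of the items from the question.

```python
import json

# Sample item shaped like the ones in the question.
product = {"Product Name": "Product1", "Price": "20.5", "Currency": "USD"}

# Default output: everything on a single line.
compact = json.dumps(product)

# Pretty-printed: keys sorted, 4-space indent, one property per line.
pretty = json.dumps(product, sort_keys=True, indent=4, separators=(',', ': '))

print(compact)
print(pretty)
```

Both strings parse back to the same dict; only the whitespace and key order differ.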

Max answered Oct 31 '22 14:10