 

Buffered pipeline using Scrapy

Tags: python, scrapy

I'm currently scraping a website using Scrapy 0.24. The website has the following URL format:

www.site.com?category={0}&item={1}&page={2}

I have a MySQLStorePipeline which is responsible for storing each scraped item in the database. But I have 80 categories, 10 items and 15 pages, which results in 80 * 10 * 15 = 12,000 pages. For each page I yield 25 scrapy.Items, which gives us 25 * 12,000 = 300,000 rows in the database.

So, every time the pipeline receives an item, it inserts it into the database, which is not efficient. I'm looking for a way to buffer the pipeline items and, for example, execute a bulk insert once 1000 items have accumulated. How can I achieve that?

asked Dec 14 '22 by Doon

1 Answer

Have the pipeline store items in a list, insert them once the list reaches a certain length, and flush whatever remains when the spider closes.

class Pipeline(object):
    def __init__(self):
        super(Pipeline, self).__init__()
        self.items = []  # buffer of items waiting to be written

    def process_item(self, item, spider):
        self.items.append(item)
        # flush the buffer once 1000 items have accumulated
        if len(self.items) >= 1000:
            self.insert_current_items()
        return item

    def insert_current_items(self):
        # swap the buffer out before inserting, so the list is
        # empty again even if the insert raises
        items = self.items
        self.items = []
        self.insert_to_database(items)

    def close_spider(self, spider):
        # flush the remaining items when the spider finishes
        self.insert_current_items()
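
The insert_to_database method is left to you; it is where the actual bulk INSERT belongs. As a minimal sketch, assuming the MySQLdb driver and a hypothetical table items(category, item, page, value); the connection settings, table, and column names are placeholders to adapt:

import MySQLdb

class MySQLStorePipeline(Pipeline):
    # reuses the buffering logic from the class above,
    # filling in the missing bulk insert
    def __init__(self):
        super(MySQLStorePipeline, self).__init__()
        # placeholder credentials; use your real settings
        self.conn = MySQLdb.connect(host='localhost', user='scrapy',
                                    passwd='secret', db='scraping',
                                    charset='utf8')

    def insert_to_database(self, items):
        # executemany() batches all rows into a single statement,
        # which is far cheaper than 1000 single-row INSERTs
        cursor = self.conn.cursor()
        cursor.executemany(
            "INSERT INTO items (category, item, page, value)"
            " VALUES (%s, %s, %s, %s)",
            [(i['category'], i['item'], i['page'], i['value'])
             for i in items])
        self.conn.commit()
        cursor.close()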
answered Dec 30 '22 by Artur Gaspar
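
Note that a pipeline only runs if it is enabled in settings.py; for example, with a placeholder module path:

ITEM_PIPELINES = {
    'myproject.pipelines.MySQLStorePipeline': 300,
}

Scrapy calls close_spider automatically when the spider finishes, so the final partial batch is flushed without any extra wiring.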