I'm currently scraping a website using Scrapy 0.24. The website has URLs of the following format:
www.site.com?category={0}&item={1}&page={2}
I have a MySQLStorePipeline which is responsible for storing each scraped item in the database. But I have 80 categories, 10 items and 15 pages, which results in 80 * 10 * 15 = 12000 pages. On each page I yield 25 scrapy.Item objects, which gives 25 * 12000 = 300000 rows in the database.
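For scale, those URLs could be generated with something like the sketch below; the spider name, domain scheme, and parameter ranges are placeholders, not the actual spider code:

import itertools
import scrapy

class SiteSpider(scrapy.Spider):
    # Hypothetical name; the real spider is not shown in the question.
    name = "site"

    def start_requests(self):
        url = "http://www.site.com/?category={0}&item={1}&page={2}"
        # 80 categories x 10 items x 15 pages = 12000 request URLs.
        for category, item, page in itertools.product(range(80), range(10), range(15)):
            yield scrapy.Request(url.format(category, item, page))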
So, every time the pipeline receives an item, it inserts it into the database, which is not efficient. I'm looking for a way to buffer the pipeline items and, for example, execute a bulk insert once 1000 items have accumulated. How can I achieve that?
Have the pipeline store items in a list, insert them when the list reaches a certain length, and flush whatever remains when the spider closes.
class Pipeline(object):
    def __init__(self):
        super(Pipeline, self).__init__()
        # Buffer of items waiting to be written to the database.
        self.items = []

    def process_item(self, item, spider):
        self.items.append(item)
        # Flush once the buffer reaches 1000 items.
        if len(self.items) >= 1000:
            self.insert_current_items()
        return item

    def insert_current_items(self):
        # Swap the buffer out before inserting, then write the batch.
        items = self.items
        self.items = []
        self.insert_to_database(items)

    def close_spider(self, spider):
        # Flush whatever is left when the spider finishes.
        self.insert_current_items()
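insert_to_database is left to you. A minimal sketch using MySQLdb and executemany might look like the following; the connection settings, table name, and column names are assumptions, so adapt them to your own MySQLStorePipeline and schema:

import MySQLdb

    # inside the Pipeline class above
    def insert_to_database(self, items):
        # Hypothetical connection settings and table/column names.
        connection = MySQLdb.connect(host="localhost", user="user", passwd="secret",
                                     db="scraping", charset="utf8")
        try:
            cursor = connection.cursor()
            # One executemany call sends the whole batch instead of one INSERT per item.
            cursor.executemany(
                "INSERT INTO page_items (category, item, page, value) "
                "VALUES (%s, %s, %s, %s)",
                [(i["category"], i["item"], i["page"], i["value"]) for i in items])
            connection.commit()
        finally:
            connection.close()

Opening a connection per batch keeps the sketch short; in practice you would open it once in open_spider (or __init__) and reuse it for every batch.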