What is the best way to close the SQLite cursor and connection when writing data into SQLite with Scrapy pipelines?

Tags: sqlite, scrapy

quotes.py is the spider file.

import scrapy
from project.items import ProjectItem

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/page/1']

    def parse(self, response):
        for quote in response.css('div.quote'):
            # create a fresh item for every quote instead of reusing one instance
            item = ProjectItem()
            item['quote'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.xpath('span/small/text()').extract_first()
            yield item

        # follow the pagination link until there is no next page
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)

The author and quote on every page are extracted into the item.

With the following pipelines.py, item['author'] and item['quote'] can be written into /tmp/test.sqlite.

import sqlite3

class ProjectPipeline(object):
    def __init__(self):
        self.db = r'/tmp/test.sqlite'
        self.table = 'data'
        self.con = sqlite3.connect(self.db)
        self.cur = self.con.cursor()
        # "if not exists" avoids an error when the table is already there on a later run
        self.cur.execute("create table if not exists {table} (author TEXT, quote TEXT)".format(table=self.table))

    def __del__(self):
        self.cur.close()
        self.con.close()

    def process_item(self, item, spider):
        self.cur.execute("INSERT INTO {table} VALUES (?, ?);".format(table=self.table),
                         (item['author'], item['quote']))
        self.con.commit()
        return item
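
(The pipeline is only invoked when it is enabled in settings.py; a minimal sketch, assuming the project package is named project, as the import in quotes.py suggests:)

# settings.py -- enable the pipeline; the module path "project.pipelines"
# is an assumption based on "from project.items import ProjectItem"
ITEM_PIPELINES = {
    'project.pipelines.ProjectPipeline': 300,
}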

There is a small shortcoming: while the spider runs (scrapy crawl quotes), the temporary SQLite journal file test.sqlite-journal is opened and closed, opened and closed, continuously, as can be seen by watching the /tmp directory during the spider's run.
I want a better way to get this done, for example:
1. Not committing after every item?
2. Opening SQLite only once, writing all the data, and then closing it (a rough sketch of this idea is shown below).
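
For reference, option 2 could look roughly like the following untested sketch. It uses Scrapy's open_spider/close_spider pipeline hooks instead of __init__/__del__, keeps every row in memory, and writes everything in one go when the spider finishes; the class name SingleWritePipeline is made up for illustration.

import sqlite3

class SingleWritePipeline(object):
    # hypothetical pipeline illustrating option 2: one connection, one write

    def open_spider(self, spider):
        self.rows = []

    def process_item(self, item, spider):
        self.rows.append((item['author'], item['quote']))
        return item

    def close_spider(self, spider):
        con = sqlite3.connect('/tmp/test.sqlite')
        cur = con.cursor()
        cur.execute("create table if not exists data (author TEXT, quote TEXT)")
        cur.executemany("INSERT INTO data VALUES (?, ?);", self.rows)
        con.commit()
        cur.close()
        con.close()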

asked Sep 07 '17 by showkey
1 Answer

Personally I see nothing wrong with your current approach of committing after every item. A commit should close a logical unit of work (and a rollback, on the other hand, should discard it in case of an error). In my eyes, the scraped items are independent and can be thought of as logical units, so a commit per item is legitimate. (I would even say desirable, as it prevents data loss in case of unexpected errors.)

Sure, you can also do it as you suggest and store all the data once at the end. But for that you have to keep the data in memory in the meantime, and depending on the items' size and count, that can be quite a lot of data. You can, however, choose a compromise: use a buffer of known size and commit the buffered items whenever the buffer is full. Look at this example of your modified pipeline class:

import sqlite3

class ProjectPipeline(object):
    def __init__(self):
        self.db = r'/tmp/test.sqlite'
        self.table = 'data'
        self.buff = list()       # buffered rows waiting to be written
        self.buff_size = 20      # commit once this many items are buffered
        self.con = sqlite3.connect(self.db)
        self.cur = self.con.cursor()
        self.cur.execute("create table if not exists {table} (author TEXT, quote TEXT)".format(table=self.table))

    def __del__(self):
        # flush whatever is still in the buffer before closing the connection
        if len(self.buff) > 0:
            self.cur.executemany("INSERT INTO {table} VALUES (?, ?);".format(table=self.table), self.buff)
            self.con.commit()
        self.cur.close()
        self.con.close()

    def process_item(self, item, spider):
        self.buff.append((item['author'], item['quote']))
        if len(self.buff) == self.buff_size:
            self.cur.executemany("INSERT INTO {table} VALUES (?, ?);".format(table=self.table), self.buff)
            self.con.commit()
            del self.buff[:]     # clear the buffer
        return item

The buffer size is set to 20 items, so after every 20 items they get committed to the database. The best approach, though, is to store such settings (together with the database name, etc.) in settings.py.
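
For example, here is a minimal sketch of moving those values into settings.py and reading them through Scrapy's standard from_crawler hook; the setting names SQLITE_DB, SQLITE_TABLE and SQLITE_BUFF_SIZE are made up for illustration.

# settings.py -- hypothetical setting names
SQLITE_DB = '/tmp/test.sqlite'
SQLITE_TABLE = 'data'
SQLITE_BUFF_SIZE = 20

# pipelines.py
import sqlite3

class ProjectPipeline(object):
    def __init__(self, db, table, buff_size):
        self.db = db
        self.table = table
        self.buff = list()
        self.buff_size = buff_size
        self.con = sqlite3.connect(self.db)
        self.cur = self.con.cursor()
        self.cur.execute("create table if not exists {table} (author TEXT, quote TEXT)".format(table=self.table))

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.settings exposes everything defined in settings.py
        return cls(
            db=crawler.settings.get('SQLITE_DB', '/tmp/test.sqlite'),
            table=crawler.settings.get('SQLITE_TABLE', 'data'),
            buff_size=crawler.settings.getint('SQLITE_BUFF_SIZE', 20),
        )

    # process_item() and __del__() stay exactly as in the pipeline above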

answered Sep 30 '22 by Tomáš Linhart