quotes.py is the spider file.
import scrapy

from project.items import ProjectItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/page/1']

    def parse(self, response):
        for quote in response.css('div.quote'):
            # Create a fresh item for every quote block on the page.
            item = ProjectItem()
            item['quote'] = quote.css('span.text::text').extract_first()
            item['author'] = quote.xpath('span/small/text()').extract_first()
            yield item

        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
The author and quote from every page are extracted into the item.
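The spider imports ProjectItem from project/items.py, which is not shown here; given the item['quote'] and item['author'] keys used in parse() and in the pipeline, it is presumably a minimal item definition along these lines:

import scrapy


class ProjectItem(scrapy.Item):
    # Field names must match the keys assigned in the spider and read in the pipeline.
    quote = scrapy.Field()
    author = scrapy.Field()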
With the following pipelines.py, item['author'] and item['quote'] can be written into /tmp/test.sqlite.
import sqlite3


class ProjectPipeline(object):
    def __init__(self):
        self.db = r'/tmp/test.sqlite'
        self.table = 'data'
        self.con = sqlite3.connect(self.db)
        self.cur = self.con.cursor()
        self.cur.execute("create table {table} (author TEXT,quote TEXT)".format(table=self.table))

    def __del__(self):
        self.cur.close()
        self.con.close()

    def process_item(self, item, spider):
        self.cur.execute("INSERT INTO {table} VALUES(?,?);".format(table=self.table), (item['author'], item['quote']))
        self.con.commit()
        return item
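For this pipeline to receive items, it also has to be enabled in settings.py. Assuming the project package is called project (as the from project.items import suggests), the relevant setting would look roughly like this:

# settings.py -- register the pipeline; 300 is an arbitrary priority value
ITEM_PIPELINES = {
    'project.pipelines.ProjectPipeline': 300,
}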
There is a small shortcoming when running the spider with scrapy crawl quotes: the temporary SQLite journal file test.sqlite-journal is continuously created and removed (opened and closed, over and over) in the /tmp directory for the whole time the spider is running.
I want a better way to get this done. Would it be better to:
1. not commit after every item?
2. open SQLite only once, write all the data, and then close it?
Personally, I see nothing wrong with your current approach of committing after every item. A commit should close a logical unit of work (and a rollback, on the other hand, should discard it in case of an error). In my eyes, scraped items are independent and can be thought of as logical units, so a commit per item is legitimate. (I would even say desirable, as it prevents data loss in case of unexpected errors.)
Sure, you can do it as you suggest and store all the data once at the end. But for that, you have to keep it in memory in the meantime, and depending on the items' size and count, that can be quite a lot of data. You can, however, choose a compromise: use a buffer of a known size and commit the buffered items whenever the buffer is full. Look at this example of your modified pipeline class:
import sqlite3


class ProjectPipeline(object):
    def __init__(self):
        self.db = r'/tmp/test.sqlite'
        self.table = 'data'
        self.buff = list()
        self.buff_size = 20
        self.con = sqlite3.connect(self.db)
        self.cur = self.con.cursor()
        self.cur.execute("create table {table} (author TEXT,quote TEXT)".format(table=self.table))

    def __del__(self):
        # Flush whatever is left in the buffer before closing the connection.
        if len(self.buff) > 0:
            self.cur.executemany("INSERT INTO {table} VALUES(?,?);".format(table=self.table), self.buff)
            self.con.commit()
        self.cur.close()
        self.con.close()

    def process_item(self, item, spider):
        self.buff.append((item['author'], item['quote']))
        if len(self.buff) == self.buff_size:
            # Buffer is full -- write all buffered items in a single transaction.
            self.cur.executemany("INSERT INTO {table} VALUES(?,?);".format(table=self.table), self.buff)
            self.con.commit()
            del self.buff[:]
        return item
The buffer size is set to 20 items, so after every 20 items they get committed to the database. It's best to store such settings (together with the database name, etc.) in settings.py, though.
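A minimal sketch of how the pipeline could read those values from settings.py via Scrapy's from_crawler hook; the setting names SQLITE_DB, SQLITE_TABLE and SQLITE_BUFF_SIZE are made up for illustration, and the rest of the class stays as above:

import sqlite3


class ProjectPipeline(object):
    def __init__(self, db, table, buff_size):
        self.db = db
        self.table = table
        self.buff = list()
        self.buff_size = buff_size
        self.con = sqlite3.connect(self.db)
        self.cur = self.con.cursor()
        self.cur.execute("create table {table} (author TEXT,quote TEXT)".format(table=self.table))

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to construct the pipeline; the setting names are
        # assumptions and fall back to the hard-coded values used earlier.
        return cls(
            db=crawler.settings.get('SQLITE_DB', r'/tmp/test.sqlite'),
            table=crawler.settings.get('SQLITE_TABLE', 'data'),
            buff_size=crawler.settings.getint('SQLITE_BUFF_SIZE', 20),
        )

    # __del__ and process_item are unchanged from the buffered version above.

With that in place, the database path, table name and buffer size can be tuned in settings.py without touching the pipeline code.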