Efficiency of reopening sqlite database after each query

Tags: python, sqlite

I'm currently working on a web server in Tornado, but I'm having issues with different bits of code trying to access the database at once.

I've simplified this down to a query function which basically does this (the real one is slightly more advanced):

import sqlite3

def query(command, arguments=[]):
    # Open a fresh connection for every query, run it, fetch everything, then close.
    db = sqlite3.connect("models/data.db")
    cursor = db.cursor()
    cursor.execute(command, arguments)
    result = cursor.fetchall()
    db.close()
    return result

I'm just wondering how efficient it is to reopen the database after every query (I'd guess it's a fairly expensive constant-time operation, unless something gets cached?), and whether there's a better way to do this.

matts asked Jan 24 '13



2 Answers

I'm adding my own answer because I disagree with the currently accepted one. It states that the operation is not thread-safe, but this is plain wrong - SQLite uses file locking appropriate to its current platform to ensure that all accesses comply with ACID.

On Unix systems this will be fcntl() or flock() locking, which is a per-filehandle lock. As a result, the code posted which makes a new connection each time will always allocate a new filehandle and hence SQLite's own locking will prevent database corruption. A consequence of this is that it's typically a bad idea to use SQLite on an NFS share or similar, as often these don't provide particularly reliable locking (it does depend on your NFS implementation, though).

As @abarnert has already pointed out in comments, SQLite has had issues with threads, but these were related to sharing a single connection between threads. As he also mentions, this means that if you use an application-wide pool you'll get runtime errors if a second thread pulls a recycled connection out of the pool. These are also the sort of irritating bugs which you might not notice in testing (light load, perhaps only a single thread in use), but which could easily cause headaches later. Martijn Pieters' later suggestion of a thread-local pool should work fine.
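To make that concrete, here's a minimal sketch of the kind of thread-local approach being suggested - the get_connection helper and its structure are my own illustration, not anyone's actual code:

import sqlite3
import threading

# One connection per thread, created lazily the first time that thread queries.
_local = threading.local()

def get_connection(path="models/data.db"):
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = sqlite3.connect(path)
        _local.conn = conn
    return conn

def query(command, arguments=()):
    conn = get_connection()
    cursor = conn.cursor()
    cursor.execute(command, arguments)
    result = cursor.fetchall()
    conn.commit()  # make sure no locks are left held between uses
    return result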

As outlined in the SQLite FAQ, as of version 3.3.1 it's actually safe to pass connections between threads as long as they don't hold any locks - this was a concession the author of SQLite added despite being critical of the use of threads in general. Any sensible connection pooling implementation will always ensure that everything has been either committed or rolled back before the connection is returned to the pool, so an application-global pool would likely be safe if it weren't for the Python-level check against sharing, which I believe remains in place even when a more recent version of SQLite is used. Certainly my Python 2.7.3 system has an sqlite3 module whose sqlite_version_info reports 3.7.9, yet it still throws a RuntimeError if you access a connection from multiple threads.

In any case, while that check exists, connections can't effectively be shared even if the underlying SQLite library supports it.
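Strictly speaking, connect() does accept a check_same_thread argument that turns this check off, but all that does is move the burden of serialising access onto your own code - a minimal sketch of what that would involve (the lock and query helper are just for illustration):

import sqlite3
import threading

# check_same_thread=False only disables Python's same-thread check;
# it does not make the connection thread-safe, so guard it with a lock.
conn = sqlite3.connect("models/data.db", check_same_thread=False)
conn_lock = threading.Lock()

def query(command, arguments=()):
    with conn_lock:
        cursor = conn.cursor()
        cursor.execute(command, arguments)
        return cursor.fetchall()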

As to your original question, certainly creating a new connection each time is less efficient than keeping a pool of connections, but as has already been mentioned this would need to be a thread-local pool, which is a slight pain to implement. The overhead of creating a new connection to the database is essentially opening the file and reading the header to make sure it's a valid SQLite file. The overhead of actually executing a statement is higher, as it needs to take out locks and perform quite a bit of file I/O, so the bulk of the work is actually deferred until statement execution and/or commit.

Interestingly, however, at least on the Linux systems I've looked at, the code that executes statements repeats the step of reading the file header - as a result, opening a new connection isn't going to be all that bad, since the initial read when opening the connection will have pulled the header into the system's filesystem cache. So it boils down to the overhead of opening a single filehandle.

I should also add that if you're expecting your code to scale to high concurrency then SQLite might be a poor choice. As their own website points out, it's not really suitable for high concurrency, because the performance hit of having to squeeze all access through a single global lock starts to bite as the number of concurrent threads increases. It's fine if you're using threads for convenience, but if you're really expecting a high degree of concurrency then I'd avoid SQLite.

In short, I don't think your approach of opening each time is actually all that bad. Could a thread-local pool improve performance? Probably, yes. Would this performance gain be noticeable? In my opinion, not unless you're seeing quite high connection rates, and at that point you'll have a lot of threads so you probably want to move away from SQLite anyway because it doesn't handle concurrency terribly well. If you do decide to use one, make sure it cleans up the connection before returning it to the pool - SQLAlchemy has some connection pooling functionality that you might find useful even if you don't want all the ORM layers on top.
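For example, SQLAlchemy's pool module can be used on its own, without the ORM - the pool class and path below are just one way to wire it up, not a recommendation of specific settings. SingletonThreadPool keeps one connection per thread, which matches the thread-local approach discussed above:

import sqlite3
from sqlalchemy.pool import SingletonThreadPool

# One pooled connection per thread; close() hands the proxied
# connection back to the pool rather than really closing it.
pool = SingletonThreadPool(lambda: sqlite3.connect("models/data.db"))

def query(command, arguments=()):
    conn = pool.connect()
    try:
        cursor = conn.cursor()
        cursor.execute(command, arguments)
        return cursor.fetchall()
    finally:
        conn.close()  # return the connection to the pool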

EDIT

As quite reasonably pointed out, I should attach some real timings. These are from a fairly low-powered VPS:

>>> timeit.timeit("cur = conn.cursor(); cur.execute('UPDATE foo SET name=\"x\"
    WHERE id=3'); conn.commit()", setup="import sqlite3;
    conn = sqlite3.connect('./testdb')", number=100000)
5.733098030090332
>>> timeit.timeit("conn = sqlite3.connect('./testdb'); cur = conn.cursor();
    cur.execute('UPDATE foo SET name=\"x\" WHERE id=3'); conn.commit()",
    setup="import sqlite3", number=100000)
16.518677949905396

You can see a factor of around 3x difference, which isn't insignificant. However, the absolute time is still sub-millisecond, so unless you do a lot of queries per request there are probably other places to optimise first. If you do a lot of queries, a reasonable compromise might be a new connection per request (but without the complexity of a pool - just reconnect every time).
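For illustration, a per-request version in Tornado might look something like this (the handler names, table and query are made up, not from your application):

import sqlite3
import tornado.web

class BaseHandler(tornado.web.RequestHandler):
    def prepare(self):
        # Open one connection per request...
        self.db = sqlite3.connect("models/data.db")

    def on_finish(self):
        # ...and close it once the response has been sent.
        self.db.close()

class ExampleHandler(BaseHandler):
    def get(self):
        cursor = self.db.cursor()
        cursor.execute("SELECT name FROM foo WHERE id = ?", (3,))
        self.write({"rows": cursor.fetchall()})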

For reading (i.e. SELECT), the relative overhead of connecting each time will be higher, but the absolute overhead in wall-clock time should be about the same.

As has already been discussed elsewhere on this question, you should test with real queries; I just wanted to document what I'd done to reach my conclusions.

Cartroo answered Sep 25 '22


If you want to know how inefficient something is, write a test and see for yourself.

Once I'd fixed the bugs to make your example work in the first place, and written the code to create a test case to run it against, figuring out how to time it with timeit was as trivial as it usually is.

See http://pastebin.com/rd39vkVa
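In case the pastebin isn't reachable, the general shape of the test was a function that reopens the database for every query timed against one that reuses a single connection; the sketch below shows that approach rather than the exact pasted code (the table name, query and defaults are made up):

import sqlite3
import sys
import timeit

DB = "test.db"

def setup():
    conn = sqlite3.connect(DB)
    conn.execute("CREATE TABLE IF NOT EXISTS foo (id INTEGER PRIMARY KEY, name TEXT)")
    conn.commit()
    conn.close()

def reopen():
    # Open a new connection for every query.
    conn = sqlite3.connect(DB)
    conn.execute("SELECT name FROM foo WHERE id = 1").fetchall()
    conn.close()

_shared = None

def reuse():
    # Reuse a single connection for every query.
    global _shared
    if _shared is None:
        _shared = sqlite3.connect(DB)
    _shared.execute("SELECT name FROM foo WHERE id = 1").fetchall()

if __name__ == "__main__":
    n = int(sys.argv[1]) if len(sys.argv) > 1 else 10000
    setup()
    print("reopen: %s" % timeit.timeit(reopen, number=n))
    print("reuse:  %s" % timeit.timeit(reuse, number=n))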

So, what happens when you run it?

$ python2.7 sqlite_test.py 10000
reopen: 2.02089715004
reuse:  0.278793811798
$ python3.3 sqlite_test.py 10000
reopen: 1.8329595914110541
reuse:  0.2124928394332528
$ pypy sqlite_test.py 10000
reopen: 3.87628388405
reuse:  0.760829925537

So, opening the database takes about 4 to 8 times as long as running a dead-simple query against a near-empty table that returns nothing. There's your worst case.

abarnert answered Sep 26 '22