Efficiency of reopening sqlite database after each query

Tags: python, sqlite

I'm currently working on a web server in Tornado, but I'm having issues with different bits of code trying to access the database at once.

I've simplified this down to a query function which basically does this (the real one is slightly more advanced):

import sqlite3

def query(command, arguments=[]):
    # Open a fresh connection for every query, run it, fetch everything, then close.
    db = sqlite3.connect("models/data.db")
    cursor = db.cursor()
    cursor.execute(command, arguments)
    result = cursor.fetchall()
    db.close()
    return result

I'm just wondering how efficient it is to reopen the database after every query (I'd guess it's a fairly expensive constant-time operation, unless something gets cached?), and whether there's a better way to do this.

matts asked Jan 24 '13



2 Answers

I'm adding my own answer because I disagree with the currently accepted one. It states that the operation is not thread-safe, but this is plain wrong - SQLite uses file locking appropriate to its current platform to ensure that all accesses comply with ACID.

On Unix systems this will be fcntl() or flock() locking, which is a per-filehandle lock. As a result, the code posted which makes a new connection each time will always allocate a new filehandle and hence SQLite's own locking will prevent database corruption. A consequence of this is that it's typically a bad idea to use SQLite on an NFS share or similar, as often these don't provide particularly reliable locking (it does depend on your NFS implementation, though).

As @abarnert has already pointed out in comments, SQLite has had issues with threads, but these were related to sharing a single connection between threads. As he also mentions, this means that if you use an application-wide pool you'll get runtime errors if a second thread pulls a recycled connection out of the pool. These are also the sort of irritating bugs which you might not notice in testing (light load, perhaps only a single thread in use), but which could easily cause headaches later. Martijn Pieters' later suggestion of a thread-local pool should work fine.
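To make that concrete, here's a minimal sketch of the kind of thread-local approach being suggested - the get_connection helper and its structure are my own illustration, not anyone's actual code:

import sqlite3
import threading

# One connection per thread, created lazily the first time that thread queries.
_local = threading.local()

def get_connection(path="models/data.db"):
    conn = getattr(_local, "conn", None)
    if conn is None:
        conn = sqlite3.connect(path)
        _local.conn = conn
    return conn

def query(command, arguments=()):
    conn = get_connection()
    cursor = conn.cursor()
    cursor.execute(command, arguments)
    result = cursor.fetchall()
    conn.commit()  # make sure no locks are left held between uses
    return result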

As outlined in the SQLite FAQ, as of version 3.3.1 it's actually safe to pass connections between threads as long as they don't hold any locks - this was a concession the author of SQLite added despite being critical of the use of threads in general. Any sensible connection pooling implementation will always ensure that everything has been either committed or rolled back before the connection is returned to the pool, so an application-global pool would likely be safe if it weren't for the Python-level check against sharing, which I believe remains in place even when a more recent version of SQLite is used. Certainly my Python 2.7.3 system has an sqlite3 module whose sqlite_version_info reports 3.7.9, yet it still throws a RuntimeError if you access a connection from multiple threads.

In any case, while that check exists, connections can't effectively be shared even if the underlying SQLite library supports it.
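Strictly speaking, connect() does accept a check_same_thread argument that turns this check off, but all that does is move the burden of serialising access onto your own code - a minimal sketch of what that would involve (the lock and query helper are just for illustration):

import sqlite3
import threading

# check_same_thread=False only disables Python's same-thread check;
# it does not make the connection thread-safe, so guard it with a lock.
conn = sqlite3.connect("models/data.db", check_same_thread=False)
conn_lock = threading.Lock()

def query(command, arguments=()):
    with conn_lock:
        cursor = conn.cursor()
        cursor.execute(command, arguments)
        return cursor.fetchall()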

As to your original question, certainly creating a new connection each time is less efficient than keeping a pool of connections, but as has already been mentioned this would need to be a thread-local pool, which is a slight pain to implement. The overhead of creating a new connection to the database is essentially opening the file and reading the header to make sure it's a valid SQLite file. The overhead of actually executing a statement is higher, as it needs to take out locks and perform quite a bit of file I/O, so the bulk of the work is actually deferred until statement execution and/or commit.

Interestingly, however, at least on the Linux systems I've looked at, the code that executes statements repeats the step of reading the file header - as a result, opening a new connection isn't going to be all that bad, since the initial read when opening the connection will have pulled the header into the system's filesystem cache. So it boils down to the overhead of opening a single filehandle.

I should also add that if you're expecting your code to scale to high concurrency then SQLite might be a poor choice. As their own website points out, it's not really suitable for high concurrency, because the performance hit of having to squeeze all access through a single global lock starts to bite as the number of concurrent threads increases. It's fine if you're using threads for convenience, but if you're really expecting a high degree of concurrency then I'd avoid SQLite.

In short, I don't think your approach of opening each time is actually all that bad. Could a thread-local pool improve performance? Probably, yes. Would this performance gain be noticeable? In my opinion, not unless you're seeing quite high connection rates, and at that point you'll have a lot of threads so you probably want to move away from SQLite anyway because it doesn't handle concurrency terribly well. If you do decide to use one, make sure it cleans up the connection before returning it to the pool - SQLAlchemy has some connection pooling functionality that you might find useful even if you don't want all the ORM layers on top.
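For example, SQLAlchemy's pool module can be used on its own, without the ORM - the pool class and path below are just one way to wire it up, not a recommendation of specific settings. SingletonThreadPool keeps one connection per thread, which matches the thread-local approach discussed above:

import sqlite3
from sqlalchemy.pool import SingletonThreadPool

# One pooled connection per thread; close() hands the proxied
# connection back to the pool rather than really closing it.
pool = SingletonThreadPool(lambda: sqlite3.connect("models/data.db"))

def query(command, arguments=()):
    conn = pool.connect()
    try:
        cursor = conn.cursor()
        cursor.execute(command, arguments)
        return cursor.fetchall()
    finally:
        conn.close()  # return the connection to the pool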

EDIT

As quite reasonably pointed out, I should attach some real timings. These are from a fairly low-powered VPS:

>>> timeit.timeit("cur = conn.cursor(); cur.execute('UPDATE foo SET name=\"x\"
    WHERE id=3'); conn.commit()", setup="import sqlite3;
    conn = sqlite3.connect('./testdb')", number=100000)
5.733098030090332
>>> timeit.timeit("conn = sqlite3.connect('./testdb'); cur = conn.cursor();
    cur.execute('UPDATE foo SET name=\"x\" WHERE id=3'); conn.commit()",
    setup="import sqlite3", number=100000)
16.518677949905396

You can see a factor of around 3x difference, which isn't insignificant. However, the absolute time is still sub-millisecond, so unless you do a lot of queries per request there are probably other places to optimise first. If you do a lot of queries, a reasonable compromise might be a new connection per request (but without the complexity of a pool - just reconnect every time).
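For illustration, a per-request version in Tornado might look something like this (the handler names, table and query are made up, not from your application):

import sqlite3
import tornado.web

class BaseHandler(tornado.web.RequestHandler):
    def prepare(self):
        # Open one connection per request...
        self.db = sqlite3.connect("models/data.db")

    def on_finish(self):
        # ...and close it once the response has been sent.
        self.db.close()

class ExampleHandler(BaseHandler):
    def get(self):
        cursor = self.db.cursor()
        cursor.execute("SELECT name FROM foo WHERE id = ?", (3,))
        self.write({"rows": cursor.fetchall()})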

For reading (i.e. SELECT), the relative overhead of connecting each time will be higher, but the absolute overhead in wall-clock time should be about the same.

As has already been discussed elsewhere on this question, you should test with real queries; I just wanted to document what I'd done to reach my conclusions.

Cartroo answered Sep 25 '22


If you want to know how inefficient something is, write a test and see for yourself.

Once I'd fixed the bugs to make your example work in the first place, and written the code to create a test case to run it against, figuring out how to time it with timeit was as trivial as it usually is.

See http://pastebin.com/rd39vkVa
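In case the pastebin isn't reachable, the general shape of the test was a function that reopens the database for every query timed against one that reuses a single connection; the sketch below shows that approach rather than the exact pasted code (the table name, query and defaults are made up):

import sqlite3
import sys
import timeit

DB = "test.db"

def setup():
    conn = sqlite3.connect(DB)
    conn.execute("CREATE TABLE IF NOT EXISTS foo (id INTEGER PRIMARY KEY, name TEXT)")
    conn.commit()
    conn.close()

def reopen():
    # Open a new connection for every query.
    conn = sqlite3.connect(DB)
    conn.execute("SELECT name FROM foo WHERE id = 1").fetchall()
    conn.close()

_shared = None

def reuse():
    # Reuse a single connection for every query.
    global _shared
    if _shared is None:
        _shared = sqlite3.connect(DB)
    _shared.execute("SELECT name FROM foo WHERE id = 1").fetchall()

if __name__ == "__main__":
    n = int(sys.argv[1]) if len(sys.argv) > 1 else 10000
    setup()
    print("reopen: %s" % timeit.timeit(reopen, number=n))
    print("reuse:  %s" % timeit.timeit(reuse, number=n))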

So, what happens when you run it?

$ python2.7 sqlite_test.py 10000
reopen: 2.02089715004
reuse:  0.278793811798
$ python3.3 sqlite_test.py 10000
reopen: 1.8329595914110541
reuse:  0.2124928394332528
$ pypy sqlite_test.py 10000
reopen: 3.87628388405
reuse:  0.760829925537

So, opening the database takes about 4 to 8 times as long as running a dead-simple query against a near-empty table that returns nothing. There's your worst case.

abarnert answered Sep 26 '22