Faster SQLite3 query in Go? I need to process 1 million+ rows as fast as possible


What's the fastest way to read a sqlite3 table in golang?

package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	start := time.Now()

	db, err := sql.Open("sqlite3", "/Users/robertking/go/src/bitbucket.org/thematicanalysis/optimization_test/robs.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query("select * from data")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
	}
	err = rows.Err()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(time.Since(start))
}

This takes 8 seconds in Go because .Next is slow. In Python, a fetchall takes only 4 seconds! I'm rewriting in Go to gain performance, not lose it.

Here is the Python code; I couldn't find an equivalent of fetchall in Go:

import time

start = time.time()
import sqlite3
conn = sqlite3.connect('/Users/robertking/go/src/bitbucket.org/thematicanalysis/optimization_test/robs.db')
c = conn.cursor()
c.execute("SELECT * FROM data")
x = c.fetchall()
print time.time() - start
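For reference, database/sql has no direct fetchall equivalent, but it can be approximated by scanning every row into a slice. This is a minimal, generic sketch (the path is shortened, and it still pays the per-row Scan overhead discussed below, so it will not by itself make the Go version faster):

package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

// fetchAll is a rough analogue of Python's cursor.fetchall(): it scans every
// row of the result set into memory as []interface{} values.
func fetchAll(db *sql.DB, query string) ([][]interface{}, error) {
	rows, err := db.Query(query)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	cols, err := rows.Columns()
	if err != nil {
		return nil, err
	}

	var out [][]interface{}
	for rows.Next() {
		// Scan needs pointers, so build a slice of pointers into the row values.
		vals := make([]interface{}, len(cols))
		ptrs := make([]interface{}, len(cols))
		for i := range vals {
			ptrs[i] = &vals[i]
		}
		if err := rows.Scan(ptrs...); err != nil {
			return nil, err
		}
		out = append(out, vals)
	}
	return out, rows.Err()
}

func main() {
	db, err := sql.Open("sqlite3", "robs.db") // path shortened for the sketch
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	all, err := fetchAll(db, "select * from data")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(len(all), "rows")
}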

Edit: adding a bounty. I'm reading the data in Go, Python and C; here are the results. I don't want to use C, but I'll stick with Python if Go isn't faster:

py: 2.45s
go: 2.13s (using github.com/mxk/go-sqlite/sqlite3 instead of github.com/mattn/go-sqlite3)
c:  0.32s

I feel like Go should be closer to the C side of things. Does anyone know how to make it faster? Is it possible to avoid the mutex with read-only mode?
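On the read-only question: mattn/go-sqlite3 accepts URI-style DSNs, so the database can be opened read-only straight from the connection string; whether that removes measurable mutex overhead is a separate matter, since the per-row cgo and conversion cost remains. A minimal sketch (mode=ro and immutable=1 are standard SQLite URI parameters; immutable assumes the file never changes while open, and exact support depends on the driver version, so verify before relying on it):

package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	// URI-style DSN: mode=ro opens the file read-only; immutable=1 tells
	// SQLite the file will not change, which lets it skip locking.
	dsn := "file:/Users/robertking/go/src/bitbucket.org/thematicanalysis/optimization_test/robs.db?mode=ro&immutable=1"
	db, err := sql.Open("sqlite3", dsn)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	if err := db.Ping(); err != nil {
		log.Fatal(err)
	}
	log.Println("opened read-only")
}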

edit:

It seems like all the sqlite3 drivers are slow (too much reflection and too many cgo calls for conversions), so I'll have to write my own interface.

Here's the schema:

CREATE TABLE mytable (
  c0   REAL,
  c1   INTEGER,
  c15  TEXT,
  c16  TEXT,
  c17  TEXT,
  c18  TEXT,
  c19  TEXT,
  c47  TEXT,
  c74  REAL DEFAULT 0,
  c77  TEXT,
  c101 TEXT,
  c103 TEXT,
  c108 TEXT,
  c110 TEXT,
  c125 TEXT,
  c126 TEXT,
  c127 REAL DEFAULT 0,
  x    INTEGER PRIMARY KEY
);

and the query is dynamic but usually something like this:

SELECT c77,c77,c125,c126,c127,c74 from mytable 

edit:

Looks like I'll fork the sqlite3 implementation and add some methods that focus on performance. This is an example of some code which is much faster:

package main

/*
 #cgo LDFLAGS: -l sqlite3
 #include "sqlite3.h"
*/
import "C"

import (
	//"database/sql"
	"log"
	"reflect"
	"unsafe"
)

type Row struct {
	v77  string
	v125 string
	v126 string
	v127 float64
	v74  float64
}

// cStr returns a pointer to the first byte in s.
func cStr(s string) *C.char {
	h := (*reflect.StringHeader)(unsafe.Pointer(&s))
	return (*C.char)(unsafe.Pointer(h.Data))
}

func main() {
	getDataFromSqlite()
}

func getDataFromSqlite() {
	var db *C.sqlite3
	name := "../data_dbs/all_columns.db"
	rc := C.sqlite3_open_v2(cStr(name+"\x00"), &db, C.SQLITE_OPEN_READONLY, nil)

	var stmt *C.sqlite3_stmt
	rc = C.sqlite3_prepare_v2(db, cStr("SELECT c77,c125,c126,c127,c74 from data\x00"), C.int(-1), &stmt, nil)
	rc = C.sqlite3_reset(stmt)

	var result C.double
	result = 0.0
	rc = C.sqlite3_step(stmt)
	for rc == C.SQLITE_ROW {
		C.GoString((*C.char)(unsafe.Pointer(C.sqlite3_column_text(stmt, 0))))
		C.GoString((*C.char)(unsafe.Pointer(C.sqlite3_column_text(stmt, 1))))
		C.GoString((*C.char)(unsafe.Pointer(C.sqlite3_column_text(stmt, 2))))
		C.sqlite3_column_double(stmt, 3)
		result += C.sqlite3_column_double(stmt, 4)
		rc = C.sqlite3_step(stmt)
	}
	log.Println(result)
}
asked Dec 28 '17 by robert king




2 Answers

Introduction

My assumption was that the problem lies in how the performance is measured here, so I wrote a small Go program to generate records and save them into an SQLite database, as well as Python and Go implementations of a small task to run over those records.

You can find the according repository at https://github.com/mwmahlberg/sqlite3perf

The data model

The records generated consist of

  • ID: A row ID generated by SQLite
  • rand: A hex-encoded 8-byte pseudo-random value
  • hash: A hex-encoded SHA256 hash of the raw (unencoded) rand bytes

The table's schema is relatively simple:

sqlite> .schema
CREATE TABLE bench (ID int PRIMARY KEY ASC, rand TEXT, hash TEXT);
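To make the data model concrete, this is a minimal sketch of how one such record can be produced; it reflects my description above, not necessarily the exact code in the sqlite3perf repository:

package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

func main() {
	// 8 pseudo-random bytes, hex-encoded -> the "rand" column.
	buf := make([]byte, 8)
	if _, err := rand.Read(buf); err != nil {
		panic(err)
	}
	randHex := hex.EncodeToString(buf)

	// SHA256 of the raw bytes (not of the hex string), hex-encoded -> the "hash" column.
	sum := sha256.Sum256(buf)
	hashHex := hex.EncodeToString(sum[:])

	fmt.Println(randHex, hashHex)
}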

First I generated 1.5M records and vacuumed the sqlite database afterwards with

$ ./sqlite3perf generate -r 1500000 -v 

Next I ran the Go implementation against those 1.5M records. Both the Go and the Python implementation perform the same simple task (a Go sketch of this loop follows the list):

  1. Read all entries from the database.
  2. For each row, decode the random value from hex, then create a SHA256 hex from the result.
  3. Compare the generated SHA256 hex string against the one stored in the database.
  4. If they match, continue, otherwise break.
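A minimal Go sketch of that loop, based on the bench schema above (the database filename is an assumption; the actual implementation lives in the sqlite3perf repository):

package main

import (
	"crypto/sha256"
	"database/sql"
	"encoding/hex"
	"log"

	_ "github.com/mattn/go-sqlite3"
)

func main() {
	db, err := sql.Open("sqlite3", "./sqlite3perf.db") // filename is an assumption
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query("SELECT ID, rand, hash FROM bench")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		var randHex, hashHex string
		if err := rows.Scan(&id, &randHex, &hashHex); err != nil {
			log.Fatal(err)
		}

		// Decode the stored random value from hex...
		raw, err := hex.DecodeString(randHex)
		if err != nil {
			log.Fatal(err)
		}
		// ...hash the raw bytes and hex-encode the digest...
		sum := sha256.Sum256(raw)
		// ...and compare against the stored hash.
		if hex.EncodeToString(sum[:]) != hashHex {
			log.Fatalf("hash mismatch for row %d", id)
		}
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}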

Assumptions

My explicit assumption was that Python does some type of lazy loading of the result set, and possibly even defers execution of the SQL query until the results are accessed.

The results

Go implementation

$ ./sqlite3perf bench
2017/12/31 15:21:48 bench called
2017/12/31 15:21:48 Time after query: 4.824009ms
2017/12/31 15:21:48 Beginning loop
2017/12/31 15:21:48 Acessing the first result set
     ID 0,
     rand: 6a8a4ad02e5e872a,
     hash: 571f1053a7c2aaa56e5c076e69389deb4db46cc08f5518c66a4bc593e62b9aa4
    took 548.32µs
2017/12/31 15:21:50 641,664 rows processed
2017/12/31 15:21:52 1,325,186 rows processed
2017/12/31 15:21:53 1,500,000 rows processed
2017/12/31 15:21:53 Finished loop after 4.519083493s
2017/12/31 15:21:53 Average 3.015µs per record, 4.523936078s overall

Note the values for "time after query" (the time the query command took to return) and the time it took to access the first result set after the iteration over the result set was started.

Python implementation

$ python bench.py
12/31/2017 15:25:41 Starting up
12/31/2017 15:25:41 Time after query: 1874µs
12/31/2017 15:25:41 Beginning loop
12/31/2017 15:25:44 Accessing first result set
    ID: 0
    rand: 6a8a4ad02e5e872a
    hash: 571f1053a7c2aaa56e5c076e69389deb4db46cc08f5518c66a4bc593e62b9aa4
    took 2.719312 s
12/31/2017 15:25:50 Finished loop after 9.147431s
12/31/2017 15:25:50 Average: 6.098µs per record, 0:00:09.149522 overall

Again, note the value for "time after query" and the time it took to access the first result set.

Summary

It took the Go implementation quite a while to return after the SELECT query was sent, while Python seemed blazing fast in comparison. However, looking at the time it took to actually access the first result set, we can see that the Go implementation is more than 500 times faster (5.372329ms vs 2719.312ms) and roughly twice as fast as the Python implementation for the task at hand.

Notes

  • In order to prove the assumption that Python actually does lazy loading of the result set, each and every row and column had to be accessed, to make sure Python is forced to actually read the values from the database.
  • I chose a hashing task because presumably the implementation of SHA256 is highly optimised in both languages.

Conclusion

Python does seem to do lazy loading of result sets, and possibly does not even execute a query until the corresponding result set is actually accessed. In this simulated scenario, mattn's SQLite driver for Go outperforms Python's by anywhere between roughly 100% and orders of magnitude, depending on what you want to do.

Edit: So, in order to get fast processing, implement your task in Go. While it takes longer to send the actual query, accessing the individual rows of the result set is by far faster. I'd suggest starting out with a small subset of your data, say 50k records. Then, to further improve your code, use profiling to identify your bottlenecks. Depending on what you do during processing, pipelines for example might help (a sketch follows), but how to improve the processing speed of the task at hand is difficult to say without actual code or a thorough description.
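As an illustration of the pipeline idea, here is a minimal sketch: one goroutine scans rows and sends them down a channel while several workers do the per-row processing in parallel. The record type, worker count, database path and the process function are placeholders for the question's actual workload, not code from it:

package main

import (
	"database/sql"
	"log"
	"sync"

	_ "github.com/mattn/go-sqlite3"
)

// record is a placeholder matching the columns of the question's query.
type record struct {
	c77, c125, c126 string
	c127, c74       float64
}

func process(r record) {
	// Whatever per-row computation you actually need goes here.
	_ = r
}

func main() {
	db, err := sql.Open("sqlite3", "robs.db") // path shortened for the sketch
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rows, err := db.Query("SELECT c77, c125, c126, c127, c74 FROM mytable")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	ch := make(chan record, 1024)

	// Fan out: a few workers consume records while the main goroutine scans.
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for r := range ch {
				process(r)
			}
		}()
	}

	for rows.Next() {
		var r record
		if err := rows.Scan(&r.c77, &r.c125, &r.c126, &r.c127, &r.c74); err != nil {
			log.Fatal(err)
		}
		ch <- r
	}
	close(ch)
	wg.Wait()

	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}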

answered Nov 13 '22 by Markus W Mahlberg


Scanning values from the retrieved rows:
Since Query() and QueryRow() return a pointer to Rows and a pointer to Row respectively, we can use the Scan() methods on the Rows and Row structs to get access to the values in the retrieved rows.

for rows.Next() {
	var empID sql.NullInt64
	var empName sql.NullString
	var empAge sql.NullInt64
	var empPersonId sql.NullInt64

	if err := rows.Scan(&empID, &empName, &empAge, &empPersonId); err != nil {
		log.Fatal(err)
	}

	fmt.Printf("ID %d with personID:%d & name %s is age %d\n",
		empID.Int64, empPersonId.Int64, empName.String, empAge.Int64)
}

We can also use the Scan() method from the Row struct; Scan() is the only method declared on Row.

func (r *Row) Scan(dest ...interface{}) error 
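For completeness, a minimal sketch of QueryRow() with Row.Scan() for a single-row lookup, assuming db is an open *sql.DB as above; the table and column names here are made up for the example:

// Fetch a single row by primary key; Row.Scan returns sql.ErrNoRows
// if the query matched nothing.
var empName sql.NullString
err := db.QueryRow("SELECT name FROM employees WHERE id = ?", 42).Scan(&empName)
switch {
case err == sql.ErrNoRows:
	log.Println("no employee with that id")
case err != nil:
	log.Fatal(err)
default:
	fmt.Println(empName.String)
}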
answered Nov 13 '22 by stefan