I've got time-series data that I'd like to store in a database, with one row per observation in the format (group, date, value1, value2, value3, value4).
This database will be almost all reads. The searches will be for rows belonging to a given group within a date range (e.g. group = XXX and date >= START and date <= END).
The dataset is big: hundreds of millions of rows. Will SQLite handle this kind of data easily? The appealing thing about SQLite is that it is serverless, and I'd like to use it if I can.
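For concreteness, the queries would look something like the sketch below; the table and column names are just placeholders (they happen to match the test schema used in the answer):

sqlite3 test.db "SELECT * FROM TimeSeries WHERE grp = 'GROUP-123' AND date >= '1927-01-01' AND date <= '1927-12-31'"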
Very large datasets
An SQLite database is limited in size to 281 terabytes (2^48 bytes, or 256 tebibytes). And even if it could handle larger databases, SQLite stores the entire database in a single disk file, and many filesystems limit the maximum size of files to something less than this.
A million records is no problem. Implementation Limits For SQLite says: The theoretical maximum number of rows in a table is 2^64 (18446744073709551616 or about 1.8e+19). This limit is unreachable since the maximum database size of 281 terabytes will be reached first.
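That 281 TB figure is just page arithmetic: a database file holds at most 4294967294 pages of at most 65536 bytes each, and 4294967294 × 65536 ≈ 2.8e14 bytes. You can inspect those limits on any database; a minimal check with standard PRAGMAs:

sqlite3 test.db 'PRAGMA page_size; PRAGMA page_count; PRAGMA max_page_count;'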
Unlike client-server database management systems, SQLite is self-contained in a single file: it is library, database, and data, all in one package. For certain applications, SQLite is a solid choice for a production database. It's lightweight, ultra-portable, and has no external dependencies.
Updated Answer
I tried putting the 100 million record database on a RAM disk. I don't mean an in-memory database; I mean RAM formatted into a filesystem with the database on it. The same query takes 11 seconds rather than the 147 seconds below! This may be an option for you if, as you say, your queries are largely read-only: you could copy your database from disk to a RAM drive in the morning and access it nice and fast there all day, or until reboot, without worrying about losing it if the power goes or the machine crashes. You could then run any writes against the disk-based copy, and they would be reflected the next time you copied the database to RAM.
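As a sketch of that workflow on Linux (the mount point and size here are assumptions; on macOS you would use hdiutil/diskutil instead):

# Create a RAM-backed filesystem and copy the database onto it
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=16g tmpfs /mnt/ramdisk
cp 100m.db /mnt/ramdisk/
# Queries now read from RAM rather than disk
sqlite3 /mnt/ramdisk/100m.db 'select count(*) from TimeSeries'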
Original Answer
I created three test databases modelled on your needs and inserted 1 million, 10 million and 100 million records into them respectively, using the following code in Perl. I synthesised group names simply from the prefix "GROUP-" and the record number, and generated random dates between 1900 and 2000. The value data was random.
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dsn      = "dbi:SQLite:dbname=test.db";
my $user     = '';
my $password = '';
my %attr     = ( RaiseError => 1, AutoCommit => 0 );

my $dbh = DBI->connect( $dsn, $user, $password, \%attr )
    or die "Can't connect to database: $DBI::errstr";

$dbh->do("DROP TABLE IF EXISTS TimeSeries;");
$dbh->do("CREATE TABLE TimeSeries (grp TEXT, date TEXT, val1 INTEGER, val2 INTEGER, val3 INTEGER, val4 INTEGER, PRIMARY KEY(grp,date))");

my $sql = qq{ INSERT INTO TimeSeries VALUES ( ?, ?, ?, ?, ?, ? ) };
my $sth = $dbh->prepare($sql);

for ( my $i = 0; $i < 100000000; $i++ ) {
    # Synthesize a group name from the record number
    my $group = sprintf( "GROUP-%d", $i );
    $sth->bind_param( 1, $group );

    # Generate a random date between 1900 and 2000
    my $year  = int( rand(100) ) + 1900;
    my $month = int( rand(12) ) + 1;
    my $day   = int( rand(28) ) + 1;
    my $date  = sprintf( "%d-%02d-%02d 00:00:00.0", $year, $month, $day );
    $sth->bind_param( 2, $date );

    # Four random integer values per row
    $sth->bind_param( 3, int( rand(1000000) ) );
    $sth->bind_param( 4, int( rand(1000000) ) );
    $sth->bind_param( 5, int( rand(1000000) ) );
    $sth->bind_param( 6, int( rand(1000000) ) );
    $sth->execute();

    # Commit in batches of 1000 rows to keep transactions a sensible size
    if ( ( $i % 1000 ) == 0 ) { print "$i\n"; $dbh->commit(); }
}
$dbh->commit();
$sth->finish();
$dbh->disconnect();
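If you experiment with the load phase, note that SQLite's durability settings dominate bulk-insert speed. One hedged tweak, using standard SQLite PRAGMAs issued right after connecting, trades crash-safety during the load for speed:

# Loosen durability guarantees while bulk-loading (don't leave these on in production)
$dbh->do("PRAGMA synchronous = OFF");
$dbh->do("PRAGMA journal_mode = MEMORY");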
The file sizes come out like this for 1m, 10m and 100m records:
-rw-r--r-- 1 mark staff 103M 4 Feb 14:16 1m.db
-rw-r--r-- 1 mark staff 1.0G 4 Feb 14:18 10m.db
-rw-r--r-- 1 mark staff 11G 4 Feb 15:10 100m.db
A few random records look like this:
GROUP-794|1927-12-14 00:00:00.0|233545|700623|848770|61504
GROUP-797|1927-06-13 00:00:00.0|315357|246334|276825|799325
GROUP-840|1927-09-28 00:00:00.0|682335|5651|879688|247996
GROUP-907|1927-05-19 00:00:00.0|148547|595716|516884|820007
GROUP-1011|1927-06-01 00:00:00.0|793543|479096|433073|786200
I then tried querying all records from 1927, like this:
time sqlite3 1m.db 'select * from timeseries where date between "1927-01-01" and "1927-12-31"'
The query times were as follows:
all records in year 1927 from 1m record database => 2.7 seconds
all records in year 1927 from 10m record database => 14 seconds
all records in year 1927 from 100m record database => 147 seconds
Those numbers look pretty linear to me, but this is a simplistic test and you may want to take the code and play around with it some more...
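The linear growth is what you would expect here: the composite primary key is on (grp, date), so a date-only predicate can't use it, and every query is a full table scan. You can confirm the plan, and sketch a fix, like this (the index name is made up):

time sqlite3 100m.db 'EXPLAIN QUERY PLAN select * from timeseries where date between "1927-01-01" and "1927-12-31"'
sqlite3 100m.db 'CREATE INDEX idx_timeseries_date ON TimeSeries(date)'

With your actual access pattern (group = XXX plus a date range), the existing (grp, date) primary key should already serve the query well; the scan only bites in this test because it filters on date alone.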
P.S. I am running on a fairly decent spec iMac with SSD.