I've got time-series data that I'd like to store in a database, with one row per observation in the format (group, date, value1, value2, value3, value4).
This database will be almost all reads. The searches will be for rows belonging to a given group within a date range (e.g. group = XXX and date >= START and date <= END).
The dataset is big: hundreds of millions of rows. Will SQLite handle this kind of data easily? The appealing thing about SQLite is that it is serverless, and I'd like to use it if I can.
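For concreteness, the queries would look something like the sketch below; the table and column names are just placeholders (they happen to match the test schema used in the answer):

sqlite3 test.db "SELECT * FROM TimeSeries WHERE grp = 'GROUP-123' AND date >= '1927-01-01' AND date <= '1927-12-31'"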
Very large datasets
An SQLite database is limited in size to 281 terabytes (2^48 bytes, or 256 tebibytes). And even if it could handle larger databases, SQLite stores the entire database in a single disk file, and many filesystems limit the maximum size of files to something less than this.
A million records is no problem. Implementation Limits For SQLite says: The theoretical maximum number of rows in a table is 2^64 (18446744073709551616 or about 1.8e+19). This limit is unreachable since the maximum database size of 281 terabytes will be reached first.
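That 281 TB figure is just page arithmetic: a database file holds at most 4294967294 pages of at most 65536 bytes each, and 4294967294 × 65536 ≈ 2.8e14 bytes. You can inspect those limits on any database; a minimal check with standard PRAGMAs:

sqlite3 test.db 'PRAGMA page_size; PRAGMA page_count; PRAGMA max_page_count;'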
Unlike client-server database management systems, SQLite is self-contained in a single file: it is library, database, and data, all in one package. For certain applications, SQLite is a solid choice for a production database. It's lightweight, ultra-portable, and has no external dependencies.
Updated Answer
I tried putting the 100 million record database on a RAM disk. I don't mean an in-memory database; I mean RAM formatted into a filesystem with the database on it. The same query takes 11 seconds rather than the 147 seconds below! This may be an option for you if, as you say, your queries are largely read-only: you could copy your database from disk to a RAM drive in the morning and access it nice and fast there all day, or until reboot, without worrying about losing it if the power goes or the machine crashes. You could then run any writes against the disk-based copy, and they would be reflected the next time you copied the database to RAM.
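As a sketch of that workflow on Linux (the mount point and size here are assumptions; on macOS you would use hdiutil/diskutil instead):

# Create a RAM-backed filesystem and copy the database onto it
sudo mkdir -p /mnt/ramdisk
sudo mount -t tmpfs -o size=16g tmpfs /mnt/ramdisk
cp 100m.db /mnt/ramdisk/
# Queries now read from RAM rather than disk
sqlite3 /mnt/ramdisk/100m.db 'select count(*) from TimeSeries'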
Original Answer
I created three test databases modelled on your needs and inserted 1 million, 10 million and 100 million records into them respectively, using the following code in Perl. I synthesised group names simply from the prefix "GROUP-" and the record number, and generated random dates between 1900 and 2000. The value data was random.
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dsn      = "dbi:SQLite:dbname=test.db";
my $user     = '';
my $password = '';
my %attr     = ( RaiseError => 1, AutoCommit => 0 );

my $dbh = DBI->connect( $dsn, $user, $password, \%attr )
    or die "Can't connect to database: $DBI::errstr";

$dbh->do("DROP TABLE IF EXISTS TimeSeries;");
$dbh->do("CREATE TABLE TimeSeries (grp TEXT, date TEXT, val1 INTEGER, val2 INTEGER, val3 INTEGER, val4 INTEGER, PRIMARY KEY(grp,date))");

my $sql = qq{ INSERT INTO TimeSeries VALUES ( ?, ?, ?, ?, ?, ? ) };
my $sth = $dbh->prepare($sql);

for ( my $i = 0; $i < 100000000; $i++ ) {
    # Synthesize a group name from the record number
    my $group = sprintf( "GROUP-%d", $i );
    $sth->bind_param( 1, $group );

    # Generate a random date between 1900 and 2000
    my $year  = int( rand(100) ) + 1900;
    my $month = int( rand(12) ) + 1;
    my $day   = int( rand(28) ) + 1;
    my $date  = sprintf( "%d-%02d-%02d 00:00:00.0", $year, $month, $day );
    $sth->bind_param( 2, $date );

    # Four random integer values per row
    $sth->bind_param( 3, int( rand(1000000) ) );
    $sth->bind_param( 4, int( rand(1000000) ) );
    $sth->bind_param( 5, int( rand(1000000) ) );
    $sth->bind_param( 6, int( rand(1000000) ) );
    $sth->execute();

    # Commit in batches of 1000 rows to keep transactions a sensible size
    if ( ( $i % 1000 ) == 0 ) { print "$i\n"; $dbh->commit(); }
}
$dbh->commit();
$sth->finish();
$dbh->disconnect();
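If you experiment with the load phase, note that SQLite's durability settings dominate bulk-insert speed. One hedged tweak, using standard SQLite PRAGMAs issued right after connecting, trades crash-safety during the load for speed:

# Loosen durability guarantees while bulk-loading (don't leave these on in production)
$dbh->do("PRAGMA synchronous = OFF");
$dbh->do("PRAGMA journal_mode = MEMORY");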
The file sizes come out like this for 1m, 10m and 100m records:
-rw-r--r-- 1 mark staff 103M 4 Feb 14:16 1m.db
-rw-r--r-- 1 mark staff 1.0G 4 Feb 14:18 10m.db
-rw-r--r-- 1 mark staff 11G 4 Feb 15:10 100m.db
A few random records look like this:
GROUP-794|1927-12-14 00:00:00.0|233545|700623|848770|61504
GROUP-797|1927-06-13 00:00:00.0|315357|246334|276825|799325
GROUP-840|1927-09-28 00:00:00.0|682335|5651|879688|247996
GROUP-907|1927-05-19 00:00:00.0|148547|595716|516884|820007
GROUP-1011|1927-06-01 00:00:00.0|793543|479096|433073|786200
I then tried querying all records from 1927, like this:
time sqlite3 1m.db 'select * from timeseries where date between "1927-01-01" and "1927-12-31"'
The query times were as follows:
all records in year 1927 from 1m record database => 2.7 seconds
all records in year 1927 from 10m record database => 14 seconds
all records in year 1927 from 100m record database => 147 seconds
Those numbers look pretty linear to me, but this is a simplistic test and you may want to take the code and play around with it some more...
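The linear growth is what you would expect here: the composite primary key is on (grp, date), so a date-only predicate can't use it, and every query is a full table scan. You can confirm the plan, and sketch a fix, like this (the index name is made up):

time sqlite3 100m.db 'EXPLAIN QUERY PLAN select * from timeseries where date between "1927-01-01" and "1927-12-31"'
sqlite3 100m.db 'CREATE INDEX idx_timeseries_date ON TimeSeries(date)'

With your actual access pattern (group = XXX plus a date range), the existing (grp, date) primary key should already serve the query well; the scan only bites in this test because it filters on date alone.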
P.S. I am running on a fairly decent spec iMac with SSD.