I work as a scientist and I have used MySQL as a storage for the results of my numerical simulations. Typically I have a set of data obtained by my experiment and a control set. These two data sets are stored in one table. One indicator field tells me if a record comes from experiment or from a control set. This table usually has ~ 100 million records. 50million experiments and 50 million controls.
When I do the post processing of my data my typical task consists of first issuing the following two queries:
select b0,t0 from results_1mregr_c_ew_f where RC='E' and df>60 /// getting experiments data
and
select b0,t0 from results_1mregr_c_ew_f where RC='C' and df>60 /// getting controls data
I have a multi column index on RC,df. These queries take lots of time and the queries spend most of the time "Sending data"
I'm running this on 8core MacPro with 12GB of RAM. I'm a single user of this machine and this task is the primary task therefore I can dedicate all the RAM to MySQL. All tables are MyISAM (I can convert them if that would increase teh speed of my queries).
I would appreciate any recommendations on how to speed up these queries. Should I change some settings, indices, queries....
In each of these queries I expect to get back ~ 50 million records. Note that splitting the table into two tables one containing experimental and one containing control observation is not an option due to administrative reasons.
Here is the output of:
explain select b0, t0 from results_1mregr_c_ew_f where RC="C" and df>60
+----+-----------+---------------------+-----+-------------+---+-------+----+-------+-----------+
| id |select_type|table |type |possible_keys|key|key_len|ref |rows |Extra |
+----+-----------+---------------------+-----+-------------+---+-------+----+-------+-----------+
| 1 |SIMPLE |results_1mregr_c_ew_f|range|ff |ff |11 |NULL|6251121|Using where|
+----+-----------+---------------------+-----+-------------+---+-------+----+-------+-----------+
Here is the output from:
show indexes from results_1mregr_c_ew_f;
+-----------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-----------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| results_1mregr_c_ew_f | 0 | PRIMARY | 1 | id | A | 50793996 | NULL | NULL | | BTREE | |
| results_1mregr_c_ew_f | 1 | ff | 1 | RC | A | 3 | NULL | NULL | | BTREE | |
| results_1mregr_c_ew_f | 1 | ff | 2 | df | A | 120 | NULL | NULL | | BTREE | |
+-----------------------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
Here is the output from:
CREATE TABLE `results_1mregr_c_ew_f` (
`b0` double NOT NULL COMMENT ' ',
`s0` double NOT NULL,
`t0` double NOT NULL,
`b1` double NOT NULL,
`s1` double NOT NULL,
`t1` double NOT NULL,
`b2` double NOT NULL,
`s2` double NOT NULL,
`t2` double NOT NULL,
`b3` double NOT NULL,
`s3` double NOT NULL,
`t3` double NOT NULL,
`b4` double NOT NULL,
`s4` double NOT NULL,
`t4` double NOT NULL,
`AD` char(4) NOT NULL,
`chisq` double NOT NULL,
`RC` char(7) NOT NULL,
`colq` varchar(255) NOT NULL,
`df` int(11) NOT NULL,
`ncol` int(11) NOT NULL,
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`p1` float NOT NULL,
`p2` float NOT NULL,
`p3` float NOT NULL,
`p4` float NOT NULL,
PRIMARY KEY (`id`),
KEY `ff` (`RC`,`df`)
) ENGINE=MyISAM AUTO_INCREMENT=50793997 DEFAULT CHARSET=ascii |
Can MySQL handle 100 million records? Yeah, it can handle billions of records. If you properly index tables, they fit in memory and your queries are written properly then it shouldn't be an issue.
Something has to be seriously wrong for your query to be taking 2 hours to execute when I can do the same thing in under 60 seconds on similar hardware.
Some of the following might prove helpful...
Check your server configuration and optimise accordingly. Some of the following resources should be useful.
Now for the less obvious...
Why not process all the data inside of MySQL so you don't have to send vast quantities of data to your application layer ? The following example uses a cursor to loop and process 50M rows server side in under 2 minutes. I'm not a huge fan of cursors, especially in MySQL where they are very limited, but I'm guessing you'd be looping the resultset and doing some form of numerical analysis so use of a cursor is justifiable in this case.
Simplified myisam results table - keys based on yours.
drop table if exists results_1mregr_c_ew_f;
create table results_1mregr_c_ew_f
(
id int unsigned not null auto_increment primary key,
rc tinyint unsigned not null,
df int unsigned not null default 0,
val double(10,4) not null default 0,
ts timestamp not null default now(),
key (rc, df)
)
engine=myisam;
I generated 100M rows of data with the key fields having approximately the same cardinality as in your example:
show indexes from results_1mregr_c_ew_f;
Table Non_unique Key_name Seq_in_index Column_name Collation Cardinality Index_type
===== ========== ======== ============ =========== ========= =========== ==========
results_1mregr_c_ew_f 0 PRIMARY 1 id A 100000000 BTREE
results_1mregr_c_ew_f 1 rc 1 rc A 2 BTREE
results_1mregr_c_ew_f 1 rc 2 df A 223 BTREE
Stored procedure
I created a simple stored procedure that fetches the required data and processes it (uses same where condition as your example)
drop procedure if exists process_results_1mregr_c_ew_f;
delimiter #
create procedure process_results_1mregr_c_ew_f
(
in p_rc tinyint unsigned,
in p_df int unsigned
)
begin
declare v_count int unsigned default 0;
declare v_done tinyint default 0;
declare v_id int unsigned;
declare v_result_cur cursor for select id from results_1mregr_c_ew_f where rc = p_rc and df > p_df;
declare continue handler for not found set v_done = 1;
open v_result_cur;
repeat
fetch v_result_cur into v_id;
set v_count = v_count + 1;
-- do work...
until v_done end repeat;
close v_result_cur;
select v_count as counter;
end #
delimiter ;
The following runtimes were observed:
call process_results_1mregr_c_ew_f(0,60);
runtime 1 = 03:24.999 Query OK (3 mins 25 secs)
runtime 2 = 03:32.196 Query OK (3 mins 32 secs)
call process_results_1mregr_c_ew_f(1,60);
runtime 1 = 04:59.861 Query OK (4 mins 59 secs)
runtime 2 = 04:41.814 Query OK (4 mins 41 secs)
counter
========
23000002 (23 million rows processed in each case)
Hmmmm, performance a bit disappointing so onto the next idea.
Why innodb ?? because it has clustered indexes ! You will find inserting slower using innodb but hopefully it will be faster to read so it's a trade off that might be worth it.
Accessing a row through the clustered index is fast because the row data is on the same page where the index search leads. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record. For example, MyISAM uses one file for data rows and another for index records.
More info here :
Simplified innodb results table
drop table if exists results_innodb;
create table results_innodb
(
rc tinyint unsigned not null,
df int unsigned not null default 0,
id int unsigned not null, -- cant auto_inc this !!
val double(10,4) not null default 0,
ts timestamp not null default now(),
primary key (rc, df, id) -- note clustered (innodb only !) composite PK
)
engine=innodb;
One problem with innodb is that is doesnt support auto_increment fields that form part of a composite key so you'd have to provide the incrementing key value yourself using a sequence generator, trigger or some other method - perhaps in the application populating the result table itself ??
Again, I generated 100M rows of data with the key fields having approximately the same cardinality as in your example. Don't worry if these figures don't match the myisam example as innodb estimates the cardinalities so they wont be exactly the same. (but they are - same dataset used)
show indexes from results_innodb;
Table Non_unique Key_name Seq_in_index Column_name Collation Cardinality Index_type
===== ========== ======== ============ =========== ========= =========== ==========
results_innodb 0 PRIMARY 1 rc A 18 BTREE
results_innodb 0 PRIMARY 2 df A 18 BTREE
results_innodb 0 PRIMARY 3 id A 100000294 BTREE
Stored procedure
The stored procedure is exactly the same as the myisam example above but selects data from the innodb table instead.
declare v_result_cur cursor for select id from results_innodb where rc = p_rc and df > p_df;
The results are as follows:
call process_results_innodb(0,60);
runtime 1 = 01:53.407 Query OK (1 mins 53 secs)
runtime 2 = 01:52.088 Query OK (1 mins 52 secs)
call process_results_innodb(1,60);
runtime 1 = 02:01.201 Query OK (2 mins 01 secs)
runtime 2 = 01:49.737 Query OK (1 mins 50 secs)
counter
========
23000002 (23 million rows processed in each case)
approx 2-3 mins faster than the myisam engine implementation ! (innodb FTW)
Processing the results in a server side stored procedure that uses a cursor might not be an optimal solution especially as MySQL doesnt have support for things such arrays and complex data structures that are readily available in 3GL languages such as C# etc or even in other databases such as Oracle PL/SQL.
So the idea here is to return batches of data to an application layer (C# whatever) which can then add the results to a collection based data structure and then process the data internally.
Stored procedure
The stored procedure takes 3 paramaters rc, df_low and df_high which allows you to select a range of data as follows:
call list_results_innodb(0,1,1); -- df 1
call list_results_innodb(0,1,10); -- df between 1 and 10
call list_results_innodb(0,60,120); -- df between 60 and 120 etc...
obviously the higher the df range the more data you'll be extracting.
drop procedure if exists list_results_innodb;
delimiter #
create procedure list_results_innodb
(
in p_rc tinyint unsigned,
in p_df_low int unsigned,
in p_df_high int unsigned
)
begin
select rc, df, id from results_innodb where rc = p_rc and df between p_df_low and p_df_high;
end #
delimiter ;
I also knocked up a myisam version also which is identical except for the table that is used.
call list_results_1mregr_c_ew_f(0,1,1);
call list_results_1mregr_c_ew_f(0,1,10);
call list_results_1mregr_c_ew_f(0,60,120);
Based on the cursor example above I would expect the innodb version to out-perform the myisam one.
I devloped a quick and dirty multi-threaded C# application that will call the stored procedure and add the results to a collection for post query processing. You dont have to use threads, the same batched query approach could be done sequentially without much loss of performance.
Each thread (QueryThread) selects a range of df data, loops the resultset and adds each result (row) to the results collection.
class Program
{
static void Main(string[] args)
{
const int MAX_THREADS = 12;
const int MAX_RC = 120;
List<AutoResetEvent> signals = new List<AutoResetEvent>();
ResultDictionary results = new ResultDictionary(); // thread safe collection
DateTime startTime = DateTime.Now;
int step = (int)Math.Ceiling((double)MAX_RC / MAX_THREADS) -1;
int start = 1, end = 0;
for (int i = 0; i < MAX_THREADS; i++){
end = (i == MAX_THREADS - 1) ? MAX_RC : end + step;
signals.Add(new AutoResetEvent(false));
QueryThread st = new QueryThread(i,signals[i],results,0,start,end);
start = end + 1;
}
WaitHandle.WaitAll(signals.ToArray());
TimeSpan runTime = DateTime.Now - startTime;
Console.WriteLine("{0} results fetched and looped in {1} secs\nPress any key", results.Count, runTime.ToString());
Console.ReadKey();
}
}
Runtime observed as follows:
Thread 04 done - 31580517
Thread 06 done - 44313475
Thread 07 done - 45776055
Thread 03 done - 46292196
Thread 00 done - 47008566
Thread 10 done - 47910554
Thread 02 done - 48194632
Thread 09 done - 48201782
Thread 05 done - 48253744
Thread 08 done - 48332639
Thread 01 done - 48496235
Thread 11 done - 50000000
50000000 results fetched and looped in 00:00:55.5731786 secs
Press any key
So 50 million rows fetched and added to a collection in under 60 seconds.
I tried the same thing using the myisam stored procedure which took 2 minutes to complete.
50000000 results fetched and looped in 00:01:59.2144880 secs
In my simplified system the myisam table doesnt perform too badly so it might not be worth migrating to innodb. If you do decided to copy your result data to an innodb table then do it as follows:
start transaction;
insert into results_innodb
select <fields...> from results_1mregr_c_ew_f order by <innodb primary key>;
commit;
Ordering the result by the innodb PK before inserting and wrapping the whole thing up in a transaction will speed things up.
I hope some of this proves helpful.
Good luck
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With