Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimizing MySQL query with SUM, date range and Group By

Tags:

mysql

I have following tables:

CREATE TABLE IF NOT EXISTS stats (
    date date NOT NULL DEFAULT '0000-00-00',
    cid int(8) NOT NULL DEFAULT '0',
    v bigint(15) NOT NULL DEFAULT '0',
    c bigint(15) NOT NULL DEFAULT '0',
    a bigint(15) NOT NULL DEFAULT '0',
PRIMARY KEY (date,cid),
KEY date (date),
KEY cid (cid),
KEY date_cid_vca (date,cid,v,c,a)
) ENGINE=MyISAM DEFAULT CHARSET=utf8

this table has 30,842,712 rows

and

CREATE TABLE IF NOT EXISTS camp (
id int(8) NOT NULL AUTO_INCREMENT,
name varchar(80) NOT NULL DEFAULT '',
PRIMARY KEY (id,name)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

this table has 1985 rows

I have the following query:

SELECT
    c.id,
    c.name,
    SUM(s.v) AS sumv,
    SUM(s.c) AS sumc,
    GREATEST(((SUM(s.c)/SUM(s.v))*100.00), 0.00) AS cratio,
    SUM(s.a) AS suma,
    GREATEST(((SUM(s.a)/SUM(s.c))*100.00), 0.00) AS aratio
FROM
    stats s, camp c
WHERE
    s.date >= '2012-02-01' AND
    s.date <= '2012-02-29' AND
    c.id=s.cid
GROUP BY s.cid;

EXPLAIN shows:

+----+-------------+-------+-------+-------------------------------+--------------+---------+---------------------+---------+-----------------------------------------------------------+
| id | select_type | table | type  | possible_keys                 | key          | key_len | ref                 | rows    | Extra                                                     |
+----+-------------+-------+-------+-------------------------------+--------------+---------+---------------------+---------+-----------------------------------------------------------+
|  1 | SIMPLE      | s     | range | PRIMARY,date,cid,date_cid_vca | date_cid_vca | 3       | NULL                | 1010265 | Using where; Using index; Using temporary; Using filesort |
|  1 | SIMPLE      | c     | ref   | PRIMARY                       | PRIMARY      | 4       | db.s.cid            |       1 | Using index                                               |
+----+-------------+-------+-------+-------------------------------+--------------+---------+---------------------+---------+-----------------------------------------------------------+

Problem is that the query takes around 50 seconds to complete even though it uses indexes. Is there any other way to optimize the query?

Thanks!

like image 831
Paxxil Avatar asked Feb 17 '12 14:02

Paxxil


2 Answers

You've done well in optimizing this query with your indexes. I suppose you really do have over 1 million rows in stats in that date range. Unfortunately, joining (and then grouping) 1 million rows, even with a covering index, is a lot to ask from a database. For better performance, you will need to beef up the hardware, start denormalizing (put camp inside of stats to avoid the join), or keep running totals for each camp instead of calculating it on the fly.

Edit

Since removing the 1 million+ joins seemed to have made a large impact, you can try something like this:

SELECT c.*, a.* FROM
(SELECT
    SUM(s.v) AS sumv,
    SUM(s.c) AS sumc,
    GREATEST(((SUM(s.c)/SUM(s.v))*100.00), 0.00) AS cratio,
    SUM(s.a) AS suma,
    GREATEST(((SUM(s.a)/SUM(s.c))*100.00), 0.00) AS aratio,
    s.cid
FROM
    stats s
WHERE
    s.date >= '2012-02-01'
   AND s.date <= '2012-02-29'
GROUP BY s.cid) a
JOIN
  camp c
  ON c.id = a.cid

This query does the join on the smaller result set.

like image 191
Marcus Adams Avatar answered Sep 24 '22 05:09

Marcus Adams


The following query should allow it to use indexes more efficiently

SELECT
    c.id,
    c.name,
    SUM(s.v) AS sumv,
    SUM(s.c) AS sumc,
    GREATEST(((SUM(s.c)/SUM(s.v))*100.00), 0.00) AS cratio,
    SUM(s.a) AS suma,
    GREATEST(((SUM(s.a)/SUM(s.c))*100.00), 0.00) AS aratio
FROM
    camp c
INNER JOIN
    stats s
ON
    s.cid = c.id
    AND s.date BETWEEN '2012-02-01' AND '2012-02-29'

GROUP BY c.id;

Also I would consider deleting the date_cid_vca key as it is just containing the entire table and so not particularly useful. The above query should use the PK to match rows from stats to camp based on date and cid and so while it'shard to be 100% certain without access to your database, I'm fairly confident that the above will improve your response times

like image 40
Simon at My School Portal Avatar answered Sep 23 '22 05:09

Simon at My School Portal