We have a table in MySQL with around 30 million records; the following is the table structure:
CREATE TABLE `campaign_logs` (
`domain` varchar(50) DEFAULT NULL,
`campaign_id` varchar(50) DEFAULT NULL,
`subscriber_id` varchar(50) DEFAULT NULL,
`message` varchar(21000) DEFAULT NULL,
`log_time` datetime DEFAULT NULL,
`log_type` varchar(50) DEFAULT NULL,
`level` varchar(50) DEFAULT NULL,
`campaign_name` varchar(500) DEFAULT NULL,
KEY `subscriber_id_index` (`subscriber_id`),
KEY `log_type_index` (`log_type`),
KEY `log_time_index` (`log_time`),
KEY `campid_domain_logtype_logtime_subid_index` (`campaign_id`,`domain`,`log_type`,`log_time`,`subscriber_id`),
KEY `domain_logtype_logtime_index` (`domain`,`log_type`,`log_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Following is my query (I'm using UNION ALL instead of an IN clause):
SELECT log_type,
DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
count(DISTINCT subscriber_id) AS COUNT,
COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
AND campaign_id='123'
AND log_type = 'EMAIL_OPENED'
AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date
UNION ALL
SELECT log_type,
DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
COUNT(DISTINCT subscriber_id) AS COUNT,
COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
AND campaign_id='123'
AND log_type = 'EMAIL_SENT'
AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date
UNION ALL
SELECT log_type,
DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
COUNT(DISTINCT subscriber_id) AS COUNT,
COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
AND campaign_id='123'
AND log_type = 'EMAIL_CLICKED'
AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date;
Following is my EXPLAIN output:
+------+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+
| id   | select_type  | table         | type  | possible_keys                             | key                                       | key_len | ref  | rows   | Extra                                    |
+------+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+
|    1 | PRIMARY      | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index |     468 | NULL |  55074 | Using where; Using index; Using filesort |
|    2 | UNION        | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index |     468 | NULL | 330578 | Using where; Using index; Using filesort |
|    3 | UNION        | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index |     468 | NULL |   1589 | Using where; Using index; Using filesort |
| NULL | UNION RESULT | <union1,2,3>  | ALL   | NULL                                      | NULL                                      |    NULL | NULL |   NULL |                                          |
+------+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+
I removed COUNT(DISTINCT subscriber_id) from the query and got a huge performance gain: I'm now getting results in approx 1.5 sec, whereas previously it was taking 50 sec to 1 minute. But I need the distinct count of subscriber_id from the query.
Following is the EXPLAIN output when I remove COUNT(DISTINCT subscriber_id) from the query:
+------+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+
| id   | select_type  | table         | type  | possible_keys                             | key                                       | key_len | ref  | rows   | Extra                                                     |
+------+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+
|    1 | PRIMARY      | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index |     468 | NULL |  55074 | Using where; Using index; Using temporary; Using filesort |
|    2 | UNION        | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index |     468 | NULL | 330578 | Using where; Using index; Using temporary; Using filesort |
|    3 | UNION        | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index |     468 | NULL |   1589 | Using where; Using index; Using temporary; Using filesort |
| NULL | UNION RESULT | <union1,2,3>  | ALL   | NULL                                      | NULL                                      |    NULL | NULL |   NULL |                                                           |
+------+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+
I could solve my performance problem by leaving out COUNT(DISTINCT...), but I need those values. Is there a way to refactor my query, or add an index, or something, to get the COUNT(DISTINCT...) values, but much faster?
UPDATE: the following information is about the data distribution of the above table: for 1 domain there is 1 campaign, 20 log_types, and 1k-200k subscribers. The query above is running for a domain with 180k+ subscribers.
If the query without the COUNT(DISTINCT) is going much faster, perhaps you can do nested aggregation:
SELECT log_type, log_date,
count(*) AS COUNT, sum(cnt) AS total
FROM (SELECT log_type,
DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
subscriber_id, count(*) as cnt
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN = 'xxx' AND
campaign_id = '123' AND
log_type IN ('EMAIL_SENT', 'EMAIL_OPENED', 'EMAIL_CLICKED') AND
log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND
CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_type, log_date, subscriber_id
) l
GROUP BY log_type, log_date;
With a bit of luck, this will take 2-3 seconds rather than 50. However, you might need to break this out into subqueries to get full performance. So, if this does not have a significant performance gain, change the IN back to = for one of the types; a sketch of that single-type variant follows. If that works, then the UNION ALL may be necessary.
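For example, the EMAIL_OPENED branch on its own might look like this (a sketch; the EMAIL_SENT and EMAIL_CLICKED branches would be identical copies, recombined with UNION ALL if they turn out to be needed):
-- single log_type only: the inner GROUP BY collapses each subscriber to one
-- row, so the outer COUNT(*) is the distinct-subscriber count
SELECT log_type, log_date,
       COUNT(*) AS COUNT, SUM(cnt) AS total
FROM (SELECT log_type,
             DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
             subscriber_id, COUNT(*) AS cnt
      FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
      WHERE DOMAIN = 'xxx' AND
            campaign_id = '123' AND
            log_type = 'EMAIL_OPENED' AND
            log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND
                             CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
      GROUP BY log_type, log_date, subscriber_id
     ) l
GROUP BY log_type, log_date;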
EDIT:
Another attempt is to use variables to enumerate the values before the GROUP BY:
SELECT log_type, log_date, count(*) as cnt,
SUM(rn = 1) as sub_cnt
FROM (SELECT log_type,
DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
subscriber_id,
       -- rn restarts at 1 for each (log_type, log_date, subscriber_id) group,
       -- matching the ORDER BY, so SUM(rn = 1) counts distinct subscribers
       (@rn := if(@clt = concat_ws(':', log_type,
                                   DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p'),
                                   subscriber_id), @rn + 1,
                  if(@clt := concat_ws(':', log_type,
                                       DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p'),
                                       subscriber_id), 1, 1)
                 )
       ) as rn
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index) CROSS JOIN
     (SELECT @rn := 0, @clt := '') vars
WHERE DOMAIN = 'xxx' AND
campaign_id = '123' AND
log_type IN ('EMAIL_SENT', 'EMAIL_OPENED', 'EMAIL_CLICKED') AND
log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00', '+00:00', '+05:30') AND
CONVERT_TZ('2015-03-01 23:59:58', '+00:00', '+05:30')
ORDER BY log_type, log_date, subscriber_id
) t
GROUP BY log_type, log_date;
This still requires another sort of the data, but it might help.
To answer your question:
Is there a way to refactor my query, or add an index, or something, to get the COUNT(DISTINCT...) values, but much faster?
Yes, do not group by the calculated field (do not group by the result of the function). Instead, pre-calculate it, save it to a persistent column, and include this persistent column in the index.
I would try to do the following and see if it changes performance significantly.

1) Simplify the query and focus on one part. Leave only the longest-running SELECT of the three and get rid of the UNION for the tuning period. Once the longest SELECT is optimized, add the others back and check how the full query works.
2) Grouping by the result of a function doesn't let the engine use the index efficiently. Add another column to the table (at first temporarily, just to check the idea) with the result of this function. As far as I can see you want to group by 1 hour, so add a column log_time_hour datetime and set it to log_time rounded/truncated to the nearest hour (preserving the date component).

Add an index using the new column: (domain, campaign_id, log_type, log_time_hour, subscriber_id). The order of the first three columns in the index should not matter (because the query compares them for equality with constants, not ranges), but make them the same order as in the query. Or, better, put them in the index definition and in the query in order of selectivity: if you have 100,000 campaigns, 1,000 domains and 3 log types, then put them in this order: campaign_id, domain, log_type. It should not matter much, but is worth checking. log_time_hour has to come fourth in the index definition and subscriber_id last. A sketch of the DDL these steps imply is shown below.
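A minimal sketch of that setup (the index name is illustrative, and back-filling ~30 million rows will take a while, so try it on a copy of the table first):
-- add the pre-calculated hour column
ALTER TABLE stats.campaign_logs
  ADD COLUMN log_time_hour DATETIME NULL;

-- back-fill: truncate log_time to the hour, preserving the date component
UPDATE stats.campaign_logs
SET log_time_hour = DATE_FORMAT(log_time, '%Y-%m-%d %H:00:00');

-- equality columns first, then the range column, then subscriber_id
ALTER TABLE stats.campaign_logs
  ADD INDEX campid_domain_logtype_loghour_subid_index
    (campaign_id, domain, log_type, log_time_hour, subscriber_id);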
In the query, use the new column in the WHERE and in the GROUP BY. Make sure that you include all needed columns in the GROUP BY: both log_type and log_time_hour.
Do you need both COUNT and COUNT(DISTINCT)? Leave only COUNT first and measure the performance, then leave only COUNT(DISTINCT) and measure the performance, then leave both and measure the performance. See how they compare.
SELECT log_type,
log_time_hour,
count(DISTINCT subscriber_id) AS distinct_total,
COUNT(subscriber_id) AS total
FROM stats.campaign_logs
WHERE DOMAIN='xxx'
AND campaign_id='123'
AND log_type = 'EMAIL_OPENED'
AND log_time_hour >= '2015-02-01 00:00:00'
AND log_time_hour < '2015-03-02 00:00:00'
GROUP BY log_type, log_time_hour;
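Once that single-type version is fast, the other two types can be added back. A sketch of the combined version, assuming the log_time_hour column and index above are in place:
-- all three types in one pass; GROUP BY log_type keeps the counts per type
SELECT log_type,
       log_time_hour,
       COUNT(DISTINCT subscriber_id) AS distinct_total,
       COUNT(subscriber_id) AS total
FROM stats.campaign_logs
WHERE DOMAIN='xxx'
  AND campaign_id='123'
  AND log_type IN ('EMAIL_SENT','EMAIL_OPENED','EMAIL_CLICKED')
  AND log_time_hour >= '2015-02-01 00:00:00'
  AND log_time_hour < '2015-03-02 00:00:00'
GROUP BY log_type, log_time_hour;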