I have the following query:
SELECT analytics.source AS referrer,
       COUNT(analytics.id) AS frequency,
       SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
FROM analytics
LEFT JOIN transactions ON analytics.id = transactions.analytics
WHERE analytics.user_id = 52094
GROUP BY analytics.source
ORDER BY frequency DESC
LIMIT 10
The analytics table has 60M rows and the transactions table has 3M rows.
When I run an EXPLAIN on this query, I get:
+----+-------------+--------------+------+-------------------------------------+-------------------+---------+------------------------+--------+-----------------------------------------------+
| id | select_type | table        | type | possible_keys                       | key               | key_len | ref                    | rows   | Extra                                         |
+----+-------------+--------------+------+-------------------------------------+-------------------+---------+------------------------+--------+-----------------------------------------------+
|  1 | SIMPLE      | analytics    | ref  | analytics_user_id,analytics_source  | analytics_user_id | 5       | const                  | 337662 | Using where; Using temporary; Using filesort  |
|  1 | SIMPLE      | transactions | ref  | tran_analytics                      | tran_analytics    | 5       | dijishop2.analytics.id |      1 | NULL                                          |
+----+-------------+--------------+------+-------------------------------------+-------------------+---------+------------------------+--------+-----------------------------------------------+
I can't figure out how to optimise this query any further, as it's already very basic. It takes around 70 seconds to run.
Here are the indexes that exist:
+-----------+------------+-------------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table     | Non_unique | Key_name                | Seq_in_index | Column_name   | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+-----------+------------+-------------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| analytics | 0          | PRIMARY                 | 1            | id            | A         | 56934235    | NULL     | NULL   |      | BTREE      |         |               |
| analytics | 1          | analytics_user_id       | 1            | user_id       | A         | 130583      | NULL     | NULL   | YES  | BTREE      |         |               |
| analytics | 1          | analytics_product_id    | 1            | product_id    | A         | 490812      | NULL     | NULL   | YES  | BTREE      |         |               |
| analytics | 1          | analytics_affil_user_id | 1            | affil_user_id | A         | 55222       | NULL     | NULL   | YES  | BTREE      |         |               |
| analytics | 1          | analytics_source        | 1            | source        | A         | 24604       | NULL     | NULL   | YES  | BTREE      |         |               |
| analytics | 1          | analytics_country_name  | 1            | country_name  | A         | 39510       | NULL     | NULL   | YES  | BTREE      |         |               |
| analytics | 1          | analytics_gordon        | 1            | id            | A         | 56934235    | NULL     | NULL   |      | BTREE      |         |               |
| analytics | 1          | analytics_gordon        | 2            | user_id       | A         | 56934235    | NULL     | NULL   | YES  | BTREE      |         |               |
| analytics | 1          | analytics_gordon        | 3            | source        | A         | 56934235    | NULL     | NULL   | YES  | BTREE      |         |               |
+-----------+------------+-------------------------+--------------+---------------+-----------+-------------+----------+--------+------+------------+---------+---------------+

+--------------+------------+-----------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| Table        | Non_unique | Key_name        | Seq_in_index | Column_name    | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
+--------------+------------+-----------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
| transactions | 0          | PRIMARY         | 1            | id             | A         | 2436151     | NULL     | NULL   |      | BTREE      |         |               |
| transactions | 1          | tran_user_id    | 1            | user_id        | A         | 56654       | NULL     | NULL   |      | BTREE      |         |               |
| transactions | 1          | transaction_id  | 1            | transaction_id | A         | 2436151     | 191      | NULL   | YES  | BTREE      |         |               |
| transactions | 1          | tran_analytics  | 1            | analytics      | A         | 2436151     | NULL     | NULL   | YES  | BTREE      |         |               |
| transactions | 1          | tran_status     | 1            | status         | A         | 22          | NULL     | NULL   | YES  | BTREE      |         |               |
| transactions | 1          | gordon_trans    | 1            | status         | A         | 22          | NULL     | NULL   | YES  | BTREE      |         |               |
| transactions | 1          | gordon_trans    | 2            | analytics      | A         | 2436151     | NULL     | NULL   | YES  | BTREE      |         |               |
+--------------+------------+-----------------+--------------+----------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
Here is the simplified schema for the two tables, before adding any of the extra indexes that were suggested (adding them didn't improve the situation).
CREATE TABLE `analytics` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `user_id` int(11) DEFAULT NULL,
  `affil_user_id` int(11) DEFAULT NULL,
  `product_id` int(11) DEFAULT NULL,
  `medium` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `source` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `terms` varchar(1024) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `is_browser` tinyint(1) DEFAULT NULL,
  `is_mobile` tinyint(1) DEFAULT NULL,
  `is_robot` tinyint(1) DEFAULT NULL,
  `browser` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `mobile` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `robot` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `platform` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `referrer` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `domain` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `ip` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `continent_code` varchar(10) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `country_name` varchar(100) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `city` varchar(100) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `analytics_user_id` (`user_id`),
  KEY `analytics_product_id` (`product_id`),
  KEY `analytics_affil_user_id` (`affil_user_id`)
) ENGINE=InnoDB AUTO_INCREMENT=64821325 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;

CREATE TABLE `transactions` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `transaction_id` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `user_id` int(11) NOT NULL,
  `pay_key` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `sender_email` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `amount` decimal(10,2) DEFAULT NULL,
  `currency` varchar(10) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `status` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `analytics` int(11) DEFAULT NULL,
  `ip_address` varchar(46) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `session_id` varchar(60) COLLATE utf8mb4_unicode_ci DEFAULT NULL,
  `date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `eu_vat_applied` int(1) DEFAULT '0',
  PRIMARY KEY (`id`),
  KEY `tran_user_id` (`user_id`),
  KEY `transaction_id` (`transaction_id`(191)),
  KEY `tran_analytics` (`analytics`),
  KEY `tran_status` (`status`)
) ENGINE=InnoDB AUTO_INCREMENT=10019356 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;
If the above can't be optimised any further, any implementation advice on summary tables would be great. We are using a LAMP stack on AWS; the above query is running on RDS (m1.large).
Can MySQL handle 100 million records? Yes, it can handle billions of records. If the tables are properly indexed, they fit in memory, and your queries are written properly, it shouldn't be an issue.
I would create the following indexes (b-tree indexes):
analytics(user_id, source, id)
transactions(analytics, status)
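In DDL form that would be something like the following (the index names here are just placeholders I made up; use whatever fits your naming convention):

ALTER TABLE analytics
    ADD INDEX analytics_user_source_id (user_id, source, id);

ALTER TABLE transactions
    ADD INDEX tran_analytics_status (analytics, status);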
This is different from Gordon's suggestion.
The order of columns in the index is important.
You filter by a specific analytics.user_id, so this field has to be the first in the index. Then you group by analytics.source; to avoid sorting by source, this should be the next field of the index. You also reference analytics.id, so it is better to have this field as part of the index as well; put it last. Is MySQL capable of reading just the index and not touching the table? I don't know, but it is rather easy to test, as shown below.
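A quick way to check, once the indexes are in place, is to run EXPLAIN and look at the Extra column; "Using index" for a table means MySQL satisfied that table's part of the query from the index alone (exact output will vary with your data and MySQL version):

EXPLAIN
SELECT analytics.source AS referrer,
       COUNT(analytics.id) AS frequency,
       SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
FROM analytics
LEFT JOIN transactions ON analytics.id = transactions.analytics
WHERE analytics.user_id = 52094
GROUP BY analytics.source
ORDER BY frequency DESC
LIMIT 10;
-- "Using index" in Extra for a table means that table's rows were never read,
-- only its index was.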
The index on transactions has to start with analytics, because it will be used in the JOIN. We also need status.
SELECT analytics.source AS referrer,
       COUNT(analytics.id) AS frequency,
       SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales
FROM analytics
LEFT JOIN transactions ON analytics.id = transactions.analytics
WHERE analytics.user_id = 52094
GROUP BY analytics.source
ORDER BY frequency DESC
LIMIT 10
First some analysis...
SELECT  a.source AS referrer,
        COUNT(*) AS frequency,          -- See question below
        SUM(t.status = 'COMPLETED') AS sales
    FROM  analytics AS a
    LEFT JOIN  transactions AS t  ON a.id = t.analytics
    WHERE  a.user_id = 52094
    GROUP BY  a.source
    ORDER BY  frequency DESC
    LIMIT  10
If the mapping from a to t is "one-to-many", then you need to consider whether the COUNT and SUM have the correct values or inflated values. As the query stands, they are "inflated". The JOIN occurs before the aggregation, so you are counting the number of transactions and how many were completed. I'll assume that is desired.
Note: The usual pattern is COUNT(*); saying COUNT(x) implies checking x for being NULL. I suspect that check is not needed?
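A small sketch of the distinction, using the tables from the question: COUNT(analytics.id) and COUNT(*) agree here because analytics.id is never NULL, whereas counting a column from the right-hand side of the LEFT JOIN would skip unmatched rows.

SELECT  a.source,
        COUNT(*)    AS all_joined_rows,          -- every row the JOIN produces
        COUNT(a.id) AS same_thing_here,          -- a.id is NOT NULL, so identical
        COUNT(t.id) AS rows_with_a_transaction   -- NULLs from the LEFT JOIN are skipped
    FROM  analytics AS a
    LEFT JOIN  transactions AS t  ON a.id = t.analytics
    WHERE  a.user_id = 52094
    GROUP BY  a.source;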
This index handles the WHERE and is "covering":

analytics:     INDEX(user_id, source, id)  -- user_id first
transactions:  INDEX(analytics, status)    -- in this order
The GROUP BY may or may not require a 'sort'. The ORDER BY, being different than the GROUP BY, definitely will need a sort. And the entire grouped set of rows will need to be sorted; there is no shortcut for the LIMIT.
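One way to see this (a sketch; check the Extra column of EXPLAIN): with the covering index in place, ordering by the grouped column can follow index order, while ordering by the aggregate still forces a filesort of the grouped result.

-- Expected: no "Using filesort", since the order matches the GROUP BY / the index
EXPLAIN SELECT a.source, COUNT(*) AS frequency
    FROM analytics AS a
    WHERE a.user_id = 52094
    GROUP BY a.source
    ORDER BY a.source;

-- Expected: "Using filesort", because frequency is only known after grouping
EXPLAIN SELECT a.source, COUNT(*) AS frequency
    FROM analytics AS a
    WHERE a.user_id = 52094
    GROUP BY a.source
    ORDER BY frequency DESC;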
Normally, Summary tables are date-oriented. That is, the PRIMARY KEY includes a 'date' and some other dimensions. Perhaps keying by date and user_id would make sense? How many transactions per day does the average user have? If at least 10, then let's consider a Summary table. Also, it is important not to be UPDATEing or DELETEing old records.
I would probably have
user_id ...,
source ...,
dy DATE ...,
status ...,
freq MEDIUMINT UNSIGNED NOT NULL,
status_ct MEDIUMINT UNSIGNED NOT NULL,
PRIMARY KEY(user_id, status, source, dy)
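Filled out as complete DDL it might look like this; the table name and the column types I've filled in are guesses based on the schema in the question, so adjust them to your data (note that NULL source or status values would need to be mapped to something, since PRIMARY KEY columns cannot be NULL):

CREATE TABLE Summary (
    user_id   INT NOT NULL,
    source    VARCHAR(45) NOT NULL,          -- matches analytics.source
    dy        DATE NOT NULL,
    status    VARCHAR(50) NOT NULL,          -- matches transactions.status
    freq      MEDIUMINT UNSIGNED NOT NULL,   -- visits for that day
    status_ct MEDIUMINT UNSIGNED NOT NULL,   -- matching transactions for that day
    PRIMARY KEY (user_id, status, source, dy)
) ENGINE=InnoDB;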
Then the query becomes
SELECT  source AS referrer,
        SUM(freq) AS frequency,
        SUM(status_ct) AS completed_sales
    FROM  Summary
    WHERE  user_id = 52094
      AND  status = 'COMPLETED'
    GROUP BY  source
    ORDER BY  frequency DESC
    LIMIT  10
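Maintaining the summary table could be a periodic roll-up of the previous day (or hour). The sketch below works against the hypothetical Summary DDL above, keeps only a 'COMPLETED' row per user/source/day, and uses the date column from the question's schema; treat it as an outline, not a drop-in script.

INSERT INTO Summary (user_id, source, dy, status, freq, status_ct)
SELECT  a.user_id,
        IFNULL(a.source, '') AS source,                  -- source is in the PK, so map NULLs
        DATE(a.date)         AS dy,
        'COMPLETED'          AS status,
        COUNT(*)                           AS freq,      -- visits that day
        SUM(IF(t.status = 'COMPLETED', 1, 0)) AS status_ct -- completed transactions
    FROM  analytics AS a
    LEFT JOIN  transactions AS t  ON a.id = t.analytics
    WHERE  a.date >= CURDATE() - INTERVAL 1 DAY
      AND  a.date <  CURDATE()
    GROUP BY  a.user_id, IFNULL(a.source, ''), DATE(a.date)
ON DUPLICATE KEY UPDATE
    freq      = freq + VALUES(freq),
    status_ct = status_ct + VALUES(status_ct);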
The speed comes from several factors: the Summary table is far smaller than the raw analytics table, and there is no JOIN at query time. (It still needs the extra sort for the ORDER BY.)
Even without the summary table, there may be some speedups...

- Normalizing some of the strings that are both bulky and repetitive could make that table not I/O-bound.
- KEY (transaction_id(191)) is a prefix index, which is a problem; see here for 5 ways to fix it.
- The ip column does not need varchar(255) with utf8mb4_unicode_ci; varchar(39) and ascii are sufficient.
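For example, shrinking the ip column is a single statement (a sketch; confirm first that no stored value is longer than 39 characters):

ALTER TABLE analytics
    MODIFY ip VARCHAR(39) CHARACTER SET ascii DEFAULT NULL;
    -- note: this rebuilds the table, which will take a while on 60M rows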