Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to optimise this MySQL query? Millions of Rows

I have the following query:

SELECT      analytics.source AS referrer,      COUNT(analytics.id) AS frequency,      SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales FROM analytics LEFT JOIN transactions ON analytics.id = transactions.analytics WHERE analytics.user_id = 52094  GROUP BY analytics.source  ORDER BY frequency DESC  LIMIT 10  

The analytics table has 60M rows and the transactions table has 3M rows.

When I run an EXPLAIN on this query, I get:

+------+--------------+-----------------+--------+---------------------+-------------------+----------------------+---------------------------+----------+-----------+-------------------------------------------------+ | # id |  select_type |      table      |  type  |    possible_keys    |        key        |        key_len       |            ref            |   rows   |   Extra   |                                                 | +------+--------------+-----------------+--------+---------------------+-------------------+----------------------+---------------------------+----------+-----------+-------------------------------------------------+ | '1'  |  'SIMPLE'    |  'analytics'    |  'ref' |  'analytics_user_id | analytics_source' |  'analytics_user_id' |  '5'                      |  'const' |  '337662' |  'Using where; Using temporary; Using filesort' | | '1'  |  'SIMPLE'    |  'transactions' |  'ref' |  'tran_analytics'   |  'tran_analytics' |  '5'                 |  'dijishop2.analytics.id' |  '1'     |  NULL     |                                                 | +------+--------------+-----------------+--------+---------------------+-------------------+----------------------+---------------------------+----------+-----------+-------------------------------------------------+ 

I can't figure out how to optimise this query as it's already very basic. It takes around 70 seconds to run this query.

Here are the indexes that exist:

+-------------+-------------+----------------------------+---------------+------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+ |   # Table   |  Non_unique |          Key_name          |  Seq_in_index |    Column_name   |  Collation |  Cardinality |  Sub_part |  Packed |  Null  |  Index_type |  Comment |  Index_comment | +-------------+-------------+----------------------------+---------------+------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+ | 'analytics' |  '0'        |  'PRIMARY'                 |  '1'          |  'id'            |  'A'       |  '56934235'  |  NULL     |  NULL   |  ''    |  'BTREE'    |  ''      |  ''            | | 'analytics' |  '1'        |  'analytics_user_id'       |  '1'          |  'user_id'       |  'A'       |  '130583'    |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            | | 'analytics' |  '1'        |  'analytics_product_id'    |  '1'          |  'product_id'    |  'A'       |  '490812'    |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            | | 'analytics' |  '1'        |  'analytics_affil_user_id' |  '1'          |  'affil_user_id' |  'A'       |  '55222'     |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            | | 'analytics' |  '1'        |  'analytics_source'        |  '1'          |  'source'        |  'A'       |  '24604'     |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            | | 'analytics' |  '1'        |  'analytics_country_name'  |  '1'          |  'country_name'  |  'A'       |  '39510'     |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            | | 'analytics' |  '1'        |  'analytics_gordon'        |  '1'          |  'id'            |  'A'       |  '56934235'  |  NULL     |  NULL   |  ''    |  'BTREE'    |  ''      |  ''            | | 'analytics' |  '1'        |  'analytics_gordon'        |  '2'          |  'user_id'       |  'A'       |  '56934235'  |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            | | 'analytics' |  '1'        |  'analytics_gordon'        |  '3'          |  'source'        |  'A'       |  '56934235'  |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            | +-------------+-------------+----------------------------+---------------+------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+   +----------------+-------------+-------------------+---------------+-------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+ |    # Table     |  Non_unique |      Key_name     |  Seq_in_index |    Column_name    |  Collation |  Cardinality |  Sub_part |  Packed |  Null  |  Index_type |  Comment |  Index_comment | +----------------+-------------+-------------------+---------------+-------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+ | 'transactions' |  '0'        |  'PRIMARY'        |  '1'          |  'id'             |  'A'       |  '2436151'   |  NULL     |  NULL   |  ''    |  'BTREE'    |  ''      |  ''            | | 'transactions' |  '1'        |  'tran_user_id'   |  '1'          |  'user_id'        |  'A'       |  '56654'     |  NULL     |  NULL   |  ''    |  'BTREE'    |  ''      |  ''            | | 'transactions' |  '1'        |  'transaction_id' |  '1'          |  'transaction_id' |  'A'       |  '2436151'   |  '191'    |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            | | 'transactions' |  '1'        |  'tran_analytics' |  '1'          |  'analytics'      |  'A'       |  '2436151'   |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            | | 'transactions' |  '1'        |  'tran_status'    |  '1'          |  'status'         |  'A'       |  '22'        |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            | | 'transactions' |  '1'        |  'gordon_trans'   |  '1'          |  'status'         |  'A'       |  '22'        |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            | | 'transactions' |  '1'        |  'gordon_trans'   |  '2'          |  'analytics'      |  'A'       |  '2436151'   |  NULL     |  NULL   |  'YES' |  'BTREE'    |  ''      |  ''            | +----------------+-------------+-------------------+---------------+-------------------+------------+--------------+-----------+---------+--------+-------------+----------+----------------+ 

Simplified schema for the two tables before adding any extra indexes as suggested as it didn't improve the situation.

CREATE TABLE `analytics` (   `id` int(11) NOT NULL AUTO_INCREMENT,   `user_id` int(11) DEFAULT NULL,   `affil_user_id` int(11) DEFAULT NULL,   `product_id` int(11) DEFAULT NULL,   `medium` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,   `source` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,   `terms` varchar(1024) COLLATE utf8mb4_unicode_ci DEFAULT NULL,   `is_browser` tinyint(1) DEFAULT NULL,   `is_mobile` tinyint(1) DEFAULT NULL,   `is_robot` tinyint(1) DEFAULT NULL,   `browser` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,   `mobile` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,   `robot` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,   `platform` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,   `referrer` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,   `domain` varchar(45) COLLATE utf8mb4_unicode_ci DEFAULT NULL,   `ip` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,   `continent_code` varchar(10) COLLATE utf8mb4_unicode_ci DEFAULT NULL,   `country_name` varchar(100) COLLATE utf8mb4_unicode_ci DEFAULT NULL,   `city` varchar(100) COLLATE utf8mb4_unicode_ci DEFAULT NULL,   `date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,   PRIMARY KEY (`id`),   KEY `analytics_user_id` (`user_id`),   KEY `analytics_product_id` (`product_id`),   KEY `analytics_affil_user_id` (`affil_user_id`) ) ENGINE=InnoDB AUTO_INCREMENT=64821325 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci;  CREATE TABLE `transactions` (   `id` int(11) NOT NULL AUTO_INCREMENT,   `transaction_id` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,   `user_id` int(11) NOT NULL,   `pay_key` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,   `sender_email` varchar(255) COLLATE utf8mb4_unicode_ci DEFAULT NULL,   `amount` decimal(10,2) DEFAULT NULL,   `currency` varchar(10) COLLATE utf8mb4_unicode_ci DEFAULT NULL,   `status` varchar(50) COLLATE utf8mb4_unicode_ci DEFAULT NULL,   `analytics` int(11) DEFAULT NULL,   `ip_address` varchar(46) COLLATE utf8mb4_unicode_ci DEFAULT NULL,   `session_id` varchar(60) COLLATE utf8mb4_unicode_ci DEFAULT NULL,   `date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,   `eu_vat_applied` int(1) DEFAULT '0',   PRIMARY KEY (`id`),   KEY `tran_user_id` (`user_id`),   KEY `transaction_id` (`transaction_id`(191)),   KEY `tran_analytics` (`analytics`),   KEY `tran_status` (`status`) ) ENGINE=InnoDB AUTO_INCREMENT=10019356 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci; 

If the above can not be optimised any further. Any implementation advice on summary tables will be great. We are using a LAMP stack on AWS. The above query is running on RDS (m1.large).

like image 575
Abs Avatar asked Jun 21 '18 17:06

Abs


People also ask

Can MySQL handle 1 million records?

Can MySQL handle 100 million records? Yeah, it can handle billions of records. If you properly index tables, they fit in memory and your queries are written properly then it shouldn't be an issue.


2 Answers

I would create the following indexes (b-tree indexes):

analytics(user_id, source, id)  transactions(analytics, status) 

This is different from Gordon's suggestion.

The order of columns in the index is important.

You filter by specific analytics.user_id, so this field has to be the first in the index. Then you group by analytics.source. To avoid sorting by source this should be the next field of the index. You also reference analytics.id, so it is better to have this field as part of the index, put it last. Is MySQL capable of reading just the index and not touching the table? I don't know, but it is rather easy to test.

Index on transactions has to start with analytics, because it would be used in the JOIN. We also need status.

SELECT      analytics.source AS referrer,      COUNT(analytics.id) AS frequency,      SUM(IF(transactions.status = 'COMPLETED', 1, 0)) AS sales FROM analytics LEFT JOIN transactions ON analytics.id = transactions.analytics WHERE analytics.user_id = 52094  GROUP BY analytics.source  ORDER BY frequency DESC  LIMIT 10  
like image 69
Vladimir Baranov Avatar answered Sep 18 '22 20:09

Vladimir Baranov


First some analysis...

SELECT  a.source AS referrer,         COUNT(*) AS frequency,  -- See question below         SUM(t.status = 'COMPLETED') AS sales     FROM  analytics AS a     LEFT JOIN  transactions AS t  ON a.id = t.analytics AS a     WHERE  a.user_id = 52094     GROUP BY  a.source     ORDER BY  frequency DESC     LIMIT  10  

If the mapping from a to t is "one-to-many", then you need to consider whether the COUNT and SUM have the correct values or inflated values. As the query stands, they are "inflated". The JOIN occurs before the aggregation, so you are counting the number of transactions and how many were completed. I'll assume that is desired.

Note: The usual pattern is COUNT(*); saying COUNT(x) implies checking x for being NULL. I suspect that check is not needed?

This index handles the WHERE and is "covering":

 analytics:  INDEX(user_id, source, id)   -- user_id first   transactions:  INDEX(analytics, status)  -- in this order 

The GROUP BY may or may not require a 'sort'. The ORDER BY, being different than the GROUP BY, definitely will need a sort. And the entire grouped set of rows will need to be sorted; there is no shortcut for the LIMIT.

Normally, Summary tables are date-oriented. That is, the PRIMARY KEY includes a 'date' and some other dimensions. Perhaps, keying by date and user_id would make sense? How many transactions per day does the average user have? If at least 10, then let's consider a Summary table. Also, it is important not to be UPDATEing or DELETEing old records. More

I would probably have

user_id ..., source ..., dy DATE ..., status ..., freq      MEDIUMINT UNSIGNED NOT NULL, status_ct MEDIUMINT UNSIGNED NOT NULL, PRIMARY KEY(user_id, status, source, dy) 

Then the query becomes

SELECT  source AS referrer,         SUM(freq) AS frequency,         SUM(status_ct) AS completed_sales     FROM  Summary     WHERE  user_id = 52094       AND  status = 'COMPLETED'     GROUP BY source     ORDER BY  frequency DESC     LIMIT  10  

The speed comes from many factors

  • Smaller table (fewer rows to look at)
  • No JOIN
  • More useful index

(It still needs the extra sort.)

Even without the summary table, there may be some speedups...

  • How big are the tables? How big is `innodb_buffer_pool_size?
  • Normalizing some of the strings that are both bulky and repetitive could make that table not I/O-bound.
  • This is awful: KEY (transaction_id(191)); See here for 5 ways to fix it.
  • IP addresses do not need 255 bytes, nor utf8mb4_unicode_ci. (39) and ascii are sufficient.
like image 25
Rick James Avatar answered Sep 18 '22 20:09

Rick James