Make MySQL to choose the best index for a query

In a MySQL 5.6 database I have a huge table with the following structure:

CREATE TABLE `tbl_requests` (
    `request_id` BIGINT(20) UNSIGNED NOT NULL,
    `option_id` BIGINT(20) UNSIGNED NOT NULL,
    `symbol` VARCHAR(30) NOT NULL,
    `request_time` DATETIME(6) NOT NULL,
    `request_type` SMALLINT(6) NOT NULL,
    `count` INT(11) NOT NULL,
    PRIMARY KEY (`request_id`),
    INDEX `key_request_type_symbol` (`request_type`, `symbol`),
    INDEX `key_request_time` (`request_time`),
    INDEX `key_request_symbol` (`symbol`)
);

The table holds over 800 million records, with about 25,000 distinct values of the symbol field and about 100 distinct values of request_type. My goal is to make a query like the following as fast as possible:

SELECT tbl_requests.*
FROM tbl_requests  use index (key_request_type_symbol)
-- use index (key_request_time) -- use index (key_request_type_symbol)
WHERE (tbl_requests.request_time >= '2016-02-23' AND 
       tbl_requests.request_time <= '2016-12-23') 
AND (tbl_requests.request_type IN (0, 1, 9))  
[AND (tbl_requests.symbol = 'AAPL' ... )]
ORDER BY tbl_requests.request_time DESC, tbl_requests.request_id DESC
LIMIT 0,100;

with different kinds of filtering on the tbl_requests.symbol field: no filter, a set of values, a set of matching patterns, or a mix of these.

What I see is that different indexes give the best performance in different cases, and MySQL is unable to guess which one will be better. For example, with no filter the fastest is the key_request_time index (0.016 sec.), and MySQL correctly selects it (EXPLAIN output):

"id": 1,
"select_type": "SIMPLE",
"table": "tbl_requests",
"type": "range",
"possible_keys": "key_request_type_symbol,key_request_time",
"key": "key_request_time",
"key_len": "8",
"ref": null,
"rows": 428944675,
"Extra": "Using index condition; Using where"

If the key_request_type_symbol index were used, this query would take a huge amount of time (maybe hours?).

I use syntax

FROM tbl_requests use index (key_request_type_symbol)

to force using an index.

When one symbol is used in the filter

AND (tbl_requests.symbol = 'BAC')

the MySQL server chooses the same key_request_time index, and the query takes more than 10 sec. But if the key_request_type_symbol index is used, the query takes about 0.7 sec. Also, with the first index a repeated query still takes over 10 sec., while with the second index repeated queries take 0.1 sec.

EXPLAIN output for the key_request_type_symbol index:

"id": 1,
"select_type": "SIMPLE",
"table": "tbl_requests",
"type": "range",
"possible_keys": "key_request_type_symbol",
"key": "key_request_type_symbol",
"key_len": "34",
"ref": null,
"rows": 17117,
"Extra": "Using index condition; Using where; Using filesort"

Far fewer rows, but with a filesort.

It looks like, with key_request_type_symbol, what matters is how many matching rows are in the table. For the "AMZN" symbol, rows = 79762 and the time is 0.15 sec., while with the key_request_time index it takes 4.4 sec. Yet MySQL prefers key_request_time over key_request_type_symbol.

This is clear in the following example. If I use:

tbl_requests.symbol LIKE 'A%' 

with key_request_time index it takes 0.172 sec.
with key_request_type_symbol index it takes 173 sec. (~1000 times slower)
rows=6367732

For:

tbl_requests.symbol LIKE 'AM%' 

with key_request_time index it takes 0.640 sec.
with key_request_type_symbol index it takes 2.2 sec. (~3 times slower)
rows=838822

For:

tbl_requests.symbol LIKE 'AMZ%' 

with key_request_time index it takes 4.5 sec.
with key_request_type_symbol index it takes 0.15 sec. (~30 times faster)
rows=73083

For:

tbl_requests.symbol LIKE 'AMZN%' 

with key_request_time index it takes 4.4 sec.
with key_request_type_symbol index it takes 0.15 sec. (~30 times faster)
rows=79762

Also, with the key_request_type_symbol index, execution gets much faster when the same symbol filter is used again, while for key_request_time the timing stays about the same.

I am going to receive a lot of queries with one symbol, so I need them to be fast. But I may also receive queries filtered by many symbols. How can I force the server to pick the fastest way for me in each case?

One method I can imagine is to send an EXPLAIN statement first, check the expected number of rows for the key_request_type_symbol index, and then modify the query to use one index or the other accordingly (for example, if rows is over 300000, use key_request_time).
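
A minimal sketch of that client-side decision, written in Python only for illustration. It assumes the application has already run EXPLAIN with the key_request_type_symbol hint and read the rows estimate from it. The function names pick_index and with_index_hint are hypothetical, and the 300000 threshold is simply the figure mentioned above, not a tuned value.

```python
def pick_index(estimated_rows, threshold=300_000):
    """Choose which index to hint, given the EXPLAIN row estimate
    obtained for key_request_type_symbol (hypothetical helper)."""
    if estimated_rows > threshold:
        # Too many matching rows to collect and filesort: walk by time instead.
        return "key_request_time"
    return "key_request_type_symbol"


def with_index_hint(index_name):
    """Build the FROM clause with the hint; the rest of the query is unchanged."""
    return f"FROM tbl_requests USE INDEX ({index_name})"


# Row estimates taken from the measurements above.
print(with_index_hint(pick_index(79762)))    # symbol LIKE 'AMZN%'
print(with_index_hint(pick_index(6367732)))  # symbol LIKE 'A%'
```

The cost of this approach is one extra round trip per query, and the rows value is only the optimizer's estimate, so the threshold would need checking against real timings.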

But maybe I am missing something? Maybe the indexes are not right (though I couldn't find better ones)? It would be nice to keep the query unmodified and have MySQL be smart enough to choose the fastest way automatically.

Vladimir Shutow asked Dec 07 '16 21:12


1 Answer

Here's the rule you're missing about how MySQL uses indexes:

  1. The left-most column(s) in your index must match the column(s) for equality comparisons (e.g. symbol = 'AAPL'). You can have several columns, as long as they're all doing equality conditions.
  2. Then the single next column in the index can match a column for range comparison. A range comparison is anything other than equality. So: <>, >, <, IN(), BETWEEN, LIKE with no leading wildcard, or IS [NOT] NULL.
  3. An index can also be used for GROUP BY or ORDER BY, but not if you have used an index for a range condition. Basically, you get one more column in your index, following the column(s) doing equality-tests.

Example: Suppose you have a query with the following conditions:

WHERE a = 1 AND b = 2 AND c > 3 AND d IN (4,5,6)

Suppose you have an index on (a, b, c, d) in that order. Only the a, b, c columns from the index will help the query. Since the c column is in an inequality comparison, that's the last column in the index that helps.
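
The rule can be sketched as a small Python function, purely as an illustration of the three points above and not of how the MySQL optimizer is implemented. The name usable_index_columns and the "eq"/"range" labels are made up for this example, and IN() is treated as a range condition because that is how the rule above classifies it.

```python
def usable_index_columns(index_columns, conditions):
    """Return the leading index columns that help a query.

    conditions maps a column name to "eq" or "range" (hypothetical labels).
    Equality columns are consumed left to right; the first range column
    is the last one that helps; a column with no condition stops the scan.
    """
    usable = []
    for column in index_columns:
        kind = conditions.get(column)
        if kind == "eq":
            usable.append(column)
        elif kind == "range":
            usable.append(column)
            break
        else:
            break
    return usable


# WHERE a = 1 AND b = 2 AND c > 3 AND d IN (4,5,6), index on (a, b, c, d)
conditions = {"a": "eq", "b": "eq", "c": "range", "d": "range"}
print(usable_index_columns(["a", "b", "c", "d"], conditions))  # ['a', 'b', 'c']
```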

(Actually, InnoDB has a recent feature called "index condition pushdown" which may allow the storage engine to help a little bit more by searching for values of d, but don't count on that being as good as regular index lookups. I saw the note "Using index condition" in one of your EXPLAIN outputs, indicating that it's employing this feature. Read http://dev.mysql.com/doc/refman/5.7/en/index-condition-pushdown-optimization.html for more details.)

Likewise, the following query would not be able to use d to avoid the filesort, because of c's inequality condition.

WHERE a = 1 AND b = 2 AND c > 3
ORDER BY d

Whereas the following would be able to use d for optimizing the sort, because once the query finds the subset of rows where c=3, the remaining matches are naturally read in d order.

WHERE a = 1 AND b = 2 AND c = 3
ORDER BY d

Now for how this applies to your query:

WHERE (tbl_requests.request_time >= '2016-02-23' AND 
       tbl_requests.request_time <= '2016-12-23') 
AND (tbl_requests.request_type IN (0, 1, 9))  
[AND (tbl_requests.symbol = 'AAPL' ... )]
ORDER BY tbl_requests.request_time DESC, tbl_requests.request_id DESC

The condition on symbol is equality. That should go left-most in the index.

The conditions on request_time and request_type are both inequality. You can only benefit from one or the other in an index. Choose the one that is most selective, that is, the one that narrows down the search the best. Add the other column to the index just in case ICP can help a little.

I'd guess that the request_time column is more selective in most cases. I see your condition is a 10-month range, which might be most of your table, but depending on the date range you choose, it could be more narrow.

Likewise, the three values 0, 1, 9 for request_type might also match most of the rows in your table. If so, then that condition would not be very selective, and I'd put that column last.

ALTER TABLE tbl_requests ADD INDEX (symbol, request_time, request_type);
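
The column order in that index follows from the reasoning above: equality columns first, then the range columns from most to least selective. A rough Python sketch of that ordering is below. The function name order_index_columns is hypothetical, and the fractions of matching rows are invented placeholders, not measurements from this table.

```python
def order_index_columns(conditions):
    """Order columns for a composite index (illustrative helper).

    conditions is a list of (column, kind, fraction_matched) tuples, where
    kind is "eq" or "range" and fraction_matched is an assumed share of
    rows the condition keeps (smaller means more selective).
    """
    equality = [col for col, kind, _ in conditions if kind == "eq"]
    ranges = sorted((frac, col) for col, kind, frac in conditions if kind == "range")
    return equality + [col for _, col in ranges]


# Placeholder selectivity guesses, for illustration only.
conditions = [
    ("request_time", "range", 0.5),
    ("request_type", "range", 0.9),
    ("symbol", "eq", 0.0001),
]
print(order_index_columns(conditions))  # ['symbol', 'request_time', 'request_type']
```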

The ORDER BY on request_time comes after the inequality conditions, so there's no way to avoid filesorting the matching rows, sorry.

Bill Karwin answered Sep 18 '22 22:09
