Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Condition on joined table faster than condition on reference

Tags:

sql

join

mysql

I have a query involving two tables: table A has lots of rows, and contains a field called b_id, which references a record from table B, which has about 30 different rows. Table A has an index on b_id, and table B has an index on the column name.

My query looks something like this:

SELECT COUNT(A.id) FROM A INNER JOIN B ON B.id = A.b_id WHERE (B.name != 'dummy') AND <condition>;

With condition being some random condition on table A (I have lots of those, all exhibiting the same behavior).

This query is extremely slow (taking north of 2 seconds), and using explain, shows that query optimizer starts with table B, coming up with about 29 rows, and then scans table A. Doing a STRAIGHT_JOIN, turned the order around and the query ran instantaneously.

I'm not a fan of black magic, so I decided to try something else: come up with the id for the record in B that has the name dummy, let's say 23, and then simplify the query to:

SELECT COUNT(A.id) FROM A WHERE (b_id != 23) AND <condition>;

To my surprise, this query was actually slower than the straight join, taking north of a second.

Any ideas on why the join would be faster than the simplified query?

UPDATE: following a request in the comments, the outputs from explain:

Straight join:

+----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+
| id | select_type | table | type   | possible_keys   | key     | key_len | ref           | rows   | Extra       |
+----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+
|  1 | SIMPLE      | A     | ALL    | b_id            | NULL    | NULL    | NULL          | 200707 | Using where |
|  1 | SIMPLE      | B     | eq_ref | PRIMARY,id_name | PRIMARY | 4       | schema.A.b_id |     1  | Using where |
+----+-------------+-------+--------+-----------------+---------+---------+---------------+--------+-------------+

No join:

+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
| id | select_type | table | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+
|  1 | SIMPLE      | A     | ALL  | b_id          | NULL | NULL    | NULL | 200707 | Using where |
+----+-------------+-------+------+---------------+------+---------+------+--------+-------------+

UPDATE 2: Tried another variant:

SELECT COUNT(A.id) FROM A WHERE b_id IN (<all the ids except for 23>) AND <condition>;

This runs faster than the no join, but still slower than the join, so it seems that the inequality operation is responsible for part of the performance hit, but not all.

like image 283
On Freund Avatar asked Oct 23 '13 22:10

On Freund


Video Answer


1 Answers

If you are using MySQL 5.6 or later then you can ask the query optimizer what it is doing;

SET optimizer_trace="enabled=on";

## YOUR QUERY 
SELECT COUNT(*) FROM transactions WHERE (id < 9000) and user != 11;
##END YOUR QUERY

SELECT trace FROM information_schema.optimizer_trace;

SET optimizer_trace="enabled=off";

You will almost certainly need to refer to the following sections in the MySQL reference Tracing the Optimiser and The Optimizer


Looking at the first explain it appears that the query is quicker probably because the optimizer can use the table B to filter down to the rows required based on the join and then use the foreign key to get the rows in table A.

In the explain it's this bit that is interesting; there is only one row matching and it's using schema.A.b_id. Effectively this is pre-filtering the rows from A which is where I think the performance difference comes from.

   | ref           | rows   | Extra       |
   | schema.A.b_id |     1  | Using where |

So, as is usual with queries it is all down to indexes - or more accurately missing indexes. Just because you have indexes on individual fields it doesn't necessarily mean that these are suitable for the query you're running.

Basic rule: If the EXPLAIN doesn't say Using Index then you need to add a suitable index.

Looking at the explain output the first interesting thing is ironically the last thing on each line; namely the Extra

In the first example we see

|  1 | SIMPLE      | A     | .... Using where |
|  1 | SIMPLE      | B     | ...  Using where |

Both of these Using where is not good; ideally at least one, and preferably both should say Using index

When you do

SELECT COUNT(A.id) FROM A WHERE (b_id != 23) AND <condition>;

and see Using where then you need to add an index as it's doing a table scan.

for example if you did

EXPLAIN SELECT COUNT(A.id) FROM A WHERE (Id > 23)

You should see Using where; Using index (assuming here that Id is the primary key and has an index)

If you then added a condition onto the end

EXPLAIN SELECT COUNT(A.id) FROM A WHERE (Id > 23) and Field > 0

and see Using where then you need to add an index for the two fields. Just having an index on a field doesn't mean that MySQL will be able to use that index during the query across multiple fields - this is something that internally the query optimizer will decide upon. I'm not exactly certain of the internal rules; but generally adding an extra index to match the query helps immensely.

So adding an index (on the two fields in the query above):

ALTER TABLE `A` ADD INDEX `IndexIdField` (`Id`,`Field`)

should change it such that when querying based upon those two fields there is an index.

I've tried this on one of my databases that has Transactions and User tables.

I'll use this query

EXPLAIN SELECT COUNT(*) FROM transactions WHERE (id < 9000) and user != 11;

Running without index on the two fields:

PRIMARY,user    PRIMARY 4   NULL    14334   Using where

Then add an index:

ALTER TABLE `transactions` ADD INDEX `IndexIdUser` (`id`, `user`);

Then the same query again and this time

PRIMARY,user,Index 4    Index 4 4   NULL    12628   Using where; Using index

This time it's using the indexes - and as a result will be a lot quicker.


From comments by @Wrikken - and also bear in mind that I don't have the accurate schema / data so some of this investigation has required assumptions about the schema (which may be wrong)

SELECT COUNT(A.id) FROM A FORCE INDEX (b_id)

would perform at least as good as 

SELECT COUNT(A.id) FROM A INNER JOIN B ON A.b_id = B.id.

If we look at the first EXPLAIN in the OP we see that there are two elements to the query. Referring to the EXPLAIN documentation for *eq_ref* I can see that this is going to define the rows for consideration based on this relationship.

The order of the explain output doesn't necessarily mean it's doing one and then the other; it's simply what has been chosen to execute the query (at least as far as I can tell).

For some reason the query optimizer has decided not to use the index on b_id - I'm assuming here that because of the query the optimizer has decided that it will be more efficient to do a table scan.

The second explain concerns me a little because it's not considering the index on b_id; possibly because of the AND <condition> (which is omitted so I'm guessing as to what it could be). When I try this with an index on b_id it does use the index; but as soon as a condition is added it doesn't use the index.

So, when doing

  SELECT COUNT(A.id) FROM A INNER JOIN B ON A.b_id = B.id.

This all indicates to me is that the PRIMARY index on B is where the speed difference is coming from; I'm assuming because of the schema.A.b_id in the explain that there is a Foreign key on this table; which must be a better collection of related rows than the index on b_id - so the query optimizer can use this relationship to define which rows to pick - and because a primary index is better than secondary indexes it's going to be much quicker to select rows out of B and then use the relationship link to match against the rows in A.

like image 198
Richard Harrison Avatar answered Oct 29 '22 21:10

Richard Harrison