Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to optimize execution plan for query with multiple outer joins to huge tables, group by and order by clauses?

I have the following database (simplified):

CREATE TABLE `tracking` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `manufacture` varchar(100) NOT NULL,
  `date_last_activity` datetime NOT NULL,
  `date_created` datetime NOT NULL,
  `date_updated` datetime NOT NULL,
  PRIMARY KEY (`id`),
  KEY `manufacture` (`manufacture`),
  KEY `manufacture_date_last_activity` (`manufacture`, `date_last_activity`),
  KEY `date_last_activity` (`date_last_activity`),
) ENGINE=InnoDB AUTO_INCREMENT=401353 DEFAULT CHARSET=utf8

CREATE TABLE `tracking_items` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `tracking_id` int(11) NOT NULL,
  `tracking_object_id` varchar(100) NOT NULL,
  `tracking_type` int(11) NOT NULL COMMENT 'Its used to specify the type of each item, e.g. car, bike, etc',
  `date_created` datetime NOT NULL,
  `date_updated` datetime NOT NULL,
  PRIMARY KEY (`id`),
  KEY `tracking_id` (`tracking_id`),
  KEY `tracking_object_id` (`tracking_object_id`),
  KEY `tracking_id_tracking_object_id` (`tracking_id`,`tracking_object_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1299995 DEFAULT CHARSET=utf8

CREATE TABLE `cars` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `car_id` varchar(255) NOT NULL COMMENT 'It must be VARCHAR, because the data is coming from external source.',
  `manufacture` varchar(255) NOT NULL,
  `car_text` text CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  `date_order` datetime NOT NULL,
  `date_created` datetime NOT NULL,
  `date_updated` datetime NOT NULL,
  `deleted` tinyint(4) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`),
  UNIQUE KEY `car_id` (`car_id`),
  KEY `sort_field` (`date_order`)
) ENGINE=InnoDB AUTO_INCREMENT=150000025 DEFAULT CHARSET=utf8

This is my "problematic" query, that runs extremely slow.

SELECT sql_no_cache `t`.*,
       count(`t`.`id`) AS `cnt_filtered_items`
FROM `tracking` AS `t`
INNER JOIN `tracking_items` AS `ti` ON (`ti`.`tracking_id` = `t`.`id`)
LEFT JOIN `cars` AS `c` ON (`c`.`car_id` = `ti`.`tracking_object_id`
                            AND `ti`.`tracking_type` = 1)
LEFT JOIN `bikes` AS `b` ON (`b`.`bike_id` = `ti`.`tracking_object_id`
                            AND `ti`.`tracking_type` = 2)
LEFT JOIN `trucks` AS `tr` ON (`tr`.`truck_id` = `ti`.`tracking_object_id`
                            AND `ti`.`tracking_type` = 3)
WHERE (`t`.`manufacture` IN('1256703406078',
                            '9600048390403',
                            '1533405067830'))
  AND (`c`.`car_text` LIKE '%europe%'
       OR `b`.`bike_text` LIKE '%europe%'
       OR `tr`.`truck_text` LIKE '%europe%')
GROUP BY `t`.`id`
ORDER BY `t`.`date_last_activity` ASC,
         `t`.`id` ASC
LIMIT 15

This is the result of EXPLAIN for above query:

+----+-------------+-------+--------+-----------------------------------------------------------------------+-------------+---------+-----------------------------+---------+----------------------------------------------+
| id | select_type | table |  type  |                             possible_keys                             |     key     | key_len |             ref             |  rows   |                    extra                     |
+----+-------------+-------+--------+-----------------------------------------------------------------------+-------------+---------+-----------------------------+---------+----------------------------------------------+
|  1 | SIMPLE      | t     | index  | PRIMARY,manufacture,manufacture_date_last_activity,date_last_activity | PRIMARY     |       4 | NULL                        | 400,000 | Using where; Using temporary; Using filesort |
|  1 | SIMPLE      | ti    | ref    | tracking_id,tracking_object_id,tracking_id_tracking_object_id         | tracking_id |       4 | table.t.id                  |       1 | NULL                                         |
|  1 | SIMPLE      | c     | eq_ref | car_id                                                                | car_id      |     767 | table.ti.tracking_object_id |       1 | Using where                                  |
|  1 | SIMPLE      | b     | eq_ref | bike_id                                                               | bike_id     |     767 | table.ti.tracking_object_id |       1 | Using where                                  |
|  1 | SIMPLE      | t     | eq_ref | truck_id                                                              | truck_id    |     767 | table.ti.tracking_object_id |       1 | Using where                                  |
+----+-------------+-------+--------+-----------------------------------------------------------------------+-------------+---------+-----------------------------+---------+----------------------------------------------+

What is the problem this query is trying to solve?

Basically, I need to find all records in tracking table that may be associated with records in tracking_items (1:n) where each record in tracking_items may be associated with record in left joined tables. The filtering criteria is crucial part in the query.

What is the problem I have with the query above?

When there's order by and group by clauses the query runs extremely slow, e.g. 10-15 seconds to complete for the above configuration. However, if I omit any of these clauses, the query is running pretty quick (~0.2 seconds).

What I've already tried?

  1. I've tried to used a FULLTEXT index, but it didn't help much, as the results evaluated by the LIKE statemenet are narrowed by the JOINs using indexes.
  2. I've tried to use WHERE EXISTS (...) to find if there are records in left joined tables, but unfortunately without any luck.

Few notes about relations between these tables:

tracking -> tracking_items (1:n)
tracking_items -> cars (1:1)
tracking_items -> bikes (1:1)
tracking_items -> trucks (1:1)

So, I'm looking for a way to optimize that query.

like image 385
Kristian Vitozev Avatar asked Aug 14 '18 16:08

Kristian Vitozev


People also ask

Does the order of joins matter for performance?

Even though the join order has no impact on the final result, it still affects performance. The optimizer will therefore evaluate all possible join order permutations and select the best one.

Which SQL query is faster filter on join criteria or WHERE clause?

I ran some tests and the results show that it is actually very close, but the WHERE clause is actually slightly faster! =) I absolutely agree that it makes more sense to apply the filter on the WHERE clause, I was just curious as to the performance implications.


3 Answers

Bill Karwin suggests the query might perform better if it used an index with a leading column of manufacture. I second that suggestion. Especially if that's very selective.

I also note that we're doing a GROUP BY t.id, where id is the PRIMARY KEY of the table.

No columns from any tables other than tracking are referenced in the SELECT list.

This suggests we're really only interested in returning rows from t, and not on creating duplicates due to multiple outer joins.

Seems like the COUNT() aggregate has the potential to return an inflated count, if there are multiple matching rows in tracking_item and bikes,cars,trucks. If there's three matching rows from cars, and four matching rows from bikes, ... the COUNT() aggregate is going to return a value of 12, rather than 7. (Or maybe there is some guarantee in the data such that there won't ever be multiple matching rows.)

If the manufacture is very selective, and that returns a reasonably small set of rows from tracking, if the query can make use of an index ...

And since we aren't returning any columns from any tables other than tracking, apart from a count or related items ...

I would be tempted to test correlated subqueries in the SELECT list, to get the count, and filter out the zero count rows using a HAVING clause.

Something like this:

SELECT SQL_NO_CACHE `t`.*
     , ( ( SELECT COUNT(1)
             FROM `tracking_items` `tic`
             JOIN `cars` `c`
               ON `c`.`car_id`           = `tic`.`tracking_object_id`
              AND `c`.`car_text`      LIKE '%europe%'
            WHERE `tic`.`tracking_id`    = `t`.`id`
              AND `tic`.`tracking_type`  = 1
         )
       + ( SELECT COUNT(1)
             FROM `tracking_items` `tib`
             JOIN `bikes` `b`
               ON `b`.`bike_id`          = `tib`.`tracking_object_id` 
              AND `b`.`bike_text`     LIKE '%europe%'
            WHERE `tib`.`tracking_id`    = `t`.`id`
              AND `tib`.`tracking_type`  = 2
         )
       + ( SELECT COUNT(1)
             FROM `tracking_items` `tit`
             JOIN `trucks` `tr`
               ON `tr`.`truck_id`        = `tit`.`tracking_object_id`
              AND `tr`.`truck_text`   LIKE '%europe%'
            WHERE `tit`.`tracking_id`    = `t`.`id`
              AND `tit`.`tracking_type`  = 3
         ) 
       ) AS cnt_filtered_items
  FROM `tracking` `t`
 WHERE `t`.`manufacture` IN ('1256703406078', '9600048390403', '1533405067830')
HAVING cnt_filtered_items > 0
 ORDER
    BY `t`.`date_last_activity` ASC
     , `t`.`id` ASC

We'd expect that the query could make effective use of an index on tracking with leading column of manufacture.

And on the tracking_items table, we want an index with leading columns of type and tracking_id. And including tracking_object_id in that index would mean the query could be satisfied from the index, without visiting the underlying pages.

For the cars, bikes and trucks tables the query should make use of an index with leading column of car_id, bike_id, and truck_id respectively. There's no getting around a scan of the car_text, bike_text, truck_text columns for the matching string... best we can do is narrow down the number rows that need to have that check performed.

This approach (just the tracking table in the outer query) should eliminate the need for the GROUP BY, the work required to identify and collapse duplicate rows.

BUT this approach, replacing joins with correlated subqueries, is best suited to queries where there is a SMALL number of rows returned by the outer query. Those subqueries get executed for every row processed by the outer query. It's imperative that those subqueries to have suitable indexes available. Even with those tuned, there is still potential for horrible performance for large sets.

This does still leave us with a "Using filesort" operation for the ORDER BY.


If the count of related items should be the product of a multiplication, rather than addition, we could tweak the query to achieve that. (We'd have to muck with the return of zeros, and the condition in the HAVING clause would need to be changed.)

If there wasn't a requirement to return a COUNT() of related items, then I would be tempted to move the correlated subqueries from the SELECT list down into EXISTS predicates in the WHERE clause.


Additional notes: seconding the comments from Rick James regarding indexing... there appears to be redundant indexes defined. i.e.

KEY `manufacture` (`manufacture`)
KEY `manufacture_date_last_activity` (`manufacture`, `date_last_activity`)

The index on the singleton column isn't necessary, since there is another index that has the column as the leading column.

Any query that can make effective use of the manufacture index will be able to make effective use of the manufacture_date_last_activity index. That is to say, the manufacture index could be dropped.

The same applies for the tracking_items table, and these two indexes:

KEY `tracking_id` (`tracking_id`)
KEY `tracking_id_tracking_object_id` (`tracking_id`,`tracking_object_id`)

The tracking_id index could be dropped, since it's redundant.

For the query above, I would suggest adding a covering index:

KEY `tracking_items_IX3` (`tracking_id`,`tracking_type`,`tracking_object_id`)

-or- at a minimum, a non-covering index with those two columns leading:

KEY `tracking_items_IX3` (`tracking_id`,`tracking_type`)
like image 155
spencer7593 Avatar answered Nov 08 '22 09:11

spencer7593


The EXPLAIN shows you are doing an index-scan ("index" in the type column) on the tracking table. An index-scan is pretty much as costly as a table-scan, especially when the index scanned is the PRIMARY index.

The rows column also shows that this index-scan is examining > 355K rows (since this figure is only a rough estimate, it's in fact examining all 400K rows).

Do you have an index on t.manufacture? I see two indexes named in the possible keys that might include that column (I can't be sure solely based on the name of the index), but for some reason the optimizer isn't using them. Maybe the set of values you search for is matched by every row in the table anyway.

If the list of manufacture values is intended to match a subset of the table, then you might need to give a hint to the optimizer to make it use the best index. https://dev.mysql.com/doc/refman/5.6/en/index-hints.html

Using LIKE '%word%' pattern-matching can never utilize an index, and must evaluate the pattern-match on every row. See my presentation, Full Text Search Throwdown.

How many items are in your IN(...) list? MySQL sometimes has problems with very long lists. See https://dev.mysql.com/doc/refman/5.6/en/range-optimization.html#equality-range-optimization

P.S.: When you ask a query optimization question, you should always include the SHOW CREATE TABLE output for each table referenced in the query, so folks who answer don't have to guess at what indexes, data types, constraints you currently have.

like image 36
Bill Karwin Avatar answered Nov 08 '22 09:11

Bill Karwin


First of all: Your query makes assumptions about string contents, which it shouldn't. What may car_text like '%europe%' indicate? Something like 'Sold in Europe only' maybe? Or Sold outside Europe only? Two possible strings with contradictory meanings. So if you assume a certain meaning once you find europe in the string, then you should be able to introduce this knowledge in the database - with a Europe flag or a region code for instance.

Anyway, you are showing certain trackings with their Europe transportation count. So select trackings, select transportation counts. You can either have the aggregation subquery for transportation counts in your SELECT clause or in your FROM clause.

Subquery in SELECT clause:

select
  t.*,
  (
    select count(*)
    from tracking_items ti
    where ti.tracking_id = t.id
    and (tracking_type, tracking_object_id) in
    (
      select 1, car_id from cars where car_text like '%europe%'
      union all
      select 2, bike_id from bikes where bike_text like '%europe%'
      union all
      select 3, truck_id from trucks where truck_text like '%europe%'
    )
from tracking t
where manufacture in ('1256703406078', '9600048390403', '1533405067830')
order by date_last_activity, id;

Subquery in FROM clause:

select
  t.*, agg.total
from tracking t
left join
(
  select tracking_id, count(*) as total
  from tracking_items ti
  and (tracking_type, tracking_object_id) in
  (
    select 1, car_id from cars where car_text like '%europe%'
    union all
    select 2, bike_id from bikes where bike_text like '%europe%'
    union all
    select 3, truck_id from trucks where truck_text like '%europe%'
  )
  group by tracking_id
) agg on agg.tracking_id = t.id
where manufacture in ('1256703406078', '9600048390403', '1533405067830')
order by date_last_activity, id;

Indexes:

  • tracking(manufacture, date_last_activity, id)
  • tracking_items(tracking_id, tracking_type, tracking_object_id)
  • cars(car_text, car_id)
  • bikes(bike_text, bike_id)
  • trucks(truck_text, truck_id)

Sometimes MySQL is stronger on simple joins than on anything else, so it may be worth a try to blindly join transportation records and only later see whether it's car, bike or truck:

select
  t.*, agg.total
from tracking t
left join
(
  select
    tracking_id,
    sum((ti.tracking_type = 1 and c.car_text like '%europe%')
        or
        (ti.tracking_type = 2 and b.bike_text like '%europe%')
        or
        (ti.tracking_type = 3 and t.truck_text like '%europe%')
       ) as total
  from tracking_items ti
  left join cars c on c.car_id = ti.tracking_object_id
  left join bikes b on c.bike_id = ti.tracking_object_id
  left join trucks t on t.truck_id = ti.tracking_object_id
  group by tracking_id
) agg on agg.tracking_id = t.id
where manufacture in ('1256703406078', '9600048390403', '1533405067830')
order by date_last_activity, id;
like image 4
Thorsten Kettner Avatar answered Nov 08 '22 11:11

Thorsten Kettner