Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Distinct vs Group By

I have two tables like this. The 'order' table has 21886 rows.

CREATE TABLE `order` (
  `id` bigint(20) unsigned NOT NULL,
  `reg_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `idx_reg_date` (`reg_date`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci


CREATE TABLE `order_detail_products` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `order_id` bigint(20) unsigned NOT NULL,
  `order_detail_id` int(11) NOT NULL,
  `prod_id` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_order_detail_id` (`order_detail_id`,`prod_id`),
  KEY `idx_order_id` (`order_id`,`order_detail_id`,`prod_id`)
) ENGINE=InnoDB AUTO_INCREMENT=572375 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

My question is here.

MariaDB [test]> explain
    -> SELECT DISTINCT A.id
    -> FROM order A
    -> JOIN order_detail_products B ON A.id = B.order_id
    -> ORDER BY A.reg_date DESC LIMIT 100, 30;
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+-------+----------------------------------------------+
| id   | select_type | table | type  | possible_keys | key          | key_len | ref               | rows  | Extra                                        |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+-------+----------------------------------------------+
|    1 | SIMPLE      | A     | index | PRIMARY       | idx_reg_date | 8       | NULL              | 22151 | Using index; Using temporary; Using filesort |
|    1 | SIMPLE      | B     | ref   | idx_order_id  | idx_order_id | 8       | bom_20140804.A.id |     2 | Using index; Distinct                        |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+-------+----------------------------------------------+
2 rows in set (0.00 sec)

MariaDB [test]> explain
    -> SELECT A.id
    -> FROM order A
    -> JOIN order_detail_products B ON A.id = B.order_id
    -> GROUP BY A.id
    -> ORDER BY A.reg_date DESC LIMIT 100, 30;
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+------+------------------------------+
| id   | select_type | table | type  | possible_keys | key          | key_len | ref               | rows | Extra                        |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+------+------------------------------+
|    1 | SIMPLE      | A     | index | PRIMARY       | idx_reg_date | 8       | NULL              |   65 | Using index; Using temporary |
|    1 | SIMPLE      | B     | ref   | idx_order_id  | idx_order_id | 8       | bom_20140804.A.id |    2 | Using index                  |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+------+------------------------------+

Listed above, two queries returns same result but distinct is too slow(explain too many rows). What's the difference?

like image 575
chris Avatar asked Aug 04 '14 08:08

chris


People also ask

Is distinct better than GROUP BY?

DISTINCT is used to filter unique records out of all records in the table. It removes the duplicate rows. SELECT DISTINCT will always be the same, or faster than a GROUP BY.

Is distinct and GROUP BY the same?

GROUP BY lets you use aggregate functions, like AVG , MAX , MIN , SUM , and COUNT . On the other hand DISTINCT just removes duplicates.

Why is GROUP BY faster than distinct?

The GROUP BY clause is faster than the SELECT DISTINCT clause (at least in these tests) because it does not require writing to a temporary file on disk.

Can we use distinct with GROUP BY?

Well, GROUP BY and DISTINCT have their own use. GROUP BY cannot replace DISTINCT in some situations and DISTINCT cannot take place of GROUP BY. It is as per your choice and situation how you are optimizing both of them and choosing where to use GROUP BY and DISTINCT.


1 Answers

It is usually advised to use DISTINCT instead of GROUP BY, since that is what you actually want, and let the optimizer choose the "best" execution plan. However - no optimizer is perfect. Using DISTINCT the optimizer can have more options for an execution plan. But that also means that it has more options to choose a bad plan.

You write that the DISTINCT query is "slow", but you don't tell any numbers. In my test (with 10 times as many rows on MariaDB 10.0.19 and 10.3.13) the DISTINCT query is like (only) 25% slower (562ms/453ms). The EXPLAIN result is no help at all. It's even "lying". With LIMIT 100, 30 it would need to read at least 130 rows (that's what my EXPLAIN actually schows for GROUP BY), but it shows you 65.

I can't explain the 25% difference in execution time, but it seems that the engine is doing a full table/index scan in any case, and sorts the result before it can skip 100 and select 30 rows.

The best plan would probably be:

  • Read rows from idx_reg_date index (table A) one by one in descending order
  • Look if there is a match in the idx_order_id index (table B)
  • Skip 100 matching rows
  • Send 30 matching rows
  • Exit

If there are like 10% of rows in A which have no match in B, this plan would read something like 143 rows from A.

Best I could do to somehow force this plan is:

SELECT A.id
FROM `order` A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30
OFFSET 100

This query returns the same result in 156 ms (3 times faster than GROUP BY). But that is still too slow. And it's probaly still reading all rows in table A.

We can proof that a better plan can exist with a "little" subquery trick:

SELECT A.id
FROM (
    SELECT id, reg_date
    FROM `order`
    ORDER BY reg_date DESC
    LIMIT 1000
) A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30
OFFSET 100

This query executes in "no time" (~ 0 ms) and returns the same result on my test data. And though it's not 100% reliable, it shows that the optimizer is not doing a good job.

So what are my conclusions:

  • The optimizer does not always do the best job and sometimes needs help
  • Even when we know "the best plan", we can not always enforce it
  • DISTINCT is not always faster than GROUP BY
  • When no index can be used for all clauses - things are getting quite tricky

Test schema and dummy data:

drop table if exists `order`;
CREATE TABLE `order` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `reg_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `idx_reg_date` (`reg_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

insert into `order`(reg_date)
    select from_unixtime(floor(rand(1) * 1000000000)) as reg_date
    from information_schema.COLUMNS a
       , information_schema.COLUMNS b
    limit 218860;

drop table if exists `order_detail_products`;
CREATE TABLE `order_detail_products` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `order_id` bigint(20) unsigned NOT NULL,
  `order_detail_id` int(11) NOT NULL,
  `prod_id` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_order_detail_id` (`order_detail_id`,`prod_id`),
  KEY `idx_order_id` (`order_id`,`order_detail_id`,`prod_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

insert into order_detail_products(id, order_id, order_detail_id, prod_id)
    select null as id
    , floor(rand(2)*218860)+1 as order_id
    , 0 as order_detail_id
    , 0 as prod_id
    from information_schema.COLUMNS a
       , information_schema.COLUMNS b
    limit 437320;

Queries:

SELECT DISTINCT A.id
FROM `order` A
JOIN order_detail_products B ON A.id = B.order_id
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- 562 ms

SELECT A.id
FROM `order` A
JOIN order_detail_products B ON A.id = B.order_id
GROUP BY A.id
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- 453 ms

SELECT A.id
FROM `order` A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- 156 ms

SELECT A.id
FROM (
    SELECT id, reg_date
    FROM `order`
    ORDER BY reg_date DESC
    LIMIT 1000
) A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- ~ 0 ms
like image 126
Paul Spiegel Avatar answered Oct 17 '22 13:10

Paul Spiegel