Distinct vs Group By

Tags:

I have two tables like this. The 'order' table has 21886 rows.

CREATE TABLE `order` (
  `id` bigint(20) unsigned NOT NULL,
  `reg_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `idx_reg_date` (`reg_date`),
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci


CREATE TABLE `order_detail_products` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `order_id` bigint(20) unsigned NOT NULL,
  `order_detail_id` int(11) NOT NULL,
  `prod_id` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_order_detail_id` (`order_detail_id`,`prod_id`),
  KEY `idx_order_id` (`order_id`,`order_detail_id`,`prod_id`)
) ENGINE=InnoDB AUTO_INCREMENT=572375 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

My question is here.

MariaDB [test]> explain
    -> SELECT DISTINCT A.id
    -> FROM order A
    -> JOIN order_detail_products B ON A.id = B.order_id
    -> ORDER BY A.reg_date DESC LIMIT 100, 30;
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+-------+----------------------------------------------+
| id   | select_type | table | type  | possible_keys | key          | key_len | ref               | rows  | Extra                                        |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+-------+----------------------------------------------+
|    1 | SIMPLE      | A     | index | PRIMARY       | idx_reg_date | 8       | NULL              | 22151 | Using index; Using temporary; Using filesort |
|    1 | SIMPLE      | B     | ref   | idx_order_id  | idx_order_id | 8       | bom_20140804.A.id |     2 | Using index; Distinct                        |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+-------+----------------------------------------------+
2 rows in set (0.00 sec)

MariaDB [test]> explain
    -> SELECT A.id
    -> FROM order A
    -> JOIN order_detail_products B ON A.id = B.order_id
    -> GROUP BY A.id
    -> ORDER BY A.reg_date DESC LIMIT 100, 30;
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+------+------------------------------+
| id   | select_type | table | type  | possible_keys | key          | key_len | ref               | rows | Extra                        |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+------+------------------------------+
|    1 | SIMPLE      | A     | index | PRIMARY       | idx_reg_date | 8       | NULL              |   65 | Using index; Using temporary |
|    1 | SIMPLE      | B     | ref   | idx_order_id  | idx_order_id | 8       | bom_20140804.A.id |    2 | Using index                  |
+------+-------------+-------+-------+---------------+--------------+---------+-------------------+------+------------------------------+

Listed above, two queries returns same result but distinct is too slow(explain too many rows). What's the difference?

575

asked Aug 04 '14 08:08

chris

1 Answers

It is usually advised to use DISTINCT instead of GROUP BY, since that is what you actually want, and let the optimizer choose the "best" execution plan. However - no optimizer is perfect. Using DISTINCT the optimizer can have more options for an execution plan. But that also means that it has more options to choose a bad plan.

You write that the DISTINCT query is "slow", but you don't tell any numbers. In my test (with 10 times as many rows on MariaDB 10.0.19 and 10.3.13) the DISTINCT query is like (only) 25% slower (562ms/453ms). The EXPLAIN result is no help at all. It's even "lying". With LIMIT 100, 30 it would need to read at least 130 rows (that's what my EXPLAIN actually schows for GROUP BY), but it shows you 65.

I can't explain the 25% difference in execution time, but it seems that the engine is doing a full table/index scan in any case, and sorts the result before it can skip 100 and select 30 rows.

The best plan would probably be:

Read rows from idx_reg_date index (table A) one by one in descending order
Look if there is a match in the idx_order_id index (table B)
Skip 100 matching rows
Send 30 matching rows
Exit

If there are like 10% of rows in A which have no match in B, this plan would read something like 143 rows from A.

Best I could do to somehow force this plan is:

SELECT A.id
FROM `order` A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30
OFFSET 100

This query returns the same result in 156 ms (3 times faster than GROUP BY). But that is still too slow. And it's probaly still reading all rows in table A.

We can proof that a better plan can exist with a "little" subquery trick:

SELECT A.id
FROM (
    SELECT id, reg_date
    FROM `order`
    ORDER BY reg_date DESC
    LIMIT 1000
) A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30
OFFSET 100

This query executes in "no time" (~ 0 ms) and returns the same result on my test data. And though it's not 100% reliable, it shows that the optimizer is not doing a good job.

So what are my conclusions:

The optimizer does not always do the best job and sometimes needs help
Even when we know "the best plan", we can not always enforce it
DISTINCT is not always faster than GROUP BY
When no index can be used for all clauses - things are getting quite tricky

Test schema and dummy data:

drop table if exists `order`;
CREATE TABLE `order` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `reg_date` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  KEY `idx_reg_date` (`reg_date`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

insert into `order`(reg_date)
    select from_unixtime(floor(rand(1) * 1000000000)) as reg_date
    from information_schema.COLUMNS a
       , information_schema.COLUMNS b
    limit 218860;

drop table if exists `order_detail_products`;
CREATE TABLE `order_detail_products` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `order_id` bigint(20) unsigned NOT NULL,
  `order_detail_id` int(11) NOT NULL,
  `prod_id` int(11) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_order_detail_id` (`order_detail_id`,`prod_id`),
  KEY `idx_order_id` (`order_id`,`order_detail_id`,`prod_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;

insert into order_detail_products(id, order_id, order_detail_id, prod_id)
    select null as id
    , floor(rand(2)*218860)+1 as order_id
    , 0 as order_detail_id
    , 0 as prod_id
    from information_schema.COLUMNS a
       , information_schema.COLUMNS b
    limit 437320;

Queries:

SELECT DISTINCT A.id
FROM `order` A
JOIN order_detail_products B ON A.id = B.order_id
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- 562 ms

SELECT A.id
FROM `order` A
JOIN order_detail_products B ON A.id = B.order_id
GROUP BY A.id
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- 453 ms

SELECT A.id
FROM `order` A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- 156 ms

SELECT A.id
FROM (
    SELECT id, reg_date
    FROM `order`
    ORDER BY reg_date DESC
    LIMIT 1000
) A
WHERE EXISTS (SELECT * FROM order_detail_products B WHERE A.id = B.order_id)
ORDER BY A.reg_date DESC
LIMIT 30 OFFSET 100;
-- ~ 0 ms

126

answered Oct 17 '22 13:10

Paul Spiegel

Related questions
                            
                                Mysql connect to server: Access denied for user root@localhost
                            
                                Docker compose of mutiple sub-website
                            
                                Calculating the relevance of a User based on Specific data
                            
                                Is it possible to use Cacti to monitor MySQL on Amazon's RDS?
                            
                                Accent-insensitive sorting in MySQL
                            
                                turn on mysql profiler globally
                            
                                How do you use the mvc-mini-profiler with Entity Framework 4.1
                            
                                Writing MySQL Node.js models using node-mysql
                            
                                How to migrate database from Postgres to MySQL? [closed]
                            
                                How to check against all joins when generating a score using MYSQL
                            
                                Rails - How to force associations to use alias table name
                            
                                What is SQL to select a property and the max number of occurrences of a related property?
                            
                                MySQL sub-query WHERE filtering out too much
                            
                                insufficient history for index
                            
                                Clicking 'Refresh' crashes phpMyAdmin
                            
                                Lexing partial SQL in C#
                            
                                java alternative for phpMyAdmin [closed]
                            
                                SQL LIMIT returns no results where no LIMIT returns results
                            
                                Complex time-series statistical aggregation involving polymorphic associations
                            
                                Single mysql select on join tables to a json hierarchical fragment, how?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Distinct vs Group By

Tags:

mysql

group-by

distinct