Huge performance difference when using GROUP BY vs DISTINCT

Question

I am performing some tests on a HSQLDB server with a table containing 500 000 entries. The table has no indices. There are 5000 distinct business keys. I need a list of them.

Naturally I started with a DISTINCT query:

SELECT DISTINCT business_key FROM memory WHERE concept <> 'case'   OR        attrib  <> 'status' OR        value   <> 'closed';

It takes around 90 seconds!!!

Then I tried using GROUP BY:

SELECT business_key FROM memory WHERE concept <> 'case'   OR        attrib  <> 'status' OR       value   <> 'closed'; GROUP BY business_key

And it takes 1 second!!!

Trying to figure out the difference I ran EXLAIN PLAN FOR but it seems to give the same information for both queries.

EXLAIN PLAN FOR DISTINCT ...

isAggregated=[false] columns=[   COLUMN: PUBLIC.MEMORY.BUSINESS_KEY ] [range variable 1   join type=INNER   table=MEMORY   alias=M   access=FULL SCAN   condition = [    index=SYS_IDX_SYS_PK_10057_10058     other condition=[     OR arg_left=[      OR arg_left=[       NOT_EQUAL arg_left=[        COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[        VALUE = case, TYPE = CHARACTER]] arg_right=[       NOT_EQUAL arg_left=[        COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[        VALUE = status, TYPE = CHARACTER]]] arg_right=[      NOT_EQUAL arg_left=[       COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[       VALUE = closed, TYPE = CHARACTER]]]   ] ]] PARAMETERS=[] SUBQUERIES[] Object References PUBLIC.MEMORY PUBLIC.MEMORY.CONCEPT PUBLIC.MEMORY.ATTRIB PUBLIC.MEMORY.VALUE PUBLIC.MEMORY.BUSINESS_KEY Read Locks PUBLIC.MEMORY WriteLocks

EXLAIN PLAN FOR SELECT ... GROUP BY ...

isDistinctSelect=[false] isGrouped=[true] isAggregated=[false] columns=[   COLUMN: PUBLIC.MEMORY.BUSINESS_KEY ] [range variable 1   join type=INNER   table=MEMORY   alias=M   access=FULL SCAN   condition = [    index=SYS_IDX_SYS_PK_10057_10058     other condition=[     OR arg_left=[      OR arg_left=[       NOT_EQUAL arg_left=[        COLUMN: PUBLIC.MEMORY.CONCEPT] arg_right=[        VALUE = case, TYPE = CHARACTER]] arg_right=[       NOT_EQUAL arg_left=[        COLUMN: PUBLIC.MEMORY.ATTRIB] arg_right=[        VALUE = status, TYPE = CHARACTER]]] arg_right=[      NOT_EQUAL arg_left=[       COLUMN: PUBLIC.MEMORY.VALUE] arg_right=[       VALUE = closed, TYPE = CHARACTER]]]   ] ]] groupColumns=[ COLUMN: PUBLIC.MEMORY.BUSINESS_KEY] PARAMETERS=[] SUBQUERIES[] Object References PUBLIC.MEMORY PUBLIC.MEMORY.CONCEPT PUBLIC.MEMORY.ATTRIB PUBLIC.MEMORY.VALUE PUBLIC.MEMORY.BUSINESS_KEY Read Locks PUBLIC.MEMORY WriteLocks

EDIT

I did additional tests. With 500 000 records in HSQLDB with all distinct business keys, the performance of DISTINCT is now better - 3 seconds, vs GROUP BY which took around 9 seconds.

In MySQL both queries preform the same:

MySQL: 500 000 rows - 5 000 distinct business keys: Both queries: 0.5 second MySQL: 500 000 rows - all distinct business keys: SELECT DISTINCT ... - 11 seconds SELECT ... GROUP BY business_key - 13 seconds

So the problem is only related to HSQLDB.

I will be very grateful if someone can explain why there is such a drastic difference.

Andomar · Accepted Answer

The two queries express the same question. Apparently the query optimizer chooses two different execution plans. My guess would be that the distinct approach is executed like:

Copy all business_key values to a temporary table
Sort the temporary table
Scan the temporary table, returning each item that is different from the one before it

The group by could be executed like:

Scan the full table, storing each value of business key in a hashtable
Return the keys of the hashtable

The first method optimizes for memory usage: it would still perform reasonably well when part of the temporary table has to be swapped out. The second method optimizes for speed, but potentially requires a large amount of memory if there are a lot of different keys.

Since you either have enough memory or few different keys, the second method outperforms the first. It's not unusual to see performance differences of 10x or even 100x between two execution plans.

Huge performance difference when using GROUP BY vs DISTINCT

Tags:

performance

sql

group-by

distinct

hsqldb

EDIT

Martin Dimitrov

1 Answers

Andomar

Recent Activity

Donate For Us

Huge performance difference when using GROUP BY vs DISTINCT

Tags:

performance

sql

group-by

distinct

hsqldb

EDIT

Martin Dimitrov

1 Answers

Andomar

Related questions

Recent Activity

Donate For Us