I have been trying to improve query times for an existing Oracle database-driven application that has been running a little sluggish. The application executes several large queries, such as the one below, which can take over an hour to run. Replacing the DISTINCT
with a GROUP BY
clause in the query below shrank execution time from 100 minutes to 10 seconds. My understanding was that SELECT DISTINCT
and GROUP BY
operated in pretty much the same way. Why such a huge disparity between execution times? What is the difference in how the query is executed on the back-end? Is there ever a situation where SELECT DISTINCT
runs faster?
Note: In the following query, WHERE TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A'
represents just one of a number of ways that results can be filtered. This example was provided to show the reasoning for joining all of the tables that do not have columns included in the SELECT
and would result in about a tenth of all available data
SQL using DISTINCT
:
SELECT DISTINCT ITEMS.ITEM_ID, ITEMS.ITEM_CODE, ITEMS.ITEMTYPE, ITEM_TRANSACTIONS.STATUS, (SELECT COUNT(PKID) FROM ITEM_PARENTS WHERE PARENT_ITEM_ID = ITEMS.ITEM_ID ) AS CHILD_COUNT FROM ITEMS INNER JOIN ITEM_TRANSACTIONS ON ITEMS.ITEM_ID = ITEM_TRANSACTIONS.ITEM_ID AND ITEM_TRANSACTIONS.FLAG = 1 LEFT OUTER JOIN ITEM_METADATA ON ITEMS.ITEM_ID = ITEM_METADATA.ITEM_ID LEFT OUTER JOIN JOB_INVENTORY ON ITEMS.ITEM_ID = JOB_INVENTORY.ITEM_ID LEFT OUTER JOIN JOB_TASK_INVENTORY ON JOB_INVENTORY.JOB_ITEM_ID = JOB_TASK_INVENTORY.JOB_ITEM_ID LEFT OUTER JOIN JOB_TASKS ON JOB_TASK_INVENTORY.TASKID = JOB_TASKS.TASKID LEFT OUTER JOIN JOBS ON JOB_TASKS.JOB_ID = JOBS.JOB_ID LEFT OUTER JOIN TASK_INVENTORY_STEP ON JOB_INVENTORY.JOB_ITEM_ID = TASK_INVENTORY_STEP.JOB_ITEM_ID LEFT OUTER JOIN TASK_STEP_INFORMATION ON TASK_INVENTORY_STEP.JOB_ITEM_ID = TASK_STEP_INFORMATION.JOB_ITEM_ID WHERE TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A' ORDER BY ITEMS.ITEM_CODE
SQL using GROUP BY
:
SELECT ITEMS.ITEM_ID, ITEMS.ITEM_CODE, ITEMS.ITEMTYPE, ITEM_TRANSACTIONS.STATUS, (SELECT COUNT(PKID) FROM ITEM_PARENTS WHERE PARENT_ITEM_ID = ITEMS.ITEM_ID ) AS CHILD_COUNT FROM ITEMS INNER JOIN ITEM_TRANSACTIONS ON ITEMS.ITEM_ID = ITEM_TRANSACTIONS.ITEM_ID AND ITEM_TRANSACTIONS.FLAG = 1 LEFT OUTER JOIN ITEM_METADATA ON ITEMS.ITEM_ID = ITEM_METADATA.ITEM_ID LEFT OUTER JOIN JOB_INVENTORY ON ITEMS.ITEM_ID = JOB_INVENTORY.ITEM_ID LEFT OUTER JOIN JOB_TASK_INVENTORY ON JOB_INVENTORY.JOB_ITEM_ID = JOB_TASK_INVENTORY.JOB_ITEM_ID LEFT OUTER JOIN JOB_TASKS ON JOB_TASK_INVENTORY.TASKID = JOB_TASKS.TASKID LEFT OUTER JOIN JOBS ON JOB_TASKS.JOB_ID = JOBS.JOB_ID LEFT OUTER JOIN TASK_INVENTORY_STEP ON JOB_INVENTORY.JOB_ITEM_ID = TASK_INVENTORY_STEP.JOB_ITEM_ID LEFT OUTER JOIN TASK_STEP_INFORMATION ON TASK_INVENTORY_STEP.JOB_ITEM_ID = TASK_STEP_INFORMATION.JOB_ITEM_ID WHERE TASK_INVENTORY_STEP.STEP_TYPE = 'TYPE A' GROUP BY ITEMS.ITEM_ID, ITEMS.ITEM_CODE, ITEMS.ITEMTYPE, ITEM_TRANSACTIONS.STATUS ORDER BY ITEMS.ITEM_CODE
Here is the Oracle query plan for the query using DISTINCT
:
Here is the Oracle query plan for the query using GROUP BY
:
DISTINCT is used to filter unique records out of all records in the table. It removes the duplicate rows. SELECT DISTINCT will always be the same, or faster than a GROUP BY.
In summary: GROUP BY is slightly faster than SELECT DISTINCT.
DISTINCT would usually be faster than GROUP BY if a) there's no index on that column and b) you are not ordering as well since GROUP BY does both filtering and ordering.
When and where to use GROUP BY and DISTINCT. DISTINCT is used to filter unique records out of the records that satisfy the query criteria. The "GROUP BY" clause is used when you need to group the data and it should be used to apply aggregate operators to each group.
The performance difference is probably due to the execution of the subquery in the SELECT
clause. I am guessing that it is re-executing this query for every row before the distinct. For the group by
, it would execute once after the group by.
Try replacing it with a join, instead:
select . . ., parentcnt from . . . left outer join (SELECT PARENT_ITEM_ID, COUNT(PKID) as parentcnt FROM ITEM_PARENTS ) p on items.item_id = p.parent_item_id
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With