
DISTINCT with PARTITION BY vs. GROUP BY

I have found some SQL queries in an application I am examining like this:

SELECT DISTINCT
Company, Warehouse, Item,
SUM(quantity) OVER (PARTITION BY Company, Warehouse, Item) AS stock

I'm quite sure this gives the same result as:

SELECT
Company, Warehouse, Item,
SUM(quantity) AS stock
GROUP BY Company, Warehouse, Item

Is there any benefit (performance, readability, additional flexibility in writing the query, maintainability, etc.) of using the first approach over the latter?
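The equivalence claim can be checked on a small data set. The following is a minimal sketch, not taken from the application in question: the table name `stock_moves` and the sample rows are hypothetical (the original queries omit the FROM clause), and SQLite via Python's `sqlite3` module is assumed only because it is easy to run. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

# Hypothetical table and rows; the original queries do not name their table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock_moves "
    "(Company TEXT, Warehouse TEXT, Item TEXT, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO stock_moves VALUES (?, ?, ?, ?)",
    [
        ("A", "W1", "bolt", 5),
        ("A", "W1", "bolt", 7),
        ("A", "W2", "bolt", 3),
        ("B", "W1", "nut", 4),
        ("B", "W1", "nut", 6),
    ],
)

# First form: window sum per partition, duplicates removed by DISTINCT.
windowed = conn.execute("""
    SELECT DISTINCT Company, Warehouse, Item,
           SUM(quantity) OVER (PARTITION BY Company, Warehouse, Item) AS stock
    FROM stock_moves
    ORDER BY Company, Warehouse, Item
""").fetchall()

# Second form: plain aggregation over the same columns.
grouped = conn.execute("""
    SELECT Company, Warehouse, Item, SUM(quantity) AS stock
    FROM stock_moves
    GROUP BY Company, Warehouse, Item
    ORDER BY Company, Warehouse, Item
""").fetchall()

print(windowed == grouped)
```

The ORDER BY is there only so the two result lists can be compared directly; neither query guarantees row order on its own.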

Andris asked Dec 04 '13 12:12




1 Answer

Performance:

Winner: GROUP BY

Some very rudimentary testing on a large table with unindexed columns showed that, at least in my case, the two queries generated completely different query plans. The one for PARTITION BY was significantly slower.

The GROUP BY query plan included only a table scan and an aggregation operation, while the PARTITION BY plan had two nested-loop self-joins. On the second run, the PARTITION BY query took about 2800 ms and the GROUP BY query only 500 ms.
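Plans and timings depend heavily on the DBMS, so it is worth repeating this comparison on your own system. As a rough sketch of how to do that, the snippet below prints the plan for both forms using SQLite's `EXPLAIN QUERY PLAN`. SQLite and the table name `stock_moves` are assumptions for illustration; the plans it reports will not match the ones described above, which came from a different database.

```python
import sqlite3

# Hypothetical table; SQLite is assumed here only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock_moves "
    "(Company TEXT, Warehouse TEXT, Item TEXT, quantity INTEGER)"
)

WINDOW_SQL = """
    SELECT DISTINCT Company, Warehouse, Item,
           SUM(quantity) OVER (PARTITION BY Company, Warehouse, Item) AS stock
    FROM stock_moves
"""

GROUP_SQL = """
    SELECT Company, Warehouse, Item, SUM(quantity) AS stock
    FROM stock_moves
    GROUP BY Company, Warehouse, Item
"""

def plan(sql):
    """Return the human-readable plan steps SQLite reports for a query."""
    # EXPLAIN QUERY PLAN is SQLite-specific; the fourth column holds the text.
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

window_plan = plan(WINDOW_SQL)
group_plan = plan(GROUP_SQL)

print("PARTITION BY plan:")
for step in window_plan:
    print("  ", step)

print("GROUP BY plan:")
for step in group_plan:
    print("  ", step)
```

Other systems expose the same information through their own commands, such as `EXPLAIN` in MySQL and PostgreSQL.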

Readability / Maintainability:

Winner: GROUP BY

Judging by the commenters here, most developers find PARTITION BY less readable, so it will probably also be harder to maintain in the future.

Flexibility:

Winner: PARTITION BY

PARTITION BY gives you more flexibility in choosing the grouping columns. With GROUP BY you can have only one set of grouping columns for all aggregated columns. With DISTINCT + PARTITION BY you can use a different set of columns in each partition. Also, on some DBMSs you can choose from more aggregation/analytic functions in the OVER clause.
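To illustrate the flexibility point, here is a minimal sketch in which each output column uses its own partition, something a single GROUP BY cannot express without subqueries or joins. The table name `stock_moves`, the column aliases, and the sample rows are hypothetical, and SQLite 3.25 or newer is assumed for window function support.

```python
import sqlite3

# Hypothetical table and rows, used only to illustrate per-column partitions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock_moves "
    "(Company TEXT, Warehouse TEXT, Item TEXT, quantity INTEGER)"
)
conn.executemany(
    "INSERT INTO stock_moves VALUES (?, ?, ?, ?)",
    [
        ("A", "W1", "bolt", 5),
        ("A", "W1", "bolt", 7),
        ("A", "W1", "nut", 2),
        ("A", "W2", "bolt", 3),
        ("B", "W1", "nut", 4),
    ],
)

# Three sums at three different levels of detail in one SELECT.
rows = conn.execute("""
    SELECT DISTINCT Company, Warehouse, Item,
           SUM(quantity) OVER (PARTITION BY Company, Warehouse, Item) AS item_stock,
           SUM(quantity) OVER (PARTITION BY Company, Warehouse)       AS warehouse_stock,
           SUM(quantity) OVER (PARTITION BY Company)                  AS company_stock
    FROM stock_moves
    ORDER BY Company, Warehouse, Item
""").fetchall()

for row in rows:
    print(row)
```

Each row carries the item-level total alongside the totals for its warehouse and its company, which is the case where the windowed form earns its extra complexity.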

Andris answered Oct 10 '22 04:10