
PostgreSQL - "DISTINCT ON" and "GROUP BY" syntax

Tags: sql, postgresql

I realized that a database query was returning unexpected results due to my improper use of "DISTINCT ON" and "GROUP BY".

I'm hoping someone can set me straight on this. The actual query is quite complex, so I'll dumb it down:

I have a table/inner query that consists of an object_id and a timestamp:

CREATE TABLE test_select ( object_id INT , event_timestamp timestamp );
COPY test_select (object_id , event_timestamp) FROM stdin (DELIMITER '|');
1           | 2013-01-27 21:01:20
1           | 2012-06-28 14:36:26
1           | 2013-02-21 04:16:48
2           | 2012-06-27 19:53:05
2           | 2013-02-03 17:35:58
3           | 2012-06-14 20:17:00
3           | 2013-02-15 19:03:34
4           | 2012-06-13 13:59:47
4           | 2013-02-23 06:31:16
5           | 2012-07-03 01:45:56
5           | 2012-06-11 21:33:26
\.

I'm trying to select each distinct ID once, ordered by its most recent timestamp in reverse chronological order,

so the results should be [ 4, 1, 3, 2, 5 ]

I think this does what I need (it seems to):

SELECT object_id  
FROM test_select 
GROUP BY object_id 
ORDER BY max(event_timestamp) DESC
;

For testing/auditing purposes, I sometimes want to include the timestamp field. I can't figure out how to include another field in that query.

Can anyone point out glaring problems in my SQL above, or suggest how to include the auditing info?

Jonathan Vanasco asked Sep 22 '13 00:09



2 Answers

To select all columns, and not only object_id and MAX(event_timestamp), you can use DISTINCT ON, which keeps the first row of each object_id group in ORDER BY order:

SELECT DISTINCT ON (object_id) 
    object_id, event_timestamp    ---, more columns
FROM test_select 
ORDER BY object_id, event_timestamp DESC ;
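
To make the "first row of each group" behaviour concrete, here is a minimal sketch in plain Python that mimics what the DISTINCT ON query above does to the sample rows from the question. It is only an illustration of the semantics, not how PostgreSQL executes the query, and the helper name distinct_on_object_id is hypothetical.

```python
from itertools import groupby

# Sample rows from the question: (object_id, event_timestamp).
# ISO-style timestamp strings sort chronologically, so plain strings are enough here.
rows = [
    (1, "2013-01-27 21:01:20"),
    (1, "2012-06-28 14:36:26"),
    (1, "2013-02-21 04:16:48"),
    (2, "2012-06-27 19:53:05"),
    (2, "2013-02-03 17:35:58"),
    (3, "2012-06-14 20:17:00"),
    (3, "2013-02-15 19:03:34"),
    (4, "2012-06-13 13:59:47"),
    (4, "2013-02-23 06:31:16"),
    (5, "2012-07-03 01:45:56"),
    (5, "2012-06-11 21:33:26"),
]


def distinct_on_object_id(rows):
    """Hypothetical helper mimicking
    DISTINCT ON (object_id) ... ORDER BY object_id, event_timestamp DESC."""
    # Step 1: ORDER BY object_id ASC, event_timestamp DESC.
    # Two stable sorts: secondary key first, then primary key.
    ordered = sorted(rows, key=lambda r: r[1], reverse=True)
    ordered = sorted(ordered, key=lambda r: r[0])
    # Step 2: keep only the first row of each object_id group.
    return [next(group) for _, group in groupby(ordered, key=lambda r: r[0])]


latest = distinct_on_object_id(rows)
for object_id, event_timestamp in latest:
    print(object_id, event_timestamp)
```

Note that the result comes back ordered by object_id, not by timestamp, which is why a second ordering step is needed if you want reverse chronological output.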

If you want the results ordered by event_timestamp DESC and not by object_id, you need to wrap the query in a derived table or a CTE, because the DISTINCT ON expressions must match the leftmost ORDER BY expressions:

SELECT *
FROM 
  ( SELECT DISTINCT ON (object_id) 
        object_id, event_timestamp    ---, more columns
    FROM test_select 
    ORDER BY object_id, event_timestamp DESC 
  ) AS t
ORDER BY event_timestamp DESC ;

Alternatively, you can use window functions, like ROW_NUMBER():

WITH cte AS
  ( SELECT ROW_NUMBER() OVER (PARTITION BY object_id 
                              ORDER BY event_timestamp DESC) 
             AS rn, 
           object_id, event_timestamp    ---, more columns
    FROM test_select 
  )
SELECT object_id, event_timestamp    ---, more columns
FROM cte
WHERE rn = 1
ORDER BY event_timestamp DESC ;

Or use the aggregate MAX() as a window function with OVER:

WITH cte AS
  ( SELECT MAX(event_timestamp) OVER (PARTITION BY object_id) 
             AS max_event_timestamp, 
           object_id, event_timestamp    ---, more columns
    FROM test_select 
  )
SELECT object_id, event_timestamp    ---, more columns
FROM cte
WHERE event_timestamp = max_event_timestamp
ORDER BY event_timestamp DESC ;
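
One difference between the two window-function variants is worth checking: if an object has two rows sharing the same latest timestamp, the rn = 1 filter returns one of them, while the MAX() OVER comparison returns both. The sketch below shows this using Python's built-in sqlite3 module as a stand-in for PostgreSQL. That is an assumption made so it runs without a server; it needs an SQLite build with window functions (3.25 or later), and the duplicated row is invented test data, not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_select (object_id INT, event_timestamp TEXT)")
conn.executemany(
    "INSERT INTO test_select VALUES (?, ?)",
    [
        (1, "2013-02-21 04:16:48"),
        (1, "2013-02-21 04:16:48"),  # deliberate tie on the latest timestamp
        (1, "2012-06-28 14:36:26"),
        (2, "2013-02-03 17:35:58"),
    ],
)

# Variant 1: number the rows per object, newest first, and keep row 1.
row_number_sql = """
WITH cte AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY object_id
                              ORDER BY event_timestamp DESC) AS rn,
           object_id, event_timestamp
    FROM test_select
)
SELECT object_id, event_timestamp
FROM cte
WHERE rn = 1
ORDER BY event_timestamp DESC
"""

# Variant 2: keep every row whose timestamp equals the per-object maximum.
max_over_sql = """
WITH cte AS (
    SELECT MAX(event_timestamp) OVER (PARTITION BY object_id)
               AS max_event_timestamp,
           object_id, event_timestamp
    FROM test_select
)
SELECT object_id, event_timestamp
FROM cte
WHERE event_timestamp = max_event_timestamp
ORDER BY event_timestamp DESC
"""

row_number_rows = conn.execute(row_number_sql).fetchall()
max_over_rows = conn.execute(max_over_sql).fetchall()

print(len(row_number_rows))  # one row per object_id
print(len(max_over_rows))    # the tied rows for object 1 both survive
```

With the question's data there are no ties, so both variants give the same answer. Which one you want depends on whether duplicates of the latest event should show up in the audit output.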
ypercubeᵀᴹ answered Oct 21 '22 22:10


It is probably not the best way of dealing with this, but you can try using a window function:

SELECT DISTINCT object_id,
       MAX(event_timestamp) OVER (PARTITION BY object_id) AS max_event_timestamp
FROM test_select
ORDER BY max_event_timestamp DESC;

On the other hand, a plain GROUP BY works as well:

SELECT object_id, MAX(event_timestamp) as max_event_timestamp
FROM test_select 
GROUP BY object_id 
ORDER BY max_event_timestamp DESC;
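
As a quick sanity check, the GROUP BY query can be run against the sample data from the question. The sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption, chosen only so it runs without a server) and stores the timestamps as text, which still sorts chronologically in this format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE test_select (object_id INT, event_timestamp TEXT)")
conn.executemany(
    "INSERT INTO test_select VALUES (?, ?)",
    [
        (1, "2013-01-27 21:01:20"),
        (1, "2012-06-28 14:36:26"),
        (1, "2013-02-21 04:16:48"),
        (2, "2012-06-27 19:53:05"),
        (2, "2013-02-03 17:35:58"),
        (3, "2012-06-14 20:17:00"),
        (3, "2013-02-15 19:03:34"),
        (4, "2012-06-13 13:59:47"),
        (4, "2013-02-23 06:31:16"),
        (5, "2012-07-03 01:45:56"),
        (5, "2012-06-11 21:33:26"),
    ],
)

# One row per object_id with its latest timestamp, newest object first.
grouped = conn.execute(
    """
    SELECT object_id, MAX(event_timestamp) AS max_event_timestamp
    FROM test_select
    GROUP BY object_id
    ORDER BY max_event_timestamp DESC
    """
).fetchall()

ordered_ids = [object_id for object_id, _ in grouped]
print(ordered_ids)
for object_id, max_event_timestamp in grouped:
    print(object_id, max_event_timestamp)
```

The ids come out in the order the question expects, with the latest timestamp alongside each one for auditing. This only covers the timestamp itself; pulling other columns from the same row needs one of the DISTINCT ON or window-function approaches.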
zero323 answered Oct 21 '22 23:10