How do I (or can I) SELECT DISTINCT on multiple columns?

I need to retrieve all rows from a table where the combination of two columns is unique. That is, I want every sale that has no other sale on the same day for the same price. The sales that are unique based on day and price will get updated to an active status.

So I'm thinking:

UPDATE sales
SET status = 'ACTIVE'
WHERE id IN (SELECT DISTINCT (saleprice, saledate), id, count(id)
             FROM sales
             HAVING count = 1)

But my brain hurts going any further than that.

asked Sep 10 '08 15:09

sheats

2 Answers

SELECT DISTINCT a,b,c FROM t

is roughly equivalent to:

SELECT a,b,c FROM t GROUP BY a,b,c

It's a good idea to get used to the GROUP BY syntax, as it's more powerful.
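
As a minimal sketch of that equivalence, the snippet below runs both forms against an in-memory SQLite database through Python's sqlite3 module. The table t, its columns a, b, c, and the sample rows are made up for illustration, and SQLite stands in for whatever database you actually use.

```python
import sqlite3

# Hypothetical table t(a, b, c) holding one duplicated row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(1, 1, 1), (1, 1, 1), (1, 2, 3), (2, 2, 2)],
)

# Both statements return each (a, b, c) combination once.
distinct_rows = sorted(conn.execute("SELECT DISTINCT a, b, c FROM t").fetchall())
grouped_rows = sorted(conn.execute("SELECT a, b, c FROM t GROUP BY a, b, c").fetchall())

# Only GROUP BY can also report how often each combination occurs.
counts = conn.execute(
    "SELECT a, b, c, COUNT(*) FROM t GROUP BY a, b, c ORDER BY a, b, c"
).fetchall()
```

The extra COUNT(*) column is what the HAVING COUNT(*) = 1 filter relies on when picking combinations that occur exactly once.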

For your query, I'd do it like this:

UPDATE sales
SET status = 'ACTIVE'
WHERE id IN (
    SELECT id
    FROM sales S
    INNER JOIN (
        SELECT saleprice, saledate
        FROM sales
        GROUP BY saleprice, saledate
        HAVING COUNT(*) = 1
    ) T
    ON S.saleprice = T.saleprice AND S.saledate = T.saledate
)

answered Oct 16 '22 02:10

Joel Coehoorn

If you put together the answers so far, clean up and improve, you would arrive at this superior query:

UPDATE sales
SET    status = 'ACTIVE'
WHERE  (saleprice, saledate) IN (
    SELECT saleprice, saledate
    FROM   sales
    GROUP  BY saleprice, saledate
    HAVING count(*) = 1
    );

This is much faster than either of them. In my tests on PostgreSQL 8.4 and 9.1 it beat the currently accepted answer by a factor of 10 to 15.
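
To see which rows the row-value IN form picks, here is a small sketch using Python's sqlite3 module with an in-memory database. The sample sales rows are invented, and SQLite is only a stand-in for PostgreSQL (row values in IN need SQLite 3.15 or later); it says nothing about the timing claim.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, saleprice INTEGER, "
    "saledate TEXT, status TEXT)"
)
# Hypothetical data: sales 1 and 2 share price and day, 3 and 4 are unique.
conn.executemany(
    "INSERT INTO sales (id, saleprice, saledate) VALUES (?, ?, ?)",
    [
        (1, 100, "2008-09-01"),
        (2, 100, "2008-09-01"),
        (3, 100, "2008-09-02"),
        (4, 250, "2008-09-01"),
    ],
)

# Same statement as in the answer: match on the (saleprice, saledate) pair.
conn.execute("""
    UPDATE sales
    SET    status = 'ACTIVE'
    WHERE  (saleprice, saledate) IN (
        SELECT saleprice, saledate
        FROM   sales
        GROUP  BY saleprice, saledate
        HAVING count(*) = 1
    )
""")

active_ids = [
    row[0]
    for row in conn.execute("SELECT id FROM sales WHERE status = 'ACTIVE' ORDER BY id")
]
```

With this data only sales 3 and 4 end up active, since sales 1 and 2 share both price and day.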

But this is still far from optimal. Use a NOT EXISTS (anti-)semi-join for even better performance. EXISTS is standard SQL, has been around forever (at least since PostgreSQL 7.2, long before this question was asked) and fits the presented requirements perfectly:

UPDATE sales s
SET    status = 'ACTIVE'
WHERE  NOT EXISTS (
   SELECT FROM sales s1                     -- SELECT list can be empty for EXISTS
   WHERE  s.saleprice = s1.saleprice
   AND    s.saledate  = s1.saledate
   AND    s.id <> s1.id                     -- except for row itself
   )
AND    s.status IS DISTINCT FROM 'ACTIVE';  -- avoid empty updates. see below


Unique key to identify row

If you don't have a primary or unique key for the table (id in the example), you can substitute with the system column ctid for the purpose of this query (but not for some other purposes):

   AND    s1.ctid <> s.ctid

Every table should have a primary key. Add one if you don't have one yet. I suggest a serial column or, in Postgres 10+, an IDENTITY column.

In-order sequence generation
Auto increment table column

How is this faster?

The subquery in the EXISTS anti-semi-join can stop evaluating as soon as the first dupe is found (no point in looking further). For a base table with few duplicates this is only mildly more efficient. With lots of duplicates this becomes way more efficient.
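
The speed difference depends on the planner and the data and is not measured here. The sketch below only checks, on invented NULL-free sample rows, that the NOT EXISTS anti-semi-join selects the same sales as the GROUP BY form. It uses Python's sqlite3 module as a stand-in for PostgreSQL, and writes SELECT 1 because SQLite does not accept an empty SELECT list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, saleprice INTEGER, saledate TEXT)"
)
# Hypothetical data: sales 1 and 2 are duplicates of each other.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, 100, "2008-09-01"),
        (2, 100, "2008-09-01"),
        (3, 100, "2008-09-02"),
        (4, 250, "2008-09-01"),
    ],
)

# Anti-semi-join: a sale qualifies if no *other* sale shares price and day.
# The inner lookup can stop at the first duplicate it finds.
not_exists_ids = [
    row[0]
    for row in conn.execute("""
        SELECT s.id
        FROM   sales AS s
        WHERE  NOT EXISTS (
            SELECT 1
            FROM   sales AS s1
            WHERE  s1.saleprice = s.saleprice
            AND    s1.saledate  = s.saledate
            AND    s1.id <> s.id
        )
        ORDER  BY s.id
    """)
]

# GROUP BY form: every group has to be counted completely.
group_by_ids = [
    row[0]
    for row in conn.execute("""
        SELECT id
        FROM   sales
        WHERE  (saleprice, saledate) IN (
            SELECT saleprice, saledate
            FROM   sales
            GROUP  BY saleprice, saledate
            HAVING count(*) = 1
        )
        ORDER  BY id
    """)
]
```

Both lists hold the same ids for this data. To compare actual cost, inspect the plans on your own tables, for example with EXPLAIN ANALYZE in PostgreSQL.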

Exclude empty updates

For rows that already have status = 'ACTIVE' this update would not change anything, but it would still insert a new row version at full cost (minor exceptions apply). Normally, you do not want this. Add another WHERE condition, as in the query above (s.status IS DISTINCT FROM 'ACTIVE'), to avoid this and make it even faster.

If status is defined NOT NULL, you can simplify to:

AND status <> 'ACTIVE';

The data type of the column must support the <> operator. Some types like json don't. See:

How to query a json column for empty objects?
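
As a rough illustration of how many rows each variant touches, the sketch below counts affected rows with and without the extra condition. It runs on SQLite through Python's sqlite3 module, so it does not reproduce PostgreSQL's row versioning; it only shows the number of rows the UPDATE rewrites. The sample rows are invented, and SQLite's IS NOT operator is used as the null-safe counterpart of IS DISTINCT FROM.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, saleprice INTEGER, "
    "saledate TEXT, status TEXT)"
)
# Hypothetical data: sales 3 and 4 are unique, and sale 3 is already ACTIVE.
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?, ?)",
    [
        (1, 100, "2008-09-01", None),
        (2, 100, "2008-09-01", None),
        (3, 100, "2008-09-02", "ACTIVE"),
        (4, 250, "2008-09-01", None),
    ],
)

# Shared condition: no other sale with the same price and day.
UNIQUE_SALE = """
    NOT EXISTS (
        SELECT 1
        FROM   sales AS s1
        WHERE  s1.saleprice = sales.saleprice
        AND    s1.saledate  = sales.saledate
        AND    s1.id <> sales.id
    )
"""

# With the guard, only sale 4 is written; sale 3 is skipped.
guarded = conn.execute(
    f"UPDATE sales SET status = 'ACTIVE' WHERE {UNIQUE_SALE} AND status IS NOT 'ACTIVE'"
)
guarded_count = guarded.rowcount

# Without the guard, sales 3 and 4 are both rewritten although nothing changes.
unguarded = conn.execute(f"UPDATE sales SET status = 'ACTIVE' WHERE {UNIQUE_SALE}")
unguarded_count = unguarded.rowcount
```

Running the unguarded statement after the guarded one makes the point: every row it touches is an empty update.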

Subtle difference in NULL handling

This query (unlike the currently accepted answer by Joel) does not treat NULL values as equal. The following two rows for (saleprice, saledate) would qualify as "distinct" (though looking identical to the human eye):

(123, NULL)
(123, NULL)

Such rows also pass a unique index and almost any other uniqueness check, since NULL values do not compare equal according to the SQL standard. See:

Create unique constraint with null columns

OTOH, GROUP BY, DISTINCT or DISTINCT ON () treat NULL values as equal. Use an appropriate query style depending on what you want to achieve. You can still use this faster query with IS NOT DISTINCT FROM instead of = for any or all comparisons to make NULL compare equal. More:

How to delete duplicate rows without unique identifier

If all columns being compared are defined NOT NULL, there is no room for disagreement.
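
The difference can be reproduced with the two rows from above. This is a sketch on SQLite through Python's sqlite3 module, with invented ids 1 and 2; SQLite's IS operator plays the role of IS NOT DISTINCT FROM here, which is an assumption about the stand-in database and not part of the original queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, saleprice INTEGER, saledate TEXT)"
)
# The two rows from the example: same price, saledate unknown in both.
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [(1, 123, None), (2, 123, None)])

# GROUP BY folds both NULL dates into a single group of two rows.
group_counts = conn.execute(
    "SELECT saleprice, saledate, count(*) FROM sales GROUP BY saleprice, saledate"
).fetchall()

# NOT EXISTS with "=": NULL = NULL is not true, so neither row finds a duplicate.
unique_with_equals = [
    row[0]
    for row in conn.execute("""
        SELECT s.id FROM sales AS s
        WHERE  NOT EXISTS (
            SELECT 1 FROM sales AS s1
            WHERE  s1.saleprice = s.saleprice
            AND    s1.saledate  = s.saledate
            AND    s1.id <> s.id
        )
        ORDER  BY s.id
    """)
]

# NOT EXISTS with a null-safe comparison: the rows now count as duplicates.
unique_with_null_safe = [
    row[0]
    for row in conn.execute("""
        SELECT s.id FROM sales AS s
        WHERE  NOT EXISTS (
            SELECT 1 FROM sales AS s1
            WHERE  s1.saleprice IS s.saleprice
            AND    s1.saledate  IS s.saledate
            AND    s1.id <> s.id
        )
        ORDER  BY s.id
    """)
]
```

With "=" both rows are reported as unique; with the null-safe comparison neither is, which matches what GROUP BY sees.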

answered Oct 16 '22 02:10

Erwin Brandstetter


Tags: sql, duplicates, postgresql, sql-update, distinct