GROUP or DISTINCT after JOIN returns duplicates

I have two tables, products and meta. They are in a 1:N relationship: each product row has at least one meta row linked via a foreign key.

(See the SQL Fiddle: http://sqlfiddle.com/#!15/c8f34/1)

I need to join these two tables, but I want each product to appear only once in the result. When I try this query, everything is OK (4 rows returned):

SELECT DISTINCT(product_id)
FROM meta JOIN products ON products.id = meta.product_id

But when I try to select all columns, the DISTINCT rule no longer seems to apply to the result: 8 rows are returned instead of 4.

SELECT DISTINCT(product_id), *
FROM meta JOIN products ON products.id = meta.product_id

I have tried many approaches, such as applying DISTINCT or GROUP BY in a subquery, but always with the same result.

asked Aug 25 '14 13:08

Raito Akehanareru

3 Answers

When retrieving all or most rows from a table, the fastest way for this type of query is typically to aggregate / disambiguate first and join later:

SELECT *
FROM   products p
JOIN  (
   SELECT DISTINCT ON (product_id) *
   FROM   meta
   ORDER  BY product_id, id DESC
   ) m ON m.product_id = p.id;

The more rows in meta per row in products, the bigger the impact on performance.

Of course, you'll want to add an ORDER BY clause in the subquery to define which row to pick from each set in the subquery. @Craig and @Clodoaldo already told you about that. I am returning the meta row with the highest id.
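To see which row gets picked, here is a minimal sketch with made-up data. The temp tables mirror the column names used in this answer (id, product_id, price, flag); the name column and all values are assumptions, not the data from the fiddle.

```sql
-- Hypothetical sample data: 4 products, 2 meta rows each (8 in total).
CREATE TEMP TABLE products (id int PRIMARY KEY, name text);
CREATE TEMP TABLE meta (
   id         int PRIMARY KEY,
   product_id int NOT NULL REFERENCES products,
   price      numeric,
   flag       boolean
);

INSERT INTO products VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
INSERT INTO meta VALUES
   (1, 1, 10, true), (2, 1, 12, false),
   (3, 2, 20, true), (4, 2, 22, true),
   (5, 3, 30, false), (6, 3, 33, true),
   (7, 4, 40, true), (8, 4, 44, false);

-- Explicit column list instead of * so the two id columns are easy to tell apart.
SELECT p.id AS product_id, m.id AS meta_id, m.price
FROM   products p
JOIN  (
   SELECT DISTINCT ON (product_id) *
   FROM   meta
   ORDER  BY product_id, id DESC   -- highest meta id wins within each product
   ) m ON m.product_id = p.id
ORDER  BY p.id;
-- One row per product, each with its highest meta id:
-- (1, 2, 12), (2, 4, 22), (3, 6, 33), (4, 8, 44)
```

Changing id DESC to id in the subquery would return the lowest meta id per product instead.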

SQL Fiddle.

Details for DISTINCT ON:

Select first row in each GROUP BY group?

Optimize performance

Still, this is not always the fastest solution. Depending on data distribution there are various other query styles. For this simple case involving another join, this one ran considerably faster in a test with big tables:

SELECT p.*, sub.meta_id, m.product_id, m.price, m.flag
FROM  (
   SELECT product_id, max(id) AS meta_id
   FROM   meta
   GROUP  BY 1
   ) sub
JOIN meta     m ON m.id = sub.meta_id
JOIN products p ON p.id = sub.product_id;

If you didn't use the non-descriptive id as a column name, we would not run into naming collisions and could simply write SELECT p.*, m.*. (I never use id as a column name.)
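As a rough sketch of that naming point, assume hypothetical tables product(product_id, name) and product_meta(meta_id, product_id, price, flag) in place of the originals. With descriptive key names, the explicit column list is no longer needed:

```sql
-- Hypothetical renamed schema (not the tables from the question):
--   product(product_id, name)
--   product_meta(meta_id, product_id, price, flag)
SELECT p.*, m.*
FROM  (
   SELECT product_id, max(meta_id) AS meta_id   -- latest meta row per product
   FROM   product_meta
   GROUP  BY product_id
   ) sub
JOIN product_meta m ON m.meta_id    = sub.meta_id
JOIN product      p ON p.product_id = sub.product_id;
```

product_id still shows up twice in the output (once per table), but both copies carry the same value, so nothing is ambiguous.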

If performance is your paramount requirement, consider more options:

a MATERIALIZED VIEW with pre-aggregated data from meta, if your data does not change (much).
a recursive CTE emulating a loose index scan for a big meta table with many rows per product (relatively few distinct product_id).
This is the only way I know to use an index for a DISTINCT query over the whole table.
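Both options could look roughly like this. The view name meta_latest and the index name are assumptions for illustration. Materialized views and LATERAL both require PostgreSQL 9.3 or later.

```sql
-- Option 1: materialized view holding one (latest) meta row per product.
-- "meta_latest" is a hypothetical name.
CREATE MATERIALIZED VIEW meta_latest AS
SELECT DISTINCT ON (product_id) *
FROM   meta
ORDER  BY product_id, id DESC;

-- Re-run whenever meta has changed enough to matter.
REFRESH MATERIALIZED VIEW meta_latest;

SELECT *
FROM   products p
JOIN   meta_latest m ON m.product_id = p.id;

-- Option 2: recursive CTE emulating a loose index scan.
-- Assumes an index matching the sort order, for example:
CREATE INDEX meta_product_id_id_idx ON meta (product_id, id DESC);

WITH RECURSIVE cte AS (
   (  -- parentheses are required around a SELECT with ORDER BY / LIMIT here
   SELECT product_id, id AS meta_id
   FROM   meta
   ORDER  BY product_id, id DESC
   LIMIT  1
   )
   UNION ALL
   SELECT l.product_id, l.meta_id
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT m.product_id, m.id AS meta_id
      FROM   meta m
      WHERE  m.product_id > c.product_id   -- jump to the next product
      ORDER  BY m.product_id, m.id DESC
      LIMIT  1
      ) l
   )
SELECT p.*, m.price, m.flag, c.meta_id
FROM   cte c
JOIN   meta     m ON m.id = c.meta_id
JOIN   products p ON p.id = c.product_id;
```

The recursion stops on its own: once no larger product_id exists, the LATERAL subquery returns no row and the recursive term yields nothing.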

answered Sep 29 '22 06:09

Erwin Brandstetter

I think you might be looking for DISTINCT ON, a PostgreSQL extension feature:

SELECT 
  DISTINCT ON(product_id)
  * 
FROM meta 
INNER JOIN products ON products.id = meta.product_id;

http://sqlfiddle.com/#!15/c8f34/18

However, note that without an ORDER BY the results are not guaranteed to be consistent; the database can pick any row it wants from the matching rows.
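For background on why the second query in the question returns 8 rows: plain DISTINCT always applies to the whole select list, and the parentheses in DISTINCT(product_id) are just an ordinary parenthesized expression, not a function call. A short sketch against the question's tables:

```sql
-- These two statements are equivalent; DISTINCT covers every output column,
-- so rows that differ in any meta column stay separate.
SELECT DISTINCT (product_id), *
FROM   meta JOIN products ON products.id = meta.product_id;

SELECT DISTINCT product_id, *
FROM   meta JOIN products ON products.id = meta.product_id;

-- DISTINCT ON (...) is the form that deduplicates on the listed expressions only.
SELECT DISTINCT ON (product_id) *
FROM   meta JOIN products ON products.id = meta.product_id
ORDER  BY product_id, meta.id DESC;
```

The ORDER BY in the last statement is an assumption about which row should win (the highest meta id); any other deterministic ordering that starts with product_id works the same way.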

answered Sep 29 '22 07:09

Craig Ringer

Use distinct on as suggested by @Craig's answer but combined with the order by clause as explicated in the comments. SQL Fiddle

select distinct on(m.product_id) * 
from
    meta m
    inner join
    products p on p.id = m.product_id
order by m.product_id, m.id desc;

answered Sep 29 '22 07:09

Clodoaldo Neto

Tags: sql, join, postgresql, group-by, distinct