I have two tables, products
and meta
. They are in relation 1:N where each product row has at least one meta row via foreign key.
(viz. SQLfiddle: http://sqlfiddle.com/#!15/c8f34/1)
I need to join these two tables but i need to filter only unique products. When I try this query, everything is ok (4 rows returned):
SELECT DISTINCT(product_id)
FROM meta JOIN products ON products.id = meta.product_id
but when I try to select all columns the DISTINCT rule no longer applies to results, as 8 rows instead of 4 is returned.
SELECT DISTINCT(product_id), *
FROM meta JOIN products ON products.id = meta.product_id
I have tried many approaches like trying to DISTINCT
or GROUP BY
on sub-query but always with same result.
Solution. Select column values in a specific order within rows to make rows with duplicate sets of values identical. Then you can use SELECT DISTINCT to remove duplicates. Alternatively, retrieve rows in such a way that near-duplicates are not even selected.
DISTINCT is used to filter unique records out of all records in the table. It removes the duplicate rows. SELECT DISTINCT will always be the same, or faster than a GROUP BY.
A left join will produce duplicates in a 1-to-many relationship. Regardless of how many records are in your first table, if you left join to a table with multiple rows for each record in the first table you'll get more than one row.
Check for Duplicates in Multiple Tables With INNER JOIN Use the INNER JOIN function to find duplicates that exist in multiple tables. Sample syntax for an INNER JOIN function looks like this: SELECT column_name FROM table1 INNER JOIN table2 ON table1.
While retrieving all or most rows from a table, the fastest way for this type of query typically is to aggregate / disambiguate first and join later:
SELECT *
FROM products p
JOIN (
SELECT DISTINCT ON (product_id) *
FROM meta
ORDER BY product_id, id DESC
) m ON m.product_id = p.id;
The more rows in meta
per row in products
, the bigger the impact on performance.
Of course, you'll want to add an ORDER BY
clause in the subquery do define which row to pick form each set in the subquery. @Craig and @Clodoaldo already told you about that. I am returning the meta
row with the highest id
.
SQL Fiddle.
Details for DISTINCT ON
:
Still, this is not always the fastest solution. Depending on data distribution there are various other query styles. For this simple case involving another join, this one ran considerably faster in a test with big tables:
SELECT p.*, sub.meta_id, m.product_id, m.price, m.flag
FROM (
SELECT product_id, max(id) AS meta_id
FROM meta
GROUP BY 1
) sub
JOIN meta m ON m.id = sub.meta_id
JOIN products p ON p.id = sub.product_id;
If you wouldn't use the non-descriptive id
as column names, we would not run into naming collisions and could simply write SELECT p.*, m.*
. (I never use id
as column name.)
If performance is your paramount requirement, consider more options:
MATERIALIZED VIEW
with pre-aggregated data from meta
, if your data does not change (much).meta
table with many rows per product (relatively few distinct product_id
).I think you might be looking for DISTINCT ON
, a PostgreSQL extension feature:
SELECT
DISTINCT ON(product_id)
*
FROM meta
INNER JOIN products ON products.id = meta.product_id;
http://sqlfiddle.com/#!15/c8f34/18
However, note that without an ORDER BY
the results are not guaranteed to be consistent; the database can pick any row it wants from the matching rows.
Use distinct on
as suggested by @Craig's answer but combined with the order by
clause as explicated in the comments. SQL Fiddle
select distinct on(m.product_id) *
from
meta m
inner join
products p on p.id = m.product_id
order by m.product_id, m.id desc;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With