I want to run this query:
SELECT DISTINCT ON (address_id) purchases.address_id, purchases.* FROM purchases WHERE purchases.product_id = 1 ORDER BY purchases.purchased_at DESC
But I get this error:
PG::Error: ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions
Adding address_id
as first ORDER BY
expression silences the error, but I really don't want to add sorting over address_id
. Is it possible to do without ordering by address_id
?
Either DISTINCT doesn't work (because the added extended sort key column changes its semantics), or ORDER BY doesn't work (because after DISTINCT we can no longer access the extended sort key column).
Removing duplicate rows from a query result set in PostgreSQL can be done using the SELECT statement with the DISTINCT clause. It keeps one row for each group of duplicates. The DISTINCT clause can be used for a single column or for a list of columns.
Answer. Yes, the DISTINCT clause can be applied to any valid SELECT query. It is important to note that DISTINCT will filter out all rows that are not unique in terms of all selected columns.
From experiments, I founded that the GROUP BY is 10+ times faster than DISTINCT. They are different. So what I learned is: GROUP-BY is anyway not worse than DISTINCT, and it is better sometimes.
Documentation says:
DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. [...] Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first. [...] The DISTINCT ON expression(s) must match the leftmost ORDER BY expression(s).
Official documentation
So you'll have to add the address_id
to the order by.
Alternatively, if you're looking for the full row that contains the most recent purchased product for each address_id
and that result sorted by purchased_at
then you're trying to solve a greatest N per group problem which can be solved by the following approaches:
The general solution that should work in most DBMSs:
SELECT t1.* FROM purchases t1 JOIN ( SELECT address_id, max(purchased_at) max_purchased_at FROM purchases WHERE product_id = 1 GROUP BY address_id ) t2 ON t1.address_id = t2.address_id AND t1.purchased_at = t2.max_purchased_at ORDER BY t1.purchased_at DESC
A more PostgreSQL-oriented solution based on @hkf's answer:
SELECT * FROM ( SELECT DISTINCT ON (address_id) * FROM purchases WHERE product_id = 1 ORDER BY address_id, purchased_at DESC ) t ORDER BY purchased_at DESC
Problem clarified, extended and solved here: Selecting rows ordered by some column and distinct on another
A subquery can solve it:
SELECT * FROM ( SELECT DISTINCT ON (address_id) * FROM purchases WHERE product_id = 1 ) p ORDER BY purchased_at DESC;
Leading expressions in ORDER BY
have to agree with columns in DISTINCT ON
, so you can't order by different columns in the same SELECT
.
Only use an additional ORDER BY
in the subquery if you want to pick a particular row from each set:
SELECT * FROM ( SELECT DISTINCT ON (address_id) * FROM purchases WHERE product_id = 1 ORDER BY address_id, purchased_at DESC -- get "latest" row per address_id ) p ORDER BY purchased_at DESC;
If purchased_at
can be NULL
, use DESC NULLS LAST
- and match your index for best performance. See:
Related, with more explanation:
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With