I am trying to write a query in PostgreSQL that pulls a set of ordered data and filters it by a distinct field. I also need to pull several other fields from the same table row, but they need to be left out of the distinct evaluation. Example:
SELECT DISTINCT(user_id) user_id,
created_at
FROM creations
ORDER BY created_at
LIMIT 20
I need the user_id to be DISTINCT, but don't care whether the created_at date is unique or not. Because the created_at date is being included in the evaluation, I am getting duplicate user_id values in my result set.
Also, the data must be ordered by the date, so using DISTINCT ON is not an option here. It requires that the DISTINCT ON field be the first field in the ORDER BY clause, and that does not deliver the results that I seek.
How do I properly use the DISTINCT clause but limit its scope to only one field while still selecting other fields?
The SELECT DISTINCT statement is used to return only distinct (different) values. Inside a table, a column often contains many duplicate values; and sometimes you only want to list the different (distinct) values.
Yes, DISTINCT works on all combinations of column values for all columns in the SELECT clause.
Adding the DISTINCT keyword to a SELECT query causes it to return only unique combinations of values for the specified column list, so duplicate rows are removed from the result set. Since DISTINCT operates on all of the fields in SELECT's column list, it can't be applied to an individual field that is part of a larger column list.
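To make the point concrete, here is a minimal sketch against the creations table from the question. The parentheses in DISTINCT(user_id) only group an expression; they do not narrow what DISTINCT compares, so the first query below behaves the same way as the one in the question. The second query is just an equivalent GROUP BY formulation, included for comparison.

```sql
-- DISTINCT compares whole output rows, so a user_id shows up again
-- whenever it is paired with a different created_at value.
SELECT DISTINCT user_id, created_at
FROM creations;

-- Equivalent formulation: grouping by every selected column
-- removes exactly the same duplicate rows.
SELECT user_id, created_at
FROM creations
GROUP BY user_id, created_at;
```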
As you've discovered, standard SQL treats DISTINCT as applying to the whole select-list, not just one column or a few columns. The reason for this is that it's ambiguous what value to put in the columns you exclude from the DISTINCT. For the same reason, standard SQL doesn't allow you to have ambiguous columns in a query with GROUP BY.
But PostgreSQL has a nonstandard extension to SQL to allow for what you're asking: DISTINCT ON (expr).
SELECT DISTINCT ON (user_id) user_id, created_at
FROM creations
ORDER BY user_id, created_at
LIMIT 20
You have to include the distinct expression(s) as the leftmost part of your ORDER BY clause.
See the manual on DISTINCT Clause for more information.
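Since the question also wants the final result ordered by date, one possible way to combine the two requirements is to run the DISTINCT ON query as a subquery and sort its output afterwards. This is only a sketch: the alias first_per_user is made up for illustration, and it assumes the earliest created_at per user is the row you want to keep.

```sql
-- Inner query: one row per user_id (the earliest created_at for each user).
-- Outer query: re-sort those rows by date and keep the first 20.
SELECT user_id, created_at
FROM (
    SELECT DISTINCT ON (user_id) user_id, created_at
    FROM creations
    ORDER BY user_id, created_at
) AS first_per_user  -- hypothetical alias
ORDER BY created_at
LIMIT 20;
```

Note that the LIMIT moves to the outer query here, so that it applies after the rows have been re-sorted by date rather than by user_id.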
If you want the most recent created_at for each user then I suggest you aggregate like this:
SELECT user_id, MAX(created_at)
FROM creations
WHERE ....
GROUP BY user_id
ORDER BY MAX(created_at) DESC
This will return the most recent created_at for each user_id. If you only want the top 20, then append LIMIT 20.
EDIT: This is basically the same thing Unreason said above... define from which row you want the data by aggregation.
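If you also need other columns from the same row that holds the chosen created_at, aggregation alone gets awkward. A window function is one alternative; the sketch below assumes "most recent row per user" is the rule you want, and the names ranked and rn are made up for illustration.

```sql
-- Number each user's rows from newest to oldest, keep only the newest one,
-- then list the 20 most recent of those rows.
SELECT user_id, created_at
FROM (
    SELECT c.*,
           ROW_NUMBER() OVER (PARTITION BY user_id
                              ORDER BY created_at DESC) AS rn
    FROM creations AS c
) AS ranked  -- hypothetical alias
WHERE rn = 1
ORDER BY created_at DESC
LIMIT 20;
```

Any other column of creations can be added to the outer select list, because the inner query carries the whole row through. If two rows for the same user share the same created_at, which one gets rn = 1 is arbitrary unless you add a tie-breaker to the window's ORDER BY.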