SQL select distinct but "keep first"?

Tags:

sql

mysql

According to another SO post (SQL: How to keep rows order with DISTINCT?), DISTINCT makes no guarantee about the order of the rows it returns.

I have a query:

select col_1 from table order by col_2

This can return values like

3
5
3
2

I need to then select a distinct on these that preserves ordering, meaning I want

select distinct(col_1) from table order by col_2 

to return

3
5
2

but not

5
3
2

Here is what I am actually trying to do. Col_1 is a user id, and col_2 is a log in timestamp event by that user. So the same user (col_1) can have many login times. I am trying to build a historical list of users in which they were seen in the system. I would like to be able to say "our first user ever was, our second user ever was", and so on.

That post seems to suggest using a GROUP BY, but GROUP BY is not meant to return an ordering of rows, so I do not see how or why it would be applicable here, since it does not appear that GROUP BY will preserve any ordering. In fact, another SO post gives an example where GROUP BY destroys the ordering I am looking for: see "Peter" in what is the difference between GROUP BY and ORDER BY in sql.

Is there any way to guarantee the latter result? The strange thing is, if I were implementing the DISTINCT clause, I would surely do the ORDER BY first, then take the results and do a linear scan of the list, preserving the ordering naturally, so I am not sure why the behavior is so undefined.

EDIT:

Thank you all! I have accepted IMSoP's answer because not only was there an interactive example that I could play around with (thanks for turning me on to SQL Fiddle), but they also explained why several things work the way they do, instead of simply saying "do this". Specifically, it was unclear to me that GROUP BY does not destroy the values in the columns outside of the GROUP BY (rather, it keeps them in some sort of internal list), and that these values can still be examined in an ORDER BY clause.

Tommy asked Oct 16 '13 21:10


3 Answers

This all has to do with the "logical ordering" of SQL statements. Although a DBMS might actually retrieve the data according to all sorts of clever strategies, it has to behave according to some predictable logic. As such, the different parts of an SQL query can be considered to be processed "before" or "after" one another in terms of how that logic behaves.

As it happens, the ORDER BY clause is the very last step in that logical sequence, so it can't change the behaviour of "earlier" steps.

If you use a GROUP BY, the rows have been bundled up into their groups by the time the SELECT clause is run, let alone the ORDER BY, so you can only look at columns which have been grouped by, or "aggregate" values calculated across all the values in a group. (MySQL implements a controversial extension to GROUP BY where you can mention a column in the SELECT that can't logically be there, and it will pick one from an arbitrary row in that group).

If you use a DISTINCT, it is logically processed after the SELECT, but the ORDER BY still comes afterwards. So only once the DISTINCT has thrown away the duplicates will the remaining results be put into a particular order - but the rows that have been thrown away can't be used to determine that order.


As for how to get the result you need, the key is to find a value to sort by which is valid after the GROUP BY/DISTINCT has (logically) been run. Remember that if you use a GROUP BY, any aggregated values are still valid - an aggregate function can look at all the values in a group. This includes MIN() and MAX(), which are ideal for ordering by, because "the lowest number" (MIN) is the same thing as "the first number if I sort them in ascending order", and vice versa for MAX.

So to order a set of distinct foo_number values based on the lowest applicable bar_number for each, you could use this:

SELECT foo_number
FROM some_table
GROUP BY foo_number
ORDER BY MIN(bar_number) ASC

Here's a live demo with some arbitrary data.
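For anyone who wants to try this without a database server, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in for MySQL here, and the table name logins and the sample rows are assumptions chosen to match the values in the question (3, 5, 3, 2 when sorted by col_2).

```python
import sqlite3

# Hypothetical data matching the question: col_1 is a user id,
# col_2 is a login timestamp (small integers here for readability).
rows = [(3, 1), (5, 2), (3, 3), (2, 4)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (col_1 INTEGER, col_2 INTEGER)")
conn.executemany("INSERT INTO logins VALUES (?, ?)", rows)


def distinct_users(aggregate):
    """Return distinct col_1 values ordered by MIN or MAX of col_2."""
    # The aggregate decides which login "represents" each user.
    sql = (
        "SELECT col_1 FROM logins "
        f"GROUP BY col_1 ORDER BY {aggregate}(col_2) ASC"
    )
    return [r[0] for r in conn.execute(sql)]


first_seen = distinct_users("MIN")  # order by each user's earliest login
last_seen = distinct_users("MAX")   # order by each user's latest login
print(first_seen)
print(last_seen)
```

With this data, ordering by MIN(col_2) gives 3, 5, 2 (the "keep first" result the question asks for), while ordering by MAX(col_2) gives 5, 3, 2 (the result the question wants to avoid).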


EDIT: In the comments, it was discussed why, if an ordering is applied before the grouping / de-duplication takes place, that order is not applied to the groups. If that were the case, you would still need a strategy for which row was kept in each group: the first, or the last.

As an analogy, picture the original set of rows as a set of playing cards picked from a deck, and then sorted by their face value, low to high. Now go through the sorted deck and deal them into a separate pile for each suit. Which card should "represent" each pile?

If you deal the cards face up, the cards showing at the end will be the ones with the highest face value (a "keep last" strategy); if you deal them face down and then flip each pile, you will reveal the lowest face value (a "keep first" strategy). Both are obeying the original order of the cards, and the instruction to "deal the cards based on suit" doesn't automatically tell the dealer (who represents the DBMS) which strategy was intended.

If the final piles of cards are the groups from a GROUP BY, then MIN() and MAX() represent picking up each pile and looking for the lowest or highest value, regardless of the order they are in. But because you can look inside the groups, you can do other things too, like adding up the total value of each pile (SUM) or how many cards there are (COUNT) etc, making GROUP BY much more powerful than an "ordered DISTINCT" could be.
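To make the "look inside each pile" idea concrete, here is a small sketch, again using Python's sqlite3 module as a stand-in for MySQL. The table name logins, the column aliases and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (col_1 INTEGER, col_2 INTEGER)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    [(3, 1), (5, 2), (3, 3), (2, 4)],  # (user id, login time)
)

# One row per user: every aggregate looks at all of that user's logins,
# regardless of the order in which the rows were stored or read.
summary = conn.execute(
    """
    SELECT col_1,
           MIN(col_2) AS first_login,
           MAX(col_2) AS last_login,
           COUNT(*)   AS login_count
    FROM logins
    GROUP BY col_1
    ORDER BY MIN(col_2)
    """
).fetchall()

for user_id, first_login, last_login, login_count in summary:
    print(user_id, first_login, last_login, login_count)
```

User 3 comes out first with two logins (times 1 and 3), followed by user 5 and user 2 with one login each, which is more than an "ordered DISTINCT" could report.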

IMSoP answered Sep 18 '22 15:09


I would go for something like

select col1
from (
select col1,
       rank () over(order by col2) pos
from table
)
group by col1
order by min(pos)

In the subquery I calculate the position, then in the main query I do a group by on col1, using the smallest position to order.

Here is the demo in SQLFiddle (this was for Oracle; the MySQL version was added later).
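The ranked-subquery approach can also be tried locally. The sketch below runs it through Python's sqlite3 module, which is an assumption on my part rather than the Oracle setup used in the demo; it needs an SQLite build with window-function support (3.25 or newer). The table name logins and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logins (col1 INTEGER, col2 INTEGER)")
conn.executemany(
    "INSERT INTO logins VALUES (?, ?)",
    [(3, 1), (5, 2), (3, 3), (2, 4)],
)

sql = """
SELECT col1
FROM (
    -- Inner query: give every row its position when sorted by col2.
    SELECT col1, RANK() OVER (ORDER BY col2) AS pos
    FROM logins
) AS ranked
GROUP BY col1
ORDER BY MIN(pos)  -- Outer query: order each col1 by its earliest position.
"""
ordered = [r[0] for r in conn.execute(sql)]
print(ordered)
```

Each col1 value is ordered by the smallest position it ever reached, so the result is 3, 5, 2 for this data.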

Edit for MySql:

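select col1
from (
  -- number the rows in col2 order; note that MySQL does not guarantee the
  -- evaluation order of user variables, so verify this on your version
  select col1,
         @curRank := @curRank + 1 AS pos
  from table1, (select @curRank := 0) p
  order by col2
) sub
group by col1
order by min(pos)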

And here is the demo for MySQL.

mucio answered Sep 18 '22 15:09


The GROUP BY in the referenced answer isn't attempting to perform an ordering... it is simply picking a single associated value for the column that we want to be distinct.

As @bluefeet states, if you want a guaranteed ordering, you must use ORDER BY.

Why can't we specify a value in the ORDER BY that isn't included in the SELECT DISTINCT?

Consider the following values for col_1 and col_2:

create table yourTable (
  col_1 int,
  col_2 int
);

insert into yourTable (col_1, col_2) values (1, 1);
insert into yourTable (col_1, col_2) values (1, 3);
insert into yourTable (col_1, col_2) values (2, 2);
insert into yourTable (col_1, col_2) values (2, 4);

With this data, what should SELECT DISTINCT col_1 FROM yourTable ORDER BY col_2 return?

That's why you need the GROUP BY and the aggregate function, to decide which of the multiple values for col_2 you should order by... could be MIN(), could be MAX(), maybe even some other function such as AVG() would make sense in some cases; it all depends on the specific scenario, which is why you need to be explicit:

select col_1
from yourTable
group by col_1
order by min(col_2)

SQL Fiddle Here
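To see why the question about SELECT DISTINCT col_1 ... ORDER BY col_2 has no single answer, the following plain Python sketch (no database involved; the variable names are mine) tries every way of picking one representative col_2 per col_1 from the sample rows and records the ordering each choice produces.

```python
from itertools import product

# The sample rows from the CREATE TABLE / INSERT statements above.
rows = [(1, 1), (1, 3), (2, 2), (2, 4)]

# Collect the col_2 values available to each distinct col_1.
candidates = {}
for col_1, col_2 in rows:
    candidates.setdefault(col_1, []).append(col_2)

keys = list(candidates)  # the distinct col_1 values, in insertion order
orders = set()
for choice in product(*(candidates[k] for k in keys)):
    picked = dict(zip(keys, choice))  # one representative col_2 per col_1
    orders.add(tuple(sorted(keys, key=picked.get)))

print(sorted(orders))
```

Both (1, 2) and (2, 1) turn up: if col_1 = 1 is represented by col_2 = 3 and col_1 = 2 by col_2 = 2, then 2 sorts first. An explicit aggregate such as MIN() or MAX() removes that freedom.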

Michael Fredrickson answered Sep 21 '22 15:09