Get records with highest/smallest <whatever> per group

How to do that?

The former title of this question was "using rank (@Rank := @Rank + 1) in complex query with subqueries - will it work?" because I was looking for a solution using ranks, but now I see that the solution posted by Bill is much better.

Original question:

I'm trying to compose a query that takes the last record from each group, given some defined order:

SET @Rank=0;

select s.*
from (select GroupId, max(Rank) AS MaxRank
      from (select GroupId, @Rank := @Rank + 1 AS Rank
            from Table
            order by OrderField
            ) as t
      group by GroupId) as t
  join (
      select *, @Rank := @Rank + 1 AS Rank
      from Table
      order by OrderField
      ) as s
  on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField

The expression @Rank := @Rank + 1 is normally used for ranking, but it looks suspicious to me when it is used in two subqueries and initialized only once. Will it work this way?

Second, will it work with one subquery that is evaluated multiple times, such as a subquery in a WHERE (or HAVING) clause? Here is another way to write the above:

SET @Rank=0;

select Table.*, @Rank := @Rank + 1 AS Rank
from Table
having Rank = (select max(Rank) AS MaxRank
               from (select GroupId, @Rank := @Rank + 1 AS Rank
                     from Table as t0
                     order by OrderField
                     ) as t
               where t.GroupId = table.GroupId
              )
order by OrderField

Thanks in advance!

495

asked Jan 05 '12 20:01

Tomas

1 Answer

So you want to get the row with the highest OrderField per group? I'd do it this way:

SELECT t1.*
FROM `Table` AS t1
LEFT OUTER JOIN `Table` AS t2
  ON t1.GroupId = t2.GroupId AND t1.OrderField < t2.OrderField
WHERE t2.GroupId IS NULL
ORDER BY t1.OrderField; -- not needed! (note by Tomas)

(EDIT by Tomas: If there are more records with the same OrderField within the same group and you need exactly one of them, you may want to extend the condition:

SELECT t1.*
FROM `Table` AS t1
LEFT OUTER JOIN `Table` AS t2
  ON t1.GroupId = t2.GroupId
     AND (t1.OrderField < t2.OrderField
          OR (t1.OrderField = t2.OrderField AND t1.Id < t2.Id))
WHERE t2.GroupId IS NULL

end of edit.)

In other words, return the row t1 for which no other row t2 exists with the same GroupId and a greater OrderField. When t2.* is NULL, it means the left outer join found no such match, and therefore t1 has the greatest value of OrderField in the group.
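The title also asks about the smallest value per group. The same reasoning applies with the comparison flipped: keep the row t1 for which no row t2 exists with the same GroupId and a smaller OrderField. A minimal sketch, assuming the same `Table`, GroupId, and OrderField names used above:

```sql
-- Smallest OrderField per group: same anti-join, comparison reversed.
SELECT t1.*
FROM `Table` AS t1
LEFT OUTER JOIN `Table` AS t2
  ON t1.GroupId = t2.GroupId
 AND t1.OrderField > t2.OrderField   -- t2 would be a strictly smaller row
WHERE t2.GroupId IS NULL;            -- no smaller row found, so t1 is the minimum
```

As with the maximum, ties on OrderField return every tied row unless a tie-breaker column is added to the join condition.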

No ranks, no subqueries. This should run fast and optimize access to t2 with "Using index" if you have a compound index on (GroupId, OrderField).

Regarding performance, see my answer to "Retrieving the last record in each group". I tried a subquery method and the join method using the Stack Overflow data dump. The difference is remarkable: the join method ran 278 times faster in my test.

It's important that you have the right index to get the best results!
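For illustration, the compound index could be created and checked roughly like this. The index name idx_group_order is hypothetical, chosen only for this sketch; what matters is the column order (GroupId first, then OrderField):

```sql
-- Hypothetical index name; any name works.
ALTER TABLE `Table`
  ADD INDEX idx_group_order (GroupId, OrderField);

-- Inspect the plan; ideally the row for t2 reports "Using index" in Extra.
EXPLAIN
SELECT t1.*
FROM `Table` AS t1
LEFT OUTER JOIN `Table` AS t2
  ON t1.GroupId = t2.GroupId AND t1.OrderField < t2.OrderField
WHERE t2.GroupId IS NULL;
```

The exact plan depends on the MySQL version and on the data distribution, so it is worth running EXPLAIN against your own table rather than relying on this sketch.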

Regarding your method using the @Rank variable, it won't work as you've written it, because the values of @Rank won't reset to zero after the query has processed the first table. I'll show you an example.

I inserted some dummy data, with an extra field that is null except on the row we know is the greatest per group:

select * from `Table`;

+---------+------------+------+
| GroupId | OrderField | foo  |
+---------+------------+------+
|      10 |         10 | NULL |
|      10 |         20 | NULL |
|      10 |         30 | foo  |
|      20 |         40 | NULL |
|      20 |         50 | NULL |
|      20 |         60 | foo  |
+---------+------------+------+
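To follow along, a table like this can be set up with a short script. The column types below are assumptions, since the answer shows only the data; any integer type for GroupId and OrderField and a short string type for foo should behave the same way:

```sql
-- Column types are assumed for this sketch; only the values come from the answer.
CREATE TABLE `Table` (
  GroupId    INT,
  OrderField INT,
  foo        VARCHAR(10)
);

INSERT INTO `Table` (GroupId, OrderField, foo) VALUES
  (10, 10, NULL),
  (10, 20, NULL),
  (10, 30, 'foo'),   -- greatest OrderField in group 10
  (20, 40, NULL),
  (20, 50, NULL),
  (20, 60, 'foo');   -- greatest OrderField in group 20

-- The ranking queries below expect the variable to start at zero.
SET @Rank = 0;
```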

We can show that the rank increases to three for the first group and six for the second group, and the inner query returns these correctly:

select GroupId, max(Rank) AS MaxRank
from (
  select GroupId, @Rank := @Rank + 1 AS Rank
  from `Table`
  order by OrderField) as t
group by GroupId

+---------+---------+
| GroupId | MaxRank |
+---------+---------+
|      10 |       3 |
|      20 |       6 |
+---------+---------+

Now run the query with the join condition commented out, which forces a Cartesian product of all rows, and fetch all columns from both derived tables:

select t.*, s.*
from (select GroupId, max(Rank) AS MaxRank
      from (select GroupId, @Rank := @Rank + 1 AS Rank
            from `Table`
            order by OrderField
            ) as t
      group by GroupId) as t
  join (
      select *, @Rank := @Rank + 1 AS Rank
      from `Table`
      order by OrderField
      ) as s
  -- on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField;

+---------+---------+---------+------------+------+------+
| GroupId | MaxRank | GroupId | OrderField | foo  | Rank |
+---------+---------+---------+------------+------+------+
|      10 |       3 |      10 |         10 | NULL |    7 |
|      20 |       6 |      10 |         10 | NULL |    7 |
|      10 |       3 |      10 |         20 | NULL |    8 |
|      20 |       6 |      10 |         20 | NULL |    8 |
|      20 |       6 |      10 |         30 | foo  |    9 |
|      10 |       3 |      10 |         30 | foo  |    9 |
|      10 |       3 |      20 |         40 | NULL |   10 |
|      20 |       6 |      20 |         40 | NULL |   10 |
|      10 |       3 |      20 |         50 | NULL |   11 |
|      20 |       6 |      20 |         50 | NULL |   11 |
|      20 |       6 |      20 |         60 | foo  |   12 |
|      10 |       3 |      20 |         60 | foo  |   12 |
+---------+---------+---------+------------+------+------+

We can see from the above that the max rank per group is correct, but @Rank then continues to increase as the query processes the second derived table, to 7 and higher. So the ranks from the second derived table will never overlap with the ranks from the first derived table at all.

You'd have to add another derived table to force @Rank to reset to zero in between processing the two tables (and hope the optimizer doesn't change the order in which it evaluates tables, or else use STRAIGHT_JOIN to prevent that):

select s.*
from (select GroupId, max(Rank) AS MaxRank
      from (select GroupId, @Rank := @Rank + 1 AS Rank
            from `Table`
            order by OrderField
            ) as t
      group by GroupId) as t
  join (select @Rank := 0) r -- RESET @Rank TO ZERO HERE
  join (
      select *, @Rank := @Rank + 1 AS Rank
      from `Table`
      order by OrderField
      ) as s
  on t.GroupId = s.GroupId and t.MaxRank = s.Rank
order by OrderField;

+---------+------------+------+------+
| GroupId | OrderField | foo  | Rank |
+---------+------------+------+------+
|      10 |         30 | foo  |    3 |
|      20 |         60 | foo  |    6 |
+---------+------------+------+------+

But the optimization of this query is terrible. It can't use any indexes, it creates two temporary tables, sorts them the hard way, and even uses a join buffer because it can't use an index when joining temp tables either. This is example output from EXPLAIN:

+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+
| id | select_type | table      | type   | possible_keys | key  | key_len | ref  | rows | Extra                           |
+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+
|  1 | PRIMARY     | <derived4> | system | NULL          | NULL | NULL    | NULL |    1 | Using temporary; Using filesort |
|  1 | PRIMARY     | <derived2> | ALL    | NULL          | NULL | NULL    | NULL |    2 |                                 |
|  1 | PRIMARY     | <derived5> | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using where; Using join buffer  |
|  5 | DERIVED     | Table      | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using filesort                  |
|  4 | DERIVED     | NULL       | NULL   | NULL          | NULL | NULL    | NULL | NULL | No tables used                  |
|  2 | DERIVED     | <derived3> | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using temporary; Using filesort |
|  3 | DERIVED     | Table      | ALL    | NULL          | NULL | NULL    | NULL |    6 | Using filesort                  |
+----+-------------+------------+--------+---------------+------+---------+------+------+---------------------------------+

In contrast, my solution using the left outer join optimizes much better. It uses no temp table and even reports "Using index" for t2, which means it can resolve the join using only the index, without touching the data.

+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key     | key_len | ref             | rows | Extra                    |
+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+
|  1 | SIMPLE      | t1    | ALL  | NULL          | NULL    | NULL    | NULL            |    6 | Using filesort           |
|  1 | SIMPLE      | t2    | ref  | GroupId       | GroupId | 5       | test.t1.GroupId |    1 | Using where; Using index |
+----+-------------+-------+------+---------------+---------+---------+-----------------+------+--------------------------+

You'll probably read people making claims on their blogs that "joins make SQL slow," but that's nonsense. Poor optimization makes SQL slow.
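As a side note that goes beyond the tests above: on MySQL 8.0 or later, window functions provide the per-group ranking the question was originally after, without user variables. The following is only a sketch under that version assumption, and the alias rn and the derived-table name ranked are made up for illustration. It also avoids the name Rank, since RANK is a reserved word in MySQL 8.0.

```sql
-- Assumes MySQL 8.0+ (window functions are not available in older versions).
SELECT GroupId, OrderField, foo
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (PARTITION BY GroupId
                            ORDER BY OrderField DESC) AS rn  -- 1 = greatest in its group
  FROM `Table` AS t
) AS ranked
WHERE rn = 1
ORDER BY OrderField;
```

Unlike the join, ROW_NUMBER() returns exactly one row per group even when OrderField has ties, although which tied row wins is arbitrary unless the ORDER BY inside OVER includes a tie-breaker. How its performance compares with the join was not measured here.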

151

answered Oct 28 '22 08:10

Bill Karwin

Tags: mysql, greatest-n-per-group, subquery, rank
