
Retrieving the last record in each group - MySQL

Tags: sql, mysql, group-by, greatest-n-per-group

There is a table messages that contains data as shown below:

Id   Name   Other_Columns
-------------------------
1    A       A_data_1
2    A       A_data_2
3    A       A_data_3
4    B       B_data_1
5    B       B_data_2
6    C       C_data_1

If I run the query select * from messages group by name, I get this result:

1    A       A_data_1
4    B       B_data_1
6    C       C_data_1

What query will return the following result?

3    A       A_data_3
5    B       B_data_2
6    C       C_data_1

That is, the last record in each group should be returned.

At present, this is the query that I use:

SELECT *
FROM (SELECT * FROM messages ORDER BY id DESC) AS x
GROUP BY name

But this looks highly inefficient. Any other ways to achieve the same result?

430

asked Aug 21 '09 17:08

Vijay Dev

2 Answers

MySQL 8.0 now supports window functions, like almost all popular SQL implementations. With this standard syntax, we can write greatest-n-per-group queries:

WITH ranked_messages AS (
  SELECT m.*, ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
  FROM messages AS m
)
SELECT * FROM ranked_messages WHERE rn = 1;
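To try the window-function query against the sample data from the question, a minimal setup might look like the following. The column types are assumptions (the question does not show the table definition), and MySQL 8.0 or later is needed for ROW_NUMBER():

```sql
-- Hypothetical definition: the question only shows the data, not the DDL.
CREATE TABLE messages (
  id            INT PRIMARY KEY,
  name          VARCHAR(10) NOT NULL,
  other_columns VARCHAR(20)
);

INSERT INTO messages (id, name, other_columns) VALUES
  (1, 'A', 'A_data_1'),
  (2, 'A', 'A_data_2'),
  (3, 'A', 'A_data_3'),
  (4, 'B', 'B_data_1'),
  (5, 'B', 'B_data_2'),
  (6, 'C', 'C_data_1');

-- Number the rows of each name from the highest id downwards,
-- then keep only the first row of every group.
WITH ranked_messages AS (
  SELECT m.*, ROW_NUMBER() OVER (PARTITION BY name ORDER BY id DESC) AS rn
  FROM messages AS m
)
SELECT id, name, other_columns
FROM ranked_messages
WHERE rn = 1
ORDER BY id;
-- Returns (3, A, A_data_3), (5, B, B_data_2), (6, C, C_data_1)
```

The explicit ORDER BY id is only there to make the output order predictable; the filter rn = 1 is what selects the last record per group.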

Below is the original answer I wrote for this question in 2009:

I write the solution this way:

SELECT m1.*
FROM messages m1 LEFT JOIN messages m2
  ON (m1.name = m2.name AND m1.id < m2.id)
WHERE m2.id IS NULL;
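The LEFT JOIN acts as an anti-join: m2 matches any row in the same group with a larger id, so m2.id IS NULL keeps only the rows for which no such row exists. The same condition can be spelled out with NOT EXISTS. This is an equivalent formulation offered as a sketch (assuming id is never NULL), not a benchmarked alternative:

```sql
-- Keep a row only if no other row with the same name has a larger id.
SELECT m1.*
FROM messages AS m1
WHERE NOT EXISTS (
  SELECT 1
  FROM messages AS m2
  WHERE m2.name = m1.name
    AND m2.id > m1.id
);
```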

Regarding performance, one solution or the other can be better, depending on the nature of your data. So you should test both queries and use the one that performs better on your database.

For example, I have a copy of the StackOverflow August data dump. I'll use that for benchmarking. There are 1,114,357 rows in the Posts table. This is running on MySQL 5.0.75 on my MacBook Pro 2.40GHz.

I'll write a query to find the most recent post for a given user ID (mine).

First using the technique shown by @Eric with the GROUP BY in a subquery:

SELECT p1.postid
FROM Posts p1
INNER JOIN (SELECT pi.owneruserid, MAX(pi.postid) AS maxpostid
            FROM Posts pi GROUP BY pi.owneruserid) p2
  ON (p1.postid = p2.maxpostid)
WHERE p1.owneruserid = 20860;

1 row in set (1 min 17.89 sec)

Even the EXPLAIN analysis takes over 16 seconds:

+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
| id | select_type | table      | type   | possible_keys              | key         | key_len | ref          | rows    | Extra       |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
|  1 | PRIMARY     | <derived2> | ALL    | NULL                       | NULL        | NULL    | NULL         |   76756 |             |
|  1 | PRIMARY     | p1         | eq_ref | PRIMARY,PostId,OwnerUserId | PRIMARY     | 8       | p2.maxpostid |       1 | Using where |
|  2 | DERIVED     | pi         | index  | NULL                       | OwnerUserId | 8       | NULL         | 1151268 | Using index |
+----+-------------+------------+--------+----------------------------+-------------+---------+--------------+---------+-------------+
3 rows in set (16.09 sec)

Now produce the same query result using my technique with LEFT JOIN:

SELECT p1.postid
FROM Posts p1 LEFT JOIN posts p2
  ON (p1.owneruserid = p2.owneruserid AND p1.postid < p2.postid)
WHERE p2.postid IS NULL AND p1.owneruserid = 20860;

1 row in set (0.28 sec)

The EXPLAIN analysis shows that both tables are able to use their indexes:

+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
| id | select_type | table | type | possible_keys              | key         | key_len | ref   | rows | Extra                                |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
|  1 | SIMPLE      | p1    | ref  | OwnerUserId                | OwnerUserId | 8       | const | 1384 | Using index                          |
|  1 | SIMPLE      | p2    | ref  | PRIMARY,PostId,OwnerUserId | OwnerUserId | 8       | const | 1384 | Using where; Using index; Not exists |
+----+-------------+-------+------+----------------------------+-------------+---------+-------+------+--------------------------------------+
2 rows in set (0.00 sec)
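Both plans above rely on an index that leads with the grouping column. For the messages table from the question, a comparable setup could be sketched as follows. The index name is made up, and the sketch assumes id is the primary key; on InnoDB a secondary index on name alone already carries the primary key, so the explicit second column mainly documents intent:

```sql
-- Hypothetical index: grouping column first, then the ordering column.
ALTER TABLE messages ADD INDEX idx_messages_name_id (name, id);

-- Check whether the optimizer picks the index for both sides of the join.
EXPLAIN
SELECT m1.*
FROM messages AS m1
LEFT JOIN messages AS m2
  ON (m1.name = m2.name AND m1.id < m2.id)
WHERE m2.id IS NULL;
```

Whether the optimizer actually uses the index depends on the data distribution and the MySQL version, so the EXPLAIN output is worth checking on your own table.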

Here's the DDL for my Posts table:

CREATE TABLE `posts` (
  `PostId` bigint(20) unsigned NOT NULL auto_increment,
  `PostTypeId` bigint(20) unsigned NOT NULL,
  `AcceptedAnswerId` bigint(20) unsigned default NULL,
  `ParentId` bigint(20) unsigned default NULL,
  `CreationDate` datetime NOT NULL,
  `Score` int(11) NOT NULL default '0',
  `ViewCount` int(11) NOT NULL default '0',
  `Body` text NOT NULL,
  `OwnerUserId` bigint(20) unsigned NOT NULL,
  `OwnerDisplayName` varchar(40) default NULL,
  `LastEditorUserId` bigint(20) unsigned default NULL,
  `LastEditDate` datetime default NULL,
  `LastActivityDate` datetime default NULL,
  `Title` varchar(250) NOT NULL default '',
  `Tags` varchar(150) NOT NULL default '',
  `AnswerCount` int(11) NOT NULL default '0',
  `CommentCount` int(11) NOT NULL default '0',
  `FavoriteCount` int(11) NOT NULL default '0',
  `ClosedDate` datetime default NULL,
  PRIMARY KEY  (`PostId`),
  UNIQUE KEY `PostId` (`PostId`),
  KEY `PostTypeId` (`PostTypeId`),
  KEY `AcceptedAnswerId` (`AcceptedAnswerId`),
  KEY `OwnerUserId` (`OwnerUserId`),
  KEY `LastEditorUserId` (`LastEditorUserId`),
  KEY `ParentId` (`ParentId`),
  CONSTRAINT `posts_ibfk_1` FOREIGN KEY (`PostTypeId`) REFERENCES `posttypes` (`PostTypeId`)
) ENGINE=InnoDB;

Note to commenters: If you want another benchmark with a different version of MySQL, a different dataset, or different table design, feel free to do it yourself. I have shown the technique above. Stack Overflow is here to show you how to do software development work, not to do all the work for you.

158

answered Oct 14 '22 15:10

Bill Karwin

UPD (2017-03-31): MySQL 5.7.5 enabled the ONLY_FULL_GROUP_BY SQL mode by default, so non-deterministic GROUP BY queries are now rejected. Moreover, the GROUP BY implementation was updated, and the solution might no longer work as expected even with that mode disabled. You need to check.

Bill Karwin's solution above works fine when the item count within groups is rather small, but its performance degrades when the groups are large, since for a group of n rows it requires about n*n/2 + n/2 comparisons just to evaluate the IS NULL condition.

I ran my tests on an InnoDB table of 18684446 rows with 1182 groups. The table, testresults, contains results for functional tests and has (test_id, request_id) as the primary key. Thus, test_id is a group and I was searching for the last request_id for each test_id.

Bill's solution has already been running for several hours on my Dell E4310 and I do not know when it is going to finish, even though it operates on a covering index (hence Using index in EXPLAIN).
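A back-of-the-envelope estimate shows why the run time explodes on this table. The calculation below plugs the row and group counts into n*n/2 + n/2, under the simplifying assumption that all groups are the same size (real data is likely skewed, which would make the total larger):

```sql
-- Assumption: the 18684446 rows are spread evenly over the 1182 groups.
SELECT
  18684446 DIV 1182                                            AS rows_per_group,
  (18684446 DIV 1182) * (18684446 DIV 1182 + 1) DIV 2          AS comparisons_per_group,
  1182 * (18684446 DIV 1182) * (18684446 DIV 1182 + 1) DIV 2   AS comparisons_total;
```

With roughly 15807 rows per group this comes to about 1.25 × 10^8 comparisons per group and about 1.5 × 10^11 in total.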

I have a couple of other solutions that are based on the same ideas:

- if the underlying index is a BTREE index (which is usually the case), the largest (group_id, item_value) pair is the last value within each group_id, which is the first one for each group_id if we walk through the index in descending order;
- if we read the values which are covered by an index, the values are read in the order of the index;
- each index implicitly contains the primary key columns appended to it (that is, the primary key is part of the covering index). In the solutions below I operate directly on the primary key; in your case, you will just need to add the primary key columns to the result;
- in many cases it is much cheaper to collect the required row ids in the required order in a subquery and join the result of the subquery on the id. Since MySQL needs a single fetch by primary key for each row in the subquery result, the subquery will be put first in the join and the rows will be output in the order of the ids in the subquery (if we omit an explicit ORDER BY for the join).

The article "3 ways MySQL uses indexes" is a great way to understand some of the details.

Solution 1

This one is incredibly fast; it takes about 0.8 seconds on my 18M+ rows:

SELECT test_id, MAX(request_id) AS request_id
FROM testresults
GROUP BY test_id DESC;

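Note that the GROUP BY ... DESC syntax was deprecated in MySQL 5.7 and removed in MySQL 8.0; on newer versions, use a plain GROUP BY test_id with an explicit ORDER BY test_id DESC instead.

If you want to change the order to ASC, put it in a subquery, return the ids only and use that as the subquery to join to the rest of the columns: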

SELECT test_id, request_id
FROM (
    SELECT test_id, MAX(request_id) AS request_id
    FROM testresults
    GROUP BY test_id DESC) as ids
ORDER BY test_id;

This one takes about 1.2 seconds on my data.
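Joining the collected ids back to the table to pick up the remaining columns might look like this for the messages table from the question. This is a sketch that assumes id is the primary key; it uses a plain GROUP BY with an explicit ORDER BY so that it does not depend on any ordering side effect of GROUP BY:

```sql
-- Step 1 (subquery): find the largest id per name using only the index.
-- Step 2 (join): fetch the full row for each of those ids by primary key.
SELECT m.*
FROM (
  SELECT name, MAX(id) AS id
  FROM messages
  GROUP BY name
) AS ids
JOIN messages AS m ON m.id = ids.id
ORDER BY m.name;
-- With the sample data from the question this returns rows 3, 5 and 6.
```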

Solution 2

Here is another solution that takes about 19 seconds for my table:

SELECT test_id, request_id
FROM testresults, (SELECT @group:=NULL) as init
WHERE IF(IFNULL(@group, -1)=@group:=test_id, 0, 1)
ORDER BY test_id DESC, request_id DESC

It returns tests in descending order as well. It is much slower since it does a full index scan, but it is here to give you an idea of how to output the N max rows for each group.
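On MySQL 8.0 and later, the N-largest-rows-per-group idea can also be expressed without user variables, whose assignment inside expressions is deprecated there. The following sketch swaps in ROW_NUMBER() for the variable technique above; the limit of 3 is an arbitrary value chosen for illustration:

```sql
-- Rank the requests of every test from the largest request_id downwards
-- and keep the top 3 of each group (3 is an illustrative choice of N).
SELECT test_id, request_id
FROM (
  SELECT test_id,
         request_id,
         ROW_NUMBER() OVER (PARTITION BY test_id ORDER BY request_id DESC) AS rn
  FROM testresults
) AS ranked
WHERE rn <= 3
ORDER BY test_id DESC, request_id DESC;
```

No timing is claimed for this variant; it has not been measured against the table described above.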

The disadvantage of the query is that its result cannot be cached by the query cache.

answered Oct 14 '22 16:10

newtover
