Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MySQL order by before group by

There are plenty of similar questions to be found on here but I don't think that any answer the question adequately.

I'll continue from the current most popular question and use their example if that's alright.

The task in this instance is to get the latest post for each author in the database.

The example query produces unusable results as its not always the latest post that is returned.

SELECT wp_posts.* FROM wp_posts
    WHERE wp_posts.post_status='publish'
    AND wp_posts.post_type='post'
    GROUP BY wp_posts.post_author           
    ORDER BY wp_posts.post_date DESC

The current accepted answer is

SELECT
    wp_posts.*
FROM wp_posts
WHERE
    wp_posts.post_status='publish'
    AND wp_posts.post_type='post'
GROUP BY wp_posts.post_author
HAVING wp_posts.post_date = MAX(wp_posts.post_date) <- ONLY THE LAST POST FOR EACH AUTHOR
ORDER BY wp_posts.post_date DESC

Unfortunately this answer is plain and simple wrong and in many cases produces less stable results than the orginal query.

My best solution is to use a subquery of the form

SELECT wp_posts.* FROM 
(
    SELECT * 
    FROM wp_posts
    ORDER BY wp_posts.post_date DESC
) AS wp_posts
WHERE wp_posts.post_status='publish'
AND wp_posts.post_type='post'
GROUP BY wp_posts.post_author 

My question is a simple one then: Is there anyway to order rows before grouping without resorting to a subquery?

Edit: This question was a continuation from another question and the specifics of my situation are slightly different. You can (and should) assume that there is also a wp_posts.id that is a unique identifier for that particular post.

like image 557
Rob Forrest Avatar asked Sep 25 '22 17:09

Rob Forrest


People also ask

Can you ORDER BY before GROUP BY?

When combining the Group By and Order By clauses, it is important to bear in mind that, in terms of placement within a SELECT statement: The GROUP BY clause is placed after the WHERE clause. The GROUP BY clause is placed before the ORDER BY clause.

Can I use ORDER BY and GROUP BY together?

As of SQL standarts, in this case ORDER BY must influence only results of GROUP BY , not data before grouping.

Which executes first ORDER BY or GROUP BY?

GROUP BY clause is used with the SELECT statement. In the query, GROUP BY clause is placed after the WHERE clause. In the query, GROUP BY clause is placed before ORDER BY clause if used any.

Can you ORDER BY and GROUP BY in SQL?

Order By and Group By Clause in SQLGroup By in SQL is used to arrange similar data into groups and Order By in SQL is used to sort the data in ascending or descending order.


2 Answers

Using an ORDER BY in a subquery is not the best solution to this problem.

The best solution to get the max(post_date) by author is to use a subquery to return the max date and then join that to your table on both the post_author and the max date.

The solution should be:

SELECT p1.* 
FROM wp_posts p1
INNER JOIN
(
    SELECT max(post_date) MaxPostDate, post_author
    FROM wp_posts
    WHERE post_status='publish'
       AND post_type='post'
    GROUP BY post_author
) p2
  ON p1.post_author = p2.post_author
  AND p1.post_date = p2.MaxPostDate
WHERE p1.post_status='publish'
  AND p1.post_type='post'
order by p1.post_date desc

If you have the following sample data:

CREATE TABLE wp_posts
    (`id` int, `title` varchar(6), `post_date` datetime, `post_author` varchar(3))
;

INSERT INTO wp_posts
    (`id`, `title`, `post_date`, `post_author`)
VALUES
    (1, 'Title1', '2013-01-01 00:00:00', 'Jim'),
    (2, 'Title2', '2013-02-01 00:00:00', 'Jim')
;

The subquery is going to return the max date and author of:

MaxPostDate | Author
2/1/2013    | Jim

Then since you are joining that back to the table, on both values you will return the full details of that post.

See SQL Fiddle with Demo.

To expand on my comments about using a subquery to accurate return this data.

MySQL does not force you to GROUP BY every column that you include in the SELECT list. As a result, if you only GROUP BY one column but return 10 columns in total, there is no guarantee that the other column values which belong to the post_author that is returned. If the column is not in a GROUP BY MySQL chooses what value should be returned.

Using the subquery with the aggregate function will guarantee that the correct author and post is returned every time.

As a side note, while MySQL allows you to use an ORDER BY in a subquery and allows you to apply a GROUP BY to not every column in the SELECT list this behavior is not allowed in other databases including SQL Server.

like image 433
Taryn Avatar answered Oct 26 '22 04:10

Taryn


Your solution makes use of an extension to GROUP BY clause that permits to group by some fields (in this case, just post_author):

GROUP BY wp_posts.post_author

and select nonaggregated columns:

SELECT wp_posts.*

that are not listed in the group by clause, or that are not used in an aggregate function (MIN, MAX, COUNT, etc.).

Correct use of extension to GROUP BY clause

This is useful when all values of non-aggregated columns are equal for every row.

For example, suppose you have a table GardensFlowers (name of the garden, flower that grows in the garden):

INSERT INTO GardensFlowers VALUES
('Central Park',       'Magnolia'),
('Hyde Park',          'Tulip'),
('Gardens By The Bay', 'Peony'),
('Gardens By The Bay', 'Cherry Blossom');

and you want to extract all the flowers that grows in a garden, where multiple flowers grow. Then you have to use a subquery, for example you could use this:

SELECT GardensFlowers.*
FROM   GardensFlowers
WHERE  name IN (SELECT   name
                FROM     GardensFlowers
                GROUP BY name
                HAVING   COUNT(DISTINCT flower)>1);

If you need to extract all the flowers that are the only flowers in the garder instead, you could just change the HAVING condition to HAVING COUNT(DISTINCT flower)=1, but MySql also allows you to use this:

SELECT   GardensFlowers.*
FROM     GardensFlowers
GROUP BY name
HAVING   COUNT(DISTINCT flower)=1;

no subquery, not standard SQL, but simpler.

Incorrect use of extension to GROUP BY clause

But what happens if you SELECT non-aggregated columns that are non equal for every row? Which is the value that MySql chooses for that column?

It looks like MySql always chooses the FIRST value it encounters.

To make sure that the first value it encounters is exactly the value you want, you need to apply a GROUP BY to an ordered query, hence the need to use a subquery. You can't do it otherwise.

Given the assumption that MySql always chooses the first row it encounters, you are correcly sorting the rows before the GROUP BY. But unfortunately, if you read the documentation carefully, you'll notice that this assumption is not true.

When selecting non-aggregated columns that are not always the same, MySql is free to choose any value, so the resulting value that it actually shows is indeterminate.

I see that this trick to get the first value of a non-aggregated column is used a lot, and it usually/almost always works, I use it as well sometimes (at my own risk). But since it's not documented, you can't rely on this behaviour.

This link (thanks ypercube!) GROUP BY trick has been optimized away shows a situation in which the same query returns different results between MySql and MariaDB, probably because of a different optimization engine.

So, if this trick works, it's just a matter of luck.

The accepted answer on the other question looks wrong to me:

HAVING wp_posts.post_date = MAX(wp_posts.post_date)

wp_posts.post_date is a non-aggregated column, and its value will be officially undetermined, but it will likely be the first post_date encountered. But since the GROUP BY trick is applied to an unordered table, it is not sure which is the first post_date encountered.

It will probably returns posts that are the only posts of a single author, but even this is not always certain.

A possible solution

I think that this could be a possible solution:

SELECT wp_posts.*
FROM   wp_posts
WHERE  id IN (
  SELECT max(id)
  FROM wp_posts
  WHERE (post_author, post_date) = (
    SELECT   post_author, max(post_date)
    FROM     wp_posts
    WHERE    wp_posts.post_status='publish'
             AND wp_posts.post_type='post'
    GROUP BY post_author
  ) AND wp_posts.post_status='publish'
    AND wp_posts.post_type='post'
  GROUP BY post_author
)

On the inner query I'm returning the maximum post date for every author. I'm then taking into consideration the fact that the same author could theorically have two posts at the same time, so I'm getting only the maximum ID. And then I'm returning all rows that have those maximum IDs. It could be made faster using joins instead of IN clause.

(If you're sure that ID is only increasing, and if ID1 > ID2 also means that post_date1 > post_date2, then the query could be made much more simple, but I'm not sure if this is the case).

like image 27
fthiella Avatar answered Oct 26 '22 04:10

fthiella