Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can query be optimized: Get a records max date then join the max date's values

Tags:

database

mysql

I've created a query that returns the results I want but I feel there must be a better way to do this. Any guidance would be appreciated.

I am trying to get all items for a specific meeting and join their max meeting date < X and join the max date's committee acronym. X is the current meeting date.

I've tried a few different queries but none, other than the one below, returned the expected results all the time.

You can see this query in action by going to rextester.

DROP TABLE IF EXISTS `committees`;
CREATE TABLE committees
    (`id` int, `acronym` varchar(4))
;

INSERT INTO committees
    (`id`, `acronym`)
VALUES
    (1, 'Com1'),
    (2, 'Com2'),
    (3, 'Com3')
;

DROP TABLE IF EXISTS `meetings`;
CREATE TABLE meetings
    (`id` int, `date` datetime, `committee_id` int)
;

INSERT INTO meetings
    (`id`, `date`, `committee_id`)
VALUES
    (1, '2017-01-01 00:00:00', 1),
    (2, '2017-02-02 00:00:00', 2),
    (3, '2017-03-03 00:00:00', 2)
;

DROP TABLE IF EXISTS `agenda_items`;
CREATE TABLE agenda_items
    (`id` int, `name` varchar(6))
;

INSERT INTO agenda_items
    (`id`, `name`)
VALUES
    (1, 'Item 1'),
    (2, 'Item 2'),
    (3, 'Item 3')
;

DROP TABLE IF EXISTS `join_agenda_items_meetings`;
CREATE TABLE join_agenda_items_meetings
    (`id` int, `agenda_item_id` int, `meeting_id` int)
;

INSERT INTO join_agenda_items_meetings
    (`id`, `agenda_item_id`, `meeting_id`)
VALUES
    (1, 1, 1),
    (2, 1, 2),
    (3, 2, 1),
    (4, 3, 2),
    (5, 2, 1),
    (6, 1, 3)
;




SELECT agenda_items.id, 
       meetings.id, 
       meetings.date, 
       sub_one.max_date, 
       sub_two.acronym 
FROM   agenda_items 
       LEFT JOIN (SELECT ai.id                AS ai_id, 
                         me.id                AS me_id, 
                         Max(me.date) AS max_date 
                  FROM   agenda_items AS ai 
                         JOIN join_agenda_items_meetings AS jaim 
                           ON jaim.agenda_item_id = ai.id 
                         JOIN meetings AS me 
                           ON me.id = jaim.meeting_id 
                  WHERE  me.date < '2017-02-02' 
                  GROUP  BY ai_id) sub_one 
              ON sub_one.ai_id = agenda_items.id 
       LEFT JOIN (SELECT agenda_items.id       AS age_id, 
                         meetings.date AS meet_date, 
                         committees.acronym    AS acronym 
                  FROM   agenda_items 
                         JOIN join_agenda_items_meetings 
                           ON join_agenda_items_meetings.agenda_item_id = agenda_items.id 
                         JOIN meetings 
                           ON meetings.id = join_agenda_items_meetings.meeting_id 
                         JOIN committees 
                           ON committees.id = meetings.committee_id 
                  WHERE  meetings.date) sub_two 
              ON sub_two.age_id = agenda_items.id 
                 AND sub_one.max_date = sub_two.meet_date 
       JOIN join_agenda_items_meetings 
         ON agenda_items.id = join_agenda_items_meetings.agenda_item_id 
       JOIN meetings 
         ON meetings.id = join_agenda_items_meetings.meeting_id 
WHERE  meetings.id = 2;

REVIEW / TESTING OF ANSWERS (REVISED):*

I've revised the testing based on the comments made.

Since I put a bounty on this question I felt I should show how I'm evaluating the answers and give some feedback. Overall I'm very grateful to all how have helped out, thank you.

For testing, I reviewed the queries against:

  • the initial rextester
  • a modified version of the initial rextester with all 4 queries for 2 separate datasets
  • a larger data set from my actual database

My Original Query with EXPLAIN

+----+-------------+---------------------------+------+----------------------------------------------+
| id | select_type | table                     | rows | Extra                                        |
+----+-------------+---------------------------+------+----------------------------------------------+
|  1 | PRIMARY     | meetings                  |    1 |                                              |
|  1 | PRIMARY     | join_agenda_item_meetings | 1976 | Using where; Using index                     |
|  1 | PRIMARY     | agenda_items              |    1 | Using index                                  |
|  1 | PRIMARY     | <derived2>                | 1087 |                                              |
|  1 | PRIMARY     | <derived3>                | 2202 |                                              |
|  3 | DERIVED     | join_agenda_item_meetings | 1976 | Using index                                  |
|  3 | DERIVED     | meetings                  |    1 | Using where                                  |
|  3 | DERIVED     | committees                |    1 |                                              |
|  3 | DERIVED     | agenda_items              |    1 | Using index                                  |
|  2 | DERIVED     | jaim                      | 1976 | Using index; Using temporary; Using filesort |
|  2 | DERIVED     | me                        |    1 | Using where                                  |
|  2 | DERIVED     | ai                        |    1 | Using index                                  |
+----+-------------+---------------------------+------+----------------------------------------------+
12 rows in set (0.02 sec)

Paul Spiegel's answers.

The initial answer works and seems to be the most efficient option presented, much more than mine.

Paul Spiegel's first query pulls the fewest rows, is shorter and more readable than mine. It also doesn't need to reference a date which will be nicer when writing it as well.

+----+--------------------+-------+------+--------------------------+
| id | select_type        | table | rows | Extra                    |
+----+--------------------+-------+------+--------------------------+
|  1 | PRIMARY            | m1    |    1 |                          |
|  1 | PRIMARY            | am1   | 1976 | Using where; Using index |
|  1 | PRIMARY            | am2   |    1 | Using index              |
|  1 | PRIMARY            | m2    |    1 |                          |
|  2 | DEPENDENT SUBQUERY | am3   |    1 | Using index              |
|  2 | DEPENDENT SUBQUERY | m3    |    1 | Using where              |
|  2 | DEPENDENT SUBQUERY | c3    |    1 | Using where              |
+----+--------------------+-------+------+--------------------------+
7 rows in set (0.00 sec)

This query also returns the correct results when adding DISTINCT to the select statement. This query does not perform as well as the first though (but it is close).

+----+-------------+------------++------+-------------------------+
| id | select_type | table      | rows | Extra                    |
+----+-------------+------------++------+-------------------------+
|  1 | PRIMARY     | <derived2> |    5 | Using temporary          |
|  1 | PRIMARY     | am         |    1 | Using index              |
|  1 | PRIMARY     | m          |    1 |                          |
|  1 | PRIMARY     | c          |    1 | Using where              |
|  2 | DERIVED     | m1         |    1 |                          |
|  2 | DERIVED     | am1        | 1787 | Using where; Using index |
|  2 | DERIVED     | am2        |    1 | Using index              |
|  2 | DERIVED     | m2         |    1 |                          |
+----+-------------+------------+------+--------------------------+
8 rows in set (0.00 sec)

Stefano Zanini's answer

This query does return the expected results using DISTINCT. When using EXPLAIN and the number of rows being pulled this query is more efficient when compared to my original one but Paul Spiegel's is just a bit better.

+----+-------------+------------+------+---------------------------------+
| id | select_type | table      | rows | Extra                           |
+----+-------------+------------+------+---------------------------------+
|  1 | PRIMARY     | me         |    1 | Using temporary; Using filesort |
|  1 | PRIMARY     | rel        | 1787 | Using where; Using index        |
|  1 | PRIMARY     | <derived2> | 1087 |                                 |
|  1 | PRIMARY     | rel2       |    1 | Using index                     |
|  1 | PRIMARY     | me2        |    1 | Using where                     |
|  1 | PRIMARY     | co         |    1 |                                 |
|  2 | DERIVED     | t1         | 1787 | Using index                     |
|  2 | DERIVED     | t2         |    1 | Using where                     |
+----+-------------+------------+------+---------------------------------+
8 rows in set (0.00 sec)

EoinS' answer

As noted in the comments, this answer works if meetings are sequential, but they may not be unfortunately.

like image 577
user3366016 Avatar asked Mar 02 '17 19:03

user3366016


3 Answers

This one is a bit crazy.. Let's do it step by step:

The first step is a basic join

set @meeting_id = 2;

select am1.meeting_id,
       am1.agenda_item_id,
       m1.date as meeting_date
from meetings m1
join join_agenda_items_meetings am1 on am1.meeting_id = m1.id
where m1.id = @meeting_id;

We select the meeting (id = 2) and the corresponding agenda_item_ids. This will already return the rows we need with the first three columns.

Next step is to get the last meeting date for every agenda item. We need to join the first query with the join table and corresponding meetings (except of the one with id = 2 - am2.meeting_id <> am1.meeting_id). We only want meetings with a date before the actual meeting (m2.date < m1.date). From all those meetings we only want the latest date each agenda item. So we group by the agenda item and select max(m2.date):

select am1.meeting_id,
       am1.agenda_item_id,
       m1.date as meeting_date,
       max(m2.date) as max_date
from meetings m1
join join_agenda_items_meetings am1 on am1.meeting_id = m1.id
left join join_agenda_items_meetings am2 
    on  am2.agenda_item_id = am1.agenda_item_id
    and am2.meeting_id <> am1.meeting_id
left join meetings m2 
    on  m2.id = am2.meeting_id
    and m2.date < m1.date
where m1.id = @meeting_id
group by m1.id, am1.agenda_item_id;

This way we get the fourth column (max_date).

Last step is to select the acronym of the meeting with the last date (max_date). And this is the crazy part - We can use a correlated subquery in the SELECT clause. And we can use max(m2.date) for the correlation:

select c3.acronym
from meetings m3
join join_agenda_items_meetings am3 on am3.meeting_id = m3.id
join committees c3 on c3.id = m3.committee_id
where am3.agenda_item_id = am2.agenda_item_id
  and m3.date = max(m2.date)

The final query would be:

select am1.meeting_id,
       am1.agenda_item_id,
       m1.date as meeting_date,
       max(m2.date) as max_date,
       (   select c3.acronym
           from meetings m3
           join join_agenda_items_meetings am3 on am3.meeting_id = m3.id
           join committees c3 on c3.id = m3.committee_id
           where am3.agenda_item_id = am2.agenda_item_id
             and m3.date = max(m2.date)
       ) as acronym
from meetings m1
join join_agenda_items_meetings am1 on am1.meeting_id = m1.id
left join join_agenda_items_meetings am2 
    on  am2.agenda_item_id = am1.agenda_item_id
    and am2.meeting_id <> am1.meeting_id
left join meetings m2 
    on  m2.id = am2.meeting_id
    and m2.date < m1.date
where m1.id = @meeting_id
group by m1.id, am1.agenda_item_id;

http://rextester.com/JKK60222

To be true, i was surprised that you can use max(m2.date) in the subquery.

Another solution - Use the second query in a subquery (derived table). Join committees over meetings and the join table using max_date. Only keep rows with an acronym and rows without a max_date.

select t.*, c.acronym
from (
    select am1.meeting_id,
           am1.agenda_item_id,
           m1.date as meeting_date,
           max(m2.date) as max_date
    from meetings m1
    join join_agenda_items_meetings am1 on am1.meeting_id = m1.id
    left join join_agenda_items_meetings am2 
        on  am2.agenda_item_id = am1.agenda_item_id
        and am2.meeting_id <> am1.meeting_id
    left join meetings m2 
        on  m2.id = am2.meeting_id
        and m2.date < m1.date
    where m1.id = @meeting_id
    group by m1.id, am1.agenda_item_id
) t
left join join_agenda_items_meetings am
    on  am.agenda_item_id = t.agenda_item_id
    and t.max_date is not null
left join meetings m
    on  m.id   = am.meeting_id
    and m.date = t.max_date
left join committees c on c.id = m.committee_id
where t.max_date is null or c.acronym is not null;

http://rextester.com/BBMDFL23101

like image 80
Paul Spiegel Avatar answered Oct 14 '22 02:10

Paul Spiegel


Using your schema I used the below query, assuming that all meetings entries are sequential:

 set @mymeeting = 2;

 select j.agenda_item_id, m.id, m.date, mp.date, c.acronym
 from meetings m 
 left join join_agenda_items_meetings j on j.meeting_id = m.id
 left join join_agenda_items_meetings jp on jp.meeting_id = m.id -1 and jp.agenda_item_id = j.agenda_item_id
 left join meetings mp on mp.id = jp.meeting_id
 left join committees c on mp.committee_id = c.id
 where m.id = @mymeeting;

I create a variable just to make it easy to change meetings on the fly.

Here is a functional example in Rextester

Thanks for making your schema so easy to reproduce!

like image 24
EoinS Avatar answered Oct 14 '22 02:10

EoinS


I found this problem quite challenging, and the results I achieved are not jaw-dropping, but I managed to get rid of one of the sub-queries and maybe of a few joins, and this is result:

select    distinct me.ID, me.DATE, rel.AGENDA_ITEM_ID, sub.MAX_DATE, co.ACRONYM
from      MEETINGS me
join      JOIN_AGENDA_ITEMS_MEETINGS rel /* Note 1*/
  on      me.ID = rel.MEETING_ID
left join (   
              select  t1.AGENDA_ITEM_ID, max(t2.DATE) MAX_DATE
              from    JOIN_AGENDA_ITEMS_MEETINGS t1
              join    MEETINGS t2
                on    t2.ID = t1.MEETING_ID
              where   t2.DATE < '2017-02-02'
              group by t1.AGENDA_ITEM_ID
          ) sub
  on      rel.AGENDA_ITEM_ID = sub.AGENDA_ITEM_ID /* Note 2 */
left join JOIN_AGENDA_ITEMS_MEETINGS rel2
  on      rel2.AGENDA_ITEM_ID = rel.AGENDA_ITEM_ID /* Note 3 */
left join MEETINGS me2
  on      rel2.MEETING_ID = me2.ID and
          sub.MAX_DATE = me2.DATE /* Note 4 */
left join COMMITTEES co
  on      co.ID = me2.COMMITTEE_ID
where     me.ID = 2 and
          (sub.MAX_DATE is null or me2.DATE is not null) /* Note 5 */
order by  rel.AGENDA_ITEM_ID, rel2.MEETING_ID;

Notes

  1. you don't need the join with AGENDA_ITEMS, since the ID is already available in the relationship table

  2. up to this point we have current meeting, its agenda items and their "calculated" max date

  3. we get all meetings of each agenda item...

  4. ...so that we can pick the meeting whom date matches the max date we calculated previously

  5. this condition is needed because all the joins from rel2 on have to be left (because some agenda item may have no previous meeting and hence MAX_DATE = null) but this way me2 would give some agenda items undesired meetings.

like image 21
Stefano Zanini Avatar answered Oct 14 '22 03:10

Stefano Zanini