Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Getting top rows with the latest related row in joined table quickly

There are two tables conversations and messages, I want to fetch conversations along with the content of their latest message.

conversations - id(PRIMARY KEY), name, created_at

messages - id, content, created_at, conversation_id

Currently we are running this query to get the required data

SELECT
    conversations.id,
    m.content AS last_message_content,
    m.created_at AS last_message_at
FROM
    conversations
INNER JOIN messages m ON conversations.id = m.conversation_id
                     AND m.id = (
    SELECT
        id
    FROM
        messages _m
    WHERE
        m.conversation_id = _m.conversation_id
    ORDER BY
        created_at DESC
    LIMIT 1)
ORDER BY
    last_message_at DESC
LIMIT 15
OFFSET 0

The above query is returning the valid data but its performance decreases with the increasing number of rows. Is there any other way to write this query with increased performance? Attaching the fiddle for example.

http://sqlfiddle.com/#!17/2decb/2

Also tried the suggestions in one of the deleted answers:

SELECT DISTINCT ON (c.id)
       c.id,
       m.content AS last_message_content,
       m.created_at AS last_message_at
  FROM conversations AS c
 INNER JOIN messages AS m
    ON c.id = m.conversation_id 
 ORDER BY c.id, m.created_at DESC
 LIMIT 15 OFFSET 0

http://sqlfiddle.com/#!17/2decb/5

But the problem with this query is it doesn't sort by m.created_at. I want the resultset to be sorted by m.created_at DESC

like image 472
Rajdeep Singh Avatar asked Nov 25 '21 19:11

Rajdeep Singh


People also ask

Can we use top with Group By clause?

Typically, these are accomplished using the TOP or LIMIT clause. Problem is, Top N result sets are limited to the highest values in the table, without any grouping. The GROUP BY clause can help with that, but it is limited to the single top result for each group.

How can I get the first row of multiple rows in SQL?

To do that, you can use the ROW_NUMBER() function. In OVER() , you specify the groups into which the rows should be divided ( PARTITION BY ) and the order in which the numbers should be assigned to the rows ( ORDER BY ).

What is the query to fetch the first record from the Employee table?

To get the first and last record, use UNION. LIMIT is also used to get the number of records you want.


Video Answer


9 Answers

Since no column other than id from conversations is selected in the result you can only query the messages table to reduce time (Query1). If there is a possibility to have any messages for which conversation_id is not available in conversations table and you don't want to select those then you can use second query.

Schema and insert statements:

CREATE TABLE conversations 
(
    id INT, 
    name VARCHAR(200), 
    created_at DATE
);

INSERT INTO conversations VALUES (1, 'CONV1', '1 DEC 2021');
INSERT INTO conversations VALUES (2, 'CONV2', '1 DEC 2021');

CREATE TABLE messages 
(
    id INT, 
    content VARCHAR(200), 
    created_at DATE, 
    conversation_id INT
);

INSERT INTO messages VALUES (1, 'TEST 3', '12 DEC 2021', 1);
INSERT INTO messages VALUES (2, 'TEST 2', '11 DEC 2021', 1);
INSERT INTO messages VALUES (3, 'TEST 1', '10 DEC 2021', 1);
INSERT INTO messages VALUES (4, 'TEST CONV2 1', '10 DEC 2021', 2);

Query:

 WITH Latest_Conversation_Messages AS
 ( 
     SELECT
         conversation_id, content, created_at, 
         ROW_NUMBER() OVER (PARTITION BY conversation_id ORDER BY created_at DESC) rn 
     FROM
         messages
 )
 SELECT
     conversation_id, content AS last_message_content,
     created_at AS last_message_at 
 FROM
     Latest_Conversation_Messages 
 WHERE
     rn = 1

Output:

conversation_id last_message_content last_message_at
1 TEST 3 2021-12-12
2 TEST CONV2 1 2021-12-10

Query2:

WITH Latest_Conversation_Messages AS
( 
    SELECT
        conversation_id, content, created_at , 
        ROW_NUMBER() OVER (PARTITION BY conversation_id ORDER BY created_at DESC) rn 
    FROM
        messages m
    WHERE
        EXISTS (SELECT 1 FROM conversations c 
                WHERE c.id = m.conversation_id)
)
SELECT 
    conversation_id, content AS last_message_content,
    created_at AS last_message_at 
FROM
    Latest_Conversation_Messages 
WHERE
    rn = 1

Output:

conversation_id last_message_content last_message_at
1 TEST 3 2021-12-12
2 TEST CONV2 1 2021-12-10

db<>fiddle here

like image 99
Kazi Mohammad Ali Nur Avatar answered Oct 21 '22 00:10

Kazi Mohammad Ali Nur


Using not exists sub-statement ,

select
conversations.id,
m.content AS last_message_content,
m.created_at AS last_message_at
from conversations,messages m
where conversations.id = m.conversation_id 
and not exists (select 1 from messages n 
where m.conversation_id = n.conversation_id 
and m.created_at < n.created_at)

create an index on conversation_id of table messages would be good.

like image 23
ElapsedSoul Avatar answered Oct 21 '22 00:10

ElapsedSoul


Before measuring performance and comparing explains from different queries it usually worth to try and mimic your production setup first or you may (and almost certainly will) get a misleading EXPLAIN path (e.g. seq scan is used instead of index scan on small tables as sequential will be faster than random IO in such case)

I tried to tackle it in this way:

First - generate 200K conversations within the past 30 days

insert into conversations(id, name, created_at)
select 
    generate_series, 
    'CONVERSATION_'||generate_series, 
    NOW() - (random() * (interval '30 days')) 
from generate_series(1, 200000);

Second - generate 2M messages randomly distributed among 200K conversations and then also intentionally created 5 more "most recent" messages for conversation with ID=999, so that the conversation 999 has to always appear on top of query result.

insert into messages(id, content, conversation_id, created_at)
select msg.id, content, conversation_id, created_at + (random() * (interval '7 days')) from (
    select distinct
        generate_series as id, 
        'Message content ' || generate_series as content,
        1 + trunc(random() * 200000) as conversation_id
    from generate_series(1, 2000000)
) msg
join conversations c on c.id = conversation_id;

insert into messages(id, content, conversation_id, created_at)
select 
    generate_series as id, 
    'Message content ' || generate_series as content,
    999 as conversation_id,
    now() + interval '7 day' + (random() * (interval '7 days'))
from generate_series(2000001, 2000006);

And now you can try and compare (now with a little bit more confidence) those EXPLAINs to see which query works better.

Assuming you added the proposed index

CREATE INDEX idx1 ON messages(conversation_id, created_at desc)
  • Both @GoonerForLife and @asinkxcoswt answers are pretty good, though the result is moderate because of window function usage

with cost=250000 on average and 2 to 3 seconds execution time on my machine

  • @SalmanA and @ESG answers are twice as fast even though the lateral join will force query planner to choose sequential scan (this is inevitable as the join is on TRUE, so no index could be used)

with cost ~150000 on average and 1-1.5 seconds execution time on my machine


It may not be obvious at first, but the @ElapsedSoul's answers with NOT EXISTS is quite close to ideal (though it still needs a couple of tweaks):

(1) It lacks the order by and limit:

select 
    conversations.id, 
    m.content AS last_message_content, 
    m.created_at AS last_message_at
from conversations,messages m
where conversations.id = m.conversation_id and not exists (
    select 1 from messages n 
    where m.conversation_id = n.conversation_id and m.created_at < n.created_at
) order by last_message_at desc
limit 15

And (2) Since there is date-to-date comparison inside the NOT EXISTS subquery - we have to add another index on massages table

CREATE INDEX ix2 ON messages(created_at desc);

After that we should get a decent performance gain. For example on my machine it resulted in 0.036ms execution time and 20.07 cost

like image 2
dshelya Avatar answered Oct 20 '22 23:10

dshelya


12 answers already. (!) And yet, here is number 13 with a query to outperform all of them.

All you need is an index on messages.created_at for this to be fast. Like:

CREATE INDEX ON messages (created_at DESC);

Works with ascending order, too, almost at the same speed

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT conversation_id, content, created_at, ARRAY[conversation_id] AS latest_ids
   FROM   messages
   ORDER  BY created_at DESC
   LIMIT  1
   )
   UNION ALL
   SELECT m.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT m.conversation_id, m.content, m.created_at, c.latest_ids || m.conversation_id
      FROM   messages m
      WHERE  m.created_at <= c.created_at
      AND    m.conversation_id <> ALL (c.latest_ids)
      ORDER  BY m.created_at DESC
      LIMIT  1
      ) m
   -- WHERE  cardinality(c.latest_ids) < 15  -- not necessary
   )
SELECT conversation_id
     , content    AS last_message_content
     , created_at AS last_message_at
FROM   cte
LIMIT  15;

db<>fiddle here

Why? How?

Key is your small LIMIT, taking only the 15 "latest" conversations. (The ones with the latest messages, to be precise.) So very few. There are better query styles to get all or most conversations with their respective latest message.

Your table is big (you mentioned an "increasing number of rows") so the all-essential virtue must be to avoid a sequential scan over the big table - or even an index-scan over the whole table. That's what this query achieves.

The query emulates an index-skip scan with a recursive CTE. The non-recursive part takes the latest message, the recursive part takes the next-latest message from a different conversation. latest_ids keeps track of the selected conversations to avoid dupes.

Postgres will stop recursion as soon as enough rows have been selected to meet the outer LIMIT. So the added, but commented, break condition in the recursive term is not needed.

Unless there are too many messages in each conversation, so that a large amount of rows would have to be filtered (seems extremely unlikely), this should be as good as it gets.

See:

  • Optimize GROUP BY query to retrieve latest row per user (paragraph 1a)
  • SELECT DISTINCT is slower than expected on my table in PostgreSQL
  • Select first row in each GROUP BY group?

Oh, and we don't need to involve the table conversations at all. We get conversation_id from table messages and that's all you want.

like image 2
Erwin Brandstetter Avatar answered Oct 20 '22 22:10

Erwin Brandstetter


Have you tried a lateral join instead?

SELECT
    conversations.id,
    m.content AS last_message_content,
    m.created_at AS last_message_at
FROM "conversations" 
INNER JOIN LATERAL (
  SELECT content, created_at 
  FROM  messages m
  WHERE conversations.id = m.conversation_id 
  ORDER BY created_at DESC
  FETCH FIRST 1 ROW ONLY
) m ON TRUE
ORDER BY last_message_at DESC
LIMIT 15 OFFSET 0
like image 1
ESG Avatar answered Oct 20 '22 22:10

ESG


Try this one.

WITH cte AS
(
    SELECT
        conversations.id,
        m.content AS last_message_content,
        m.created_at AS last_message_at,
        MAX(M.created_at) OVER(PARTITION BY conversations.id) = M.created_at AS isLatest
    FROM conversations 
    INNER JOIN messages m 
      ON conversations.id = m.conversation_id
)
SELECT ID, last_message_content, last_message_at
FROM cte 
WHERE isLatest
ORDER BY last_message_at DESC
LIMIT 15 OFFSET 0
like image 1
GoonerForLife Avatar answered Oct 21 '22 00:10

GoonerForLife


I second the answer with lateral view, and I may suggest a Transact-SQL variant with CROSS-APPLY

SELECT
    conversations.id,
    m.content AS last_message_content,
    m.created_at AS last_message_at
FROM "conversations" 
outer apply (
  SELECT top 1 content, created_at 
  FROM  messages m
  WHERE conversations.id = m.conversation_id 
  ORDER BY created_at DESC
) m
 ORDER BY
    last_message_at DESC
LIMIT 15 OFFSET 0 
like image 1
Igor N. Avatar answered Oct 20 '22 23:10

Igor N.


Another version worth trying

SELECT *
FROM (
SELECT
    cv.id,
    ms.content AS last_message_content,
    ms.created_at AS last_message_at,
    row_number() over (partition by cv.id order by ms.created_at desc) rank
FROM conversations cv
JOIN messages ms on (cv.id = ms.conversation_id)
) t
WHERE t.rank = 1
ORDER BY t.last_message_at DESC
LIMIT 15 OFFSET 0;

Since we only need the first n (15) conversations, if the database optimizer fails to figure out this (need to test against the actual data to see), it will cause full table scan.

To help the optimizer, we can tell it to select first n rows from conversations before joining.

WITH cv as (
SELECT id
, created_at
from conversations
ORDER BY created_at DESC
LIMIT 15 OFFSET 0
)
SELECT *
FROM (
SELECT
    cv.id,
    ms.content AS last_message_content,
    ms.created_at AS last_message_at,
    row_number() over (partition by cv.id order by ms.created_at desc) rank
FROM cv
JOIN messages ms on (cv.id = ms.conversation_id)
) t
WHERE t.rank = 1
ORDER BY t.last_message_at DESC
;

But with limitation that you will get the latest conversations instead of the latest messages. To fix this, you can add last_updated_at to the conversations.

Please note that an index on conversations.created_at (or last_updated_at if you want) is a critical factor here, please don't fotget to have one.

like image 1
asinkxcoswt Avatar answered Oct 20 '22 22:10

asinkxcoswt


Bringing together a few of the different ideas in here together, the select distinct on with a subquery for sort, and using only the messages table since it houses all the required information.

SELECT
conversation_id AS id,
content AS last_message_content,
created_at AS last_message_at
  FROM
     (SELECT DISTINCT ON (conversation_id)
         conversation_id,
         content,
         created_at
       FROM messages
       ORDER BY conversation_id, created_at DESC
       LIMIT 15 OFFSET 0) 
       as a
ORDER BY created_at DESC

http://sqlfiddle.com/#!17/2decb/134/0 - cost of 16.9

I'm not sure how big the actual dataset it, and while a partition as used by others may perform well on the sqlfiddle available it could cause some longer run times over a larger set.

You could also set up the above as a CTE, depending on preference -

WITH SUBQUERY AS
(SELECT DISTINCT ON (conversation_id)
   conversation_id,
   content,
   created_at
  FROM messages
  ORDER BY conversation_id, created_at DESC
  LIMIT 15 OFFSET 0)

SELECT
  conversation_id AS id,
  content AS last_message_content,
  created_at AS last_message_at
FROM SUBQUERY
ORDER BY created_at DESC

http://sqlfiddle.com/#!17/2decb/135/0 - cost of 17

If the join to conversations is required, this would work - http://sqlfiddle.com/#!17/2decb/144

like image 1
procopypaster Avatar answered Oct 20 '22 23:10

procopypaster