<p>There are two tables <code>conversations</code> and <code>messages</code>, I want to fetch conversations along with the content of their latest message.</p> <p><code>conversations</code> - id(PRIMARY KEY), name, created_at</p> <p><code>messages</code> - id, content, created_at, conversation_id</p> <p>Currently we are running this query to get the required data</p> <pre class="prettyprint"><code>SELECT conversations.id, m.content AS last_message_content, m.created_at AS last_message_at FROM conversations INNER JOIN messages m ON conversations.id = m.conversation_id AND m.id = ( SELECT id FROM messages _m WHERE m.conversation_id = _m.conversation_id ORDER BY created_at DESC LIMIT 1) ORDER BY last_message_at DESC LIMIT 15 OFFSET 0 </code></pre> <p>The above query is returning the valid data but its performance decreases with the increasing number of rows. Is there any other way to write this query with increased performance? Attaching the fiddle for example.</p> <p>http://sqlfiddle.com/#!17/2decb/2</p> <p>Also tried the suggestions in one of the deleted answers:</p> <pre class="prettyprint"><code>SELECT DISTINCT ON (c.id) c.id, m.content AS last_message_content, m.created_at AS last_message_at FROM conversations AS c INNER JOIN messages AS m ON c.id = m.conversation_id ORDER BY c.id, m.created_at DESC LIMIT 15 OFFSET 0 </code></pre> <p>http://sqlfiddle.com/#!17/2decb/5</p> <p>But the problem with this query is it doesn't sort by <code>m.created_at</code>. I want the resultset to be sorted by <code>m.created_at DESC</code></p>

<p>Since no column other than id from conversations is selected in the result you can only query the messages table to reduce time (Query1). If there is a possibility to have any messages for which <code>conversation_id</code> is not available in conversations table and you don't want to select those then you can use second query.</p> <p>Schema and insert statements:</p> <pre class="prettyprint"><code>CREATE TABLE conversations ( id INT, name VARCHAR(200), created_at DATE ); INSERT INTO conversations VALUES (1, 'CONV1', '1 DEC 2021'); INSERT INTO conversations VALUES (2, 'CONV2', '1 DEC 2021'); CREATE TABLE messages ( id INT, content VARCHAR(200), created_at DATE, conversation_id INT ); INSERT INTO messages VALUES (1, 'TEST 3', '12 DEC 2021', 1); INSERT INTO messages VALUES (2, 'TEST 2', '11 DEC 2021', 1); INSERT INTO messages VALUES (3, 'TEST 1', '10 DEC 2021', 1); INSERT INTO messages VALUES (4, 'TEST CONV2 1', '10 DEC 2021', 2); </code></pre> <p>Query:</p> <pre class="prettyprint"><code> WITH Latest_Conversation_Messages AS ( SELECT conversation_id, content, created_at, ROW_NUMBER() OVER (PARTITION BY conversation_id ORDER BY created_at DESC) rn FROM messages ) SELECT conversation_id, content AS last_message_content, created_at AS last_message_at FROM Latest_Conversation_Messages WHERE rn = 1 </code></pre> <p>Output:</p> <div class="s-table-container"> <table class="s-table"> <thead><tr> <th style="text-align: right;">conversation_id</th> <th style="text-align: left;">last_message_content</th> <th style="text-align: left;">last_message_at</th> </tr></thead> <tbody> <tr> <td style="text-align: right;">1</td> <td style="text-align: left;">TEST 3</td> <td style="text-align: left;">2021-12-12</td> </tr> <tr> <td style="text-align: right;">2</td> <td style="text-align: left;">TEST CONV2 1</td> <td style="text-align: left;">2021-12-10</td> </tr> </tbody> </table> </div> <p>Query2:</p> <pre class="prettyprint"><code>WITH Latest_Conversation_Messages AS ( SELECT conversation_id, content, created_at , ROW_NUMBER() OVER (PARTITION BY conversation_id ORDER BY created_at DESC) rn FROM messages m WHERE EXISTS (SELECT 1 FROM conversations c WHERE c.id = m.conversation_id) ) SELECT conversation_id, content AS last_message_content, created_at AS last_message_at FROM Latest_Conversation_Messages WHERE rn = 1 </code></pre> <p>Output:</p> <div class="s-table-container"> <table class="s-table"> <thead><tr> <th style="text-align: right;">conversation_id</th> <th style="text-align: left;">last_message_content</th> <th style="text-align: left;">last_message_at</th> </tr></thead> <tbody> <tr> <td style="text-align: right;">1</td> <td style="text-align: left;">TEST 3</td> <td style="text-align: left;">2021-12-12</td> </tr> <tr> <td style="text-align: right;">2</td> <td style="text-align: left;">TEST CONV2 1</td> <td style="text-align: left;">2021-12-10</td> </tr> </tbody> </table> </div> <p><em>db<>fiddle here</em></p>

<p>Before measuring performance and comparing <code>explains</code> from different queries it usually worth to try and mimic your production setup first or you may (and almost certainly will) get a misleading EXPLAIN path (e.g. <code>seq scan</code> is used instead of <code>index scan</code> on small tables as sequential will be faster than random IO in such case)</p> <p>I tried to tackle it in this way:</p> <p>First - generate 200K conversations within the past 30 days</p> <pre class="prettyprint"><code>insert into conversations(id, name, created_at) select generate_series, 'CONVERSATION_'||generate_series, NOW() - (random() * (interval '30 days')) from generate_series(1, 200000); </code></pre> <p>Second - generate 2M messages randomly distributed among 200K conversations and then also intentionally created 5 more "most recent" messages for conversation with ID=999, so that the conversation 999 has to always appear on top of query result.</p> <pre class="prettyprint"><code>insert into messages(id, content, conversation_id, created_at) select msg.id, content, conversation_id, created_at + (random() * (interval '7 days')) from ( select distinct generate_series as id, 'Message content ' || generate_series as content, 1 + trunc(random() * 200000) as conversation_id from generate_series(1, 2000000) ) msg join conversations c on c.id = conversation_id; insert into messages(id, content, conversation_id, created_at) select generate_series as id, 'Message content ' || generate_series as content, 999 as conversation_id, now() + interval '7 day' + (random() * (interval '7 days')) from generate_series(2000001, 2000006); </code></pre> <p>And now you can try and compare (now with a little bit more confidence) those EXPLAINs to see which query works better.</p> <p>Assuming you added the proposed index</p> <pre class="prettyprint"><code>CREATE INDEX idx1 ON messages(conversation_id, created_at desc) </code></pre> <ul> <li>Both @GoonerForLife and @asinkxcoswt answers are pretty good, though the result is moderate because of window function usage</li> </ul> <blockquote> <p><em>with cost=250000 on average and 2 to 3 seconds execution time on my machine</em></p> </blockquote> <ul> <li>@SalmanA and @ESG answers are twice as fast even though the <code>lateral join</code> will force query planner to choose sequential scan (this is inevitable as the <code>join</code> is <code>on TRUE</code>, so no index could be used)</li> </ul> <blockquote> <p><em>with cost ~150000 on average and 1-1.5 seconds execution time on my machine</em></p> </blockquote> <hr> <p>It may not be obvious at first, but the @ElapsedSoul's answers with NOT EXISTS is quite close to ideal (though it still needs a couple of tweaks):</p> <p>(1) It lacks the order by and limit:</p> <pre class="prettyprint"><code>select conversations.id, m.content AS last_message_content, m.created_at AS last_message_at from conversations,messages m where conversations.id = m.conversation_id and not exists ( select 1 from messages n where m.conversation_id = n.conversation_id and m.created_at < n.created_at ) order by last_message_at desc limit 15 </code></pre> <p>And (2) Since there is date-to-date comparison inside the NOT EXISTS subquery - we have to add another index on massages table</p> <pre class="prettyprint"><code>CREATE INDEX ix2 ON messages(created_at desc); </code></pre> <p>After that we should get a decent performance gain. For example on my machine it resulted in <code>0.036ms</code> execution time and <code>20.07 cost</code></p>

<p>Another version worth trying</p> <pre class="prettyprint"><code>SELECT * FROM ( SELECT cv.id, ms.content AS last_message_content, ms.created_at AS last_message_at, row_number() over (partition by cv.id order by ms.created_at desc) rank FROM conversations cv JOIN messages ms on (cv.id = ms.conversation_id) ) t WHERE t.rank = 1 ORDER BY t.last_message_at DESC LIMIT 15 OFFSET 0; </code></pre> <p>Since we only need the first <code>n</code> (15) conversations, if the database optimizer fails to figure out this (need to test against the actual data to see), it will cause full table scan.</p> <p>To help the optimizer, we can tell it to select first <code>n</code> rows from <code>conversations</code> before joining.</p> <pre class="prettyprint"><code>WITH cv as ( SELECT id , created_at from conversations ORDER BY created_at DESC LIMIT 15 OFFSET 0 ) SELECT * FROM ( SELECT cv.id, ms.content AS last_message_content, ms.created_at AS last_message_at, row_number() over (partition by cv.id order by ms.created_at desc) rank FROM cv JOIN messages ms on (cv.id = ms.conversation_id) ) t WHERE t.rank = 1 ORDER BY t.last_message_at DESC ; </code></pre> <p>But with limitation that you will get the latest <code>conversations</code> instead of the latest <code>messages</code>. To fix this, you can add <code>last_updated_at</code> to the <code>conversations</code>.</p> <p>Please note that an index on <code>conversations.created_at</code> (or <code>last_updated_at</code> if you want) is a critical factor here, please don't fotget to have one.</p>

Getting top rows with the latest related row in joined table quickly

Tags:

sql

postgresql

greatest-n-per-group

postgresql-performance

postgresql-13

There are two tables conversations and messages, I want to fetch conversations along with the content of their latest message.

conversations - id(PRIMARY KEY), name, created_at

messages - id, content, created_at, conversation_id

Currently we are running this query to get the required data

SELECT
    conversations.id,
    m.content AS last_message_content,
    m.created_at AS last_message_at
FROM
    conversations
INNER JOIN messages m ON conversations.id = m.conversation_id
                     AND m.id = (
    SELECT
        id
    FROM
        messages _m
    WHERE
        m.conversation_id = _m.conversation_id
    ORDER BY
        created_at DESC
    LIMIT 1)
ORDER BY
    last_message_at DESC
LIMIT 15
OFFSET 0

The above query is returning the valid data but its performance decreases with the increasing number of rows. Is there any other way to write this query with increased performance? Attaching the fiddle for example.

http://sqlfiddle.com/#!17/2decb/2

Also tried the suggestions in one of the deleted answers:

SELECT DISTINCT ON (c.id)
       c.id,
       m.content AS last_message_content,
       m.created_at AS last_message_at
  FROM conversations AS c
 INNER JOIN messages AS m
    ON c.id = m.conversation_id 
 ORDER BY c.id, m.created_at DESC
 LIMIT 15 OFFSET 0

http://sqlfiddle.com/#!17/2decb/5

But the problem with this query is it doesn't sort by m.created_at. I want the resultset to be sorted by m.created_at DESC

472

asked Nov 25 '21 19:11

Rajdeep Singh

Video Answer

9 Answers

Since no column other than id from conversations is selected in the result you can only query the messages table to reduce time (Query1). If there is a possibility to have any messages for which conversation_id is not available in conversations table and you don't want to select those then you can use second query.

Schema and insert statements:

CREATE TABLE conversations 
(
    id INT, 
    name VARCHAR(200), 
    created_at DATE
);

INSERT INTO conversations VALUES (1, 'CONV1', '1 DEC 2021');
INSERT INTO conversations VALUES (2, 'CONV2', '1 DEC 2021');

CREATE TABLE messages 
(
    id INT, 
    content VARCHAR(200), 
    created_at DATE, 
    conversation_id INT
);

INSERT INTO messages VALUES (1, 'TEST 3', '12 DEC 2021', 1);
INSERT INTO messages VALUES (2, 'TEST 2', '11 DEC 2021', 1);
INSERT INTO messages VALUES (3, 'TEST 1', '10 DEC 2021', 1);
INSERT INTO messages VALUES (4, 'TEST CONV2 1', '10 DEC 2021', 2);

Query:

 WITH Latest_Conversation_Messages AS
 ( 
     SELECT
         conversation_id, content, created_at, 
         ROW_NUMBER() OVER (PARTITION BY conversation_id ORDER BY created_at DESC) rn 
     FROM
         messages
 )
 SELECT
     conversation_id, content AS last_message_content,
     created_at AS last_message_at 
 FROM
     Latest_Conversation_Messages 
 WHERE
     rn = 1

Output:

conversation_id	last_message_content	last_message_at
1	TEST 3	2021-12-12
2	TEST CONV2 1	2021-12-10

Query2:

WITH Latest_Conversation_Messages AS
( 
    SELECT
        conversation_id, content, created_at , 
        ROW_NUMBER() OVER (PARTITION BY conversation_id ORDER BY created_at DESC) rn 
    FROM
        messages m
    WHERE
        EXISTS (SELECT 1 FROM conversations c 
                WHERE c.id = m.conversation_id)
)
SELECT 
    conversation_id, content AS last_message_content,
    created_at AS last_message_at 
FROM
    Latest_Conversation_Messages 
WHERE
    rn = 1

Output:

conversation_id	last_message_content	last_message_at
1	TEST 3	2021-12-12
2	TEST CONV2 1	2021-12-10

db<>fiddle here

answered Oct 21 '22 00:10

Kazi Mohammad Ali Nur

Using not exists sub-statement ,

select
conversations.id,
m.content AS last_message_content,
m.created_at AS last_message_at
from conversations,messages m
where conversations.id = m.conversation_id 
and not exists (select 1 from messages n 
where m.conversation_id = n.conversation_id 
and m.created_at < n.created_at)

create an index on conversation_id of table messages would be good.

answered Oct 21 '22 00:10

ElapsedSoul

Before measuring performance and comparing explains from different queries it usually worth to try and mimic your production setup first or you may (and almost certainly will) get a misleading EXPLAIN path (e.g. seq scan is used instead of index scan on small tables as sequential will be faster than random IO in such case)

I tried to tackle it in this way:

First - generate 200K conversations within the past 30 days

insert into conversations(id, name, created_at)
select 
    generate_series, 
    'CONVERSATION_'||generate_series, 
    NOW() - (random() * (interval '30 days')) 
from generate_series(1, 200000);

Second - generate 2M messages randomly distributed among 200K conversations and then also intentionally created 5 more "most recent" messages for conversation with ID=999, so that the conversation 999 has to always appear on top of query result.

insert into messages(id, content, conversation_id, created_at)
select msg.id, content, conversation_id, created_at + (random() * (interval '7 days')) from (
    select distinct
        generate_series as id, 
        'Message content ' || generate_series as content,
        1 + trunc(random() * 200000) as conversation_id
    from generate_series(1, 2000000)
) msg
join conversations c on c.id = conversation_id;

insert into messages(id, content, conversation_id, created_at)
select 
    generate_series as id, 
    'Message content ' || generate_series as content,
    999 as conversation_id,
    now() + interval '7 day' + (random() * (interval '7 days'))
from generate_series(2000001, 2000006);

And now you can try and compare (now with a little bit more confidence) those EXPLAINs to see which query works better.

Assuming you added the proposed index

CREATE INDEX idx1 ON messages(conversation_id, created_at desc)

Both @GoonerForLife and @asinkxcoswt answers are pretty good, though the result is moderate because of window function usage

with cost=250000 on average and 2 to 3 seconds execution time on my machine

@SalmanA and @ESG answers are twice as fast even though the lateral join will force query planner to choose sequential scan (this is inevitable as the join is on TRUE, so no index could be used)

with cost ~150000 on average and 1-1.5 seconds execution time on my machine

It may not be obvious at first, but the @ElapsedSoul's answers with NOT EXISTS is quite close to ideal (though it still needs a couple of tweaks):

(1) It lacks the order by and limit:

select 
    conversations.id, 
    m.content AS last_message_content, 
    m.created_at AS last_message_at
from conversations,messages m
where conversations.id = m.conversation_id and not exists (
    select 1 from messages n 
    where m.conversation_id = n.conversation_id and m.created_at < n.created_at
) order by last_message_at desc
limit 15

And (2) Since there is date-to-date comparison inside the NOT EXISTS subquery - we have to add another index on massages table

CREATE INDEX ix2 ON messages(created_at desc);

After that we should get a decent performance gain. For example on my machine it resulted in 0.036ms execution time and 20.07 cost

answered Oct 20 '22 23:10

dshelya

12 answers already. (!) And yet, here is number 13 with a query to outperform all of them.

All you need is an index on messages.created_at for this to be fast. Like:

CREATE INDEX ON messages (created_at DESC);

Works with ascending order, too, almost at the same speed

WITH RECURSIVE cte AS (
   (  -- parentheses required
   SELECT conversation_id, content, created_at, ARRAY[conversation_id] AS latest_ids
   FROM   messages
   ORDER  BY created_at DESC
   LIMIT  1
   )
   UNION ALL
   SELECT m.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT m.conversation_id, m.content, m.created_at, c.latest_ids || m.conversation_id
      FROM   messages m
      WHERE  m.created_at <= c.created_at
      AND    m.conversation_id <> ALL (c.latest_ids)
      ORDER  BY m.created_at DESC
      LIMIT  1
      ) m
   -- WHERE  cardinality(c.latest_ids) < 15  -- not necessary
   )
SELECT conversation_id
     , content    AS last_message_content
     , created_at AS last_message_at
FROM   cte
LIMIT  15;

db<>fiddle here

Why? How?

Key is your small LIMIT, taking only the 15 "latest" conversations. (The ones with the latest messages, to be precise.) So very few. There are better query styles to get all or most conversations with their respective latest message.

Your table is big (you mentioned an "increasing number of rows") so the all-essential virtue must be to avoid a sequential scan over the big table - or even an index-scan over the whole table. That's what this query achieves.

The query emulates an index-skip scan with a recursive CTE. The non-recursive part takes the latest message, the recursive part takes the next-latest message from a different conversation. latest_ids keeps track of the selected conversations to avoid dupes.

Postgres will stop recursion as soon as enough rows have been selected to meet the outer LIMIT. So the added, but commented, break condition in the recursive term is not needed.

Unless there are too many messages in each conversation, so that a large amount of rows would have to be filtered (seems extremely unlikely), this should be as good as it gets.

See:

Optimize GROUP BY query to retrieve latest row per user (paragraph 1a)
SELECT DISTINCT is slower than expected on my table in PostgreSQL
Select first row in each GROUP BY group?

Oh, and we don't need to involve the table conversations at all. We get conversation_id from table messages and that's all you want.

answered Oct 20 '22 22:10

Erwin Brandstetter

Have you tried a lateral join instead?

SELECT
    conversations.id,
    m.content AS last_message_content,
    m.created_at AS last_message_at
FROM "conversations" 
INNER JOIN LATERAL (
  SELECT content, created_at 
  FROM  messages m
  WHERE conversations.id = m.conversation_id 
  ORDER BY created_at DESC
  FETCH FIRST 1 ROW ONLY
) m ON TRUE
ORDER BY last_message_at DESC
LIMIT 15 OFFSET 0

answered Oct 20 '22 22:10

ESG

Try this one.

WITH cte AS
(
    SELECT
        conversations.id,
        m.content AS last_message_content,
        m.created_at AS last_message_at,
        MAX(M.created_at) OVER(PARTITION BY conversations.id) = M.created_at AS isLatest
    FROM conversations 
    INNER JOIN messages m 
      ON conversations.id = m.conversation_id
)
SELECT ID, last_message_content, last_message_at
FROM cte 
WHERE isLatest
ORDER BY last_message_at DESC
LIMIT 15 OFFSET 0

answered Oct 21 '22 00:10

GoonerForLife

I second the answer with lateral view, and I may suggest a Transact-SQL variant with CROSS-APPLY

SELECT
    conversations.id,
    m.content AS last_message_content,
    m.created_at AS last_message_at
FROM "conversations" 
outer apply (
  SELECT top 1 content, created_at 
  FROM  messages m
  WHERE conversations.id = m.conversation_id 
  ORDER BY created_at DESC
) m
 ORDER BY
    last_message_at DESC
LIMIT 15 OFFSET 0

answered Oct 20 '22 23:10

Igor N.

Another version worth trying

SELECT *
FROM (
SELECT
    cv.id,
    ms.content AS last_message_content,
    ms.created_at AS last_message_at,
    row_number() over (partition by cv.id order by ms.created_at desc) rank
FROM conversations cv
JOIN messages ms on (cv.id = ms.conversation_id)
) t
WHERE t.rank = 1
ORDER BY t.last_message_at DESC
LIMIT 15 OFFSET 0;

Since we only need the first n (15) conversations, if the database optimizer fails to figure out this (need to test against the actual data to see), it will cause full table scan.

To help the optimizer, we can tell it to select first n rows from conversations before joining.

WITH cv as (
SELECT id
, created_at
from conversations
ORDER BY created_at DESC
LIMIT 15 OFFSET 0
)
SELECT *
FROM (
SELECT
    cv.id,
    ms.content AS last_message_content,
    ms.created_at AS last_message_at,
    row_number() over (partition by cv.id order by ms.created_at desc) rank
FROM cv
JOIN messages ms on (cv.id = ms.conversation_id)
) t
WHERE t.rank = 1
ORDER BY t.last_message_at DESC
;

But with limitation that you will get the latest conversations instead of the latest messages. To fix this, you can add last_updated_at to the conversations.

Please note that an index on conversations.created_at (or last_updated_at if you want) is a critical factor here, please don't fotget to have one.

answered Oct 20 '22 22:10

asinkxcoswt

Bringing together a few of the different ideas in here together, the select distinct on with a subquery for sort, and using only the messages table since it houses all the required information.

SELECT
conversation_id AS id,
content AS last_message_content,
created_at AS last_message_at
  FROM
     (SELECT DISTINCT ON (conversation_id)
         conversation_id,
         content,
         created_at
       FROM messages
       ORDER BY conversation_id, created_at DESC
       LIMIT 15 OFFSET 0) 
       as a
ORDER BY created_at DESC

http://sqlfiddle.com/#!17/2decb/134/0 - cost of 16.9

I'm not sure how big the actual dataset it, and while a partition as used by others may perform well on the sqlfiddle available it could cause some longer run times over a larger set.

You could also set up the above as a CTE, depending on preference -

WITH SUBQUERY AS
(SELECT DISTINCT ON (conversation_id)
   conversation_id,
   content,
   created_at
  FROM messages
  ORDER BY conversation_id, created_at DESC
  LIMIT 15 OFFSET 0)

SELECT
  conversation_id AS id,
  content AS last_message_content,
  created_at AS last_message_at
FROM SUBQUERY
ORDER BY created_at DESC

http://sqlfiddle.com/#!17/2decb/135/0 - cost of 17

If the join to conversations is required, this would work - http://sqlfiddle.com/#!17/2decb/144

answered Oct 20 '22 23:10

procopypaster

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Getting top rows with the latest related row in joined table quickly

Tags:

sql

postgresql

greatest-n-per-group

postgresql-performance

postgresql-13

Rajdeep Singh

People also ask

Video Answer

9 Answers

Kazi Mohammad Ali Nur

ElapsedSoul

dshelya

Why? How?

Erwin Brandstetter

ESG

GoonerForLife

Igor N.

asinkxcoswt

procopypaster

Recent Activity

Donate For Us

Getting top rows with the latest related row in joined table quickly

Tags:

sql

postgresql

greatest-n-per-group

postgresql-performance

postgresql-13

Rajdeep Singh

People also ask

Video Answer

9 Answers

Kazi Mohammad Ali Nur

ElapsedSoul

dshelya

Why? How?

Erwin Brandstetter

ESG

GoonerForLife

Igor N.

asinkxcoswt

procopypaster

Related questions

Recent Activity

Donate For Us