There are two tables conversations
and messages
, I want to fetch conversations along with the content of their latest message.
conversations
- id(PRIMARY KEY), name, created_at
messages
- id, content, created_at, conversation_id
Currently we are running this query to get the required data
SELECT
conversations.id,
m.content AS last_message_content,
m.created_at AS last_message_at
FROM
conversations
INNER JOIN messages m ON conversations.id = m.conversation_id
AND m.id = (
SELECT
id
FROM
messages _m
WHERE
m.conversation_id = _m.conversation_id
ORDER BY
created_at DESC
LIMIT 1)
ORDER BY
last_message_at DESC
LIMIT 15
OFFSET 0
The above query is returning the valid data but its performance decreases with the increasing number of rows. Is there any other way to write this query with increased performance? Attaching the fiddle for example.
http://sqlfiddle.com/#!17/2decb/2
Also tried the suggestions in one of the deleted answers:
SELECT DISTINCT ON (c.id)
c.id,
m.content AS last_message_content,
m.created_at AS last_message_at
FROM conversations AS c
INNER JOIN messages AS m
ON c.id = m.conversation_id
ORDER BY c.id, m.created_at DESC
LIMIT 15 OFFSET 0
http://sqlfiddle.com/#!17/2decb/5
But the problem with this query is it doesn't sort by m.created_at
. I want the resultset to be sorted by m.created_at DESC
Typically, these are accomplished using the TOP or LIMIT clause. Problem is, Top N result sets are limited to the highest values in the table, without any grouping. The GROUP BY clause can help with that, but it is limited to the single top result for each group.
To do that, you can use the ROW_NUMBER() function. In OVER() , you specify the groups into which the rows should be divided ( PARTITION BY ) and the order in which the numbers should be assigned to the rows ( ORDER BY ).
To get the first and last record, use UNION. LIMIT is also used to get the number of records you want.
Since no column other than id from conversations is selected in the result you can only query the messages table to reduce time (Query1). If there is a possibility to have any messages for which conversation_id
is not available in conversations table and you don't want to select those then you can use second query.
Schema and insert statements:
CREATE TABLE conversations
(
id INT,
name VARCHAR(200),
created_at DATE
);
INSERT INTO conversations VALUES (1, 'CONV1', '1 DEC 2021');
INSERT INTO conversations VALUES (2, 'CONV2', '1 DEC 2021');
CREATE TABLE messages
(
id INT,
content VARCHAR(200),
created_at DATE,
conversation_id INT
);
INSERT INTO messages VALUES (1, 'TEST 3', '12 DEC 2021', 1);
INSERT INTO messages VALUES (2, 'TEST 2', '11 DEC 2021', 1);
INSERT INTO messages VALUES (3, 'TEST 1', '10 DEC 2021', 1);
INSERT INTO messages VALUES (4, 'TEST CONV2 1', '10 DEC 2021', 2);
Query:
WITH Latest_Conversation_Messages AS
(
SELECT
conversation_id, content, created_at,
ROW_NUMBER() OVER (PARTITION BY conversation_id ORDER BY created_at DESC) rn
FROM
messages
)
SELECT
conversation_id, content AS last_message_content,
created_at AS last_message_at
FROM
Latest_Conversation_Messages
WHERE
rn = 1
Output:
conversation_id | last_message_content | last_message_at |
---|---|---|
1 | TEST 3 | 2021-12-12 |
2 | TEST CONV2 1 | 2021-12-10 |
Query2:
WITH Latest_Conversation_Messages AS
(
SELECT
conversation_id, content, created_at ,
ROW_NUMBER() OVER (PARTITION BY conversation_id ORDER BY created_at DESC) rn
FROM
messages m
WHERE
EXISTS (SELECT 1 FROM conversations c
WHERE c.id = m.conversation_id)
)
SELECT
conversation_id, content AS last_message_content,
created_at AS last_message_at
FROM
Latest_Conversation_Messages
WHERE
rn = 1
Output:
conversation_id | last_message_content | last_message_at |
---|---|---|
1 | TEST 3 | 2021-12-12 |
2 | TEST CONV2 1 | 2021-12-10 |
db<>fiddle here
Using not exists
sub-statement ,
select
conversations.id,
m.content AS last_message_content,
m.created_at AS last_message_at
from conversations,messages m
where conversations.id = m.conversation_id
and not exists (select 1 from messages n
where m.conversation_id = n.conversation_id
and m.created_at < n.created_at)
create an index on conversation_id
of table messages
would be good.
Before measuring performance and comparing explains
from different queries it usually worth to try and mimic your production setup first or you may (and almost certainly will) get a misleading EXPLAIN path (e.g. seq scan
is used instead of index scan
on small tables as sequential will be faster than random IO in such case)
I tried to tackle it in this way:
First - generate 200K conversations within the past 30 days
insert into conversations(id, name, created_at)
select
generate_series,
'CONVERSATION_'||generate_series,
NOW() - (random() * (interval '30 days'))
from generate_series(1, 200000);
Second - generate 2M messages randomly distributed among 200K conversations and then also intentionally created 5 more "most recent" messages for conversation with ID=999, so that the conversation 999 has to always appear on top of query result.
insert into messages(id, content, conversation_id, created_at)
select msg.id, content, conversation_id, created_at + (random() * (interval '7 days')) from (
select distinct
generate_series as id,
'Message content ' || generate_series as content,
1 + trunc(random() * 200000) as conversation_id
from generate_series(1, 2000000)
) msg
join conversations c on c.id = conversation_id;
insert into messages(id, content, conversation_id, created_at)
select
generate_series as id,
'Message content ' || generate_series as content,
999 as conversation_id,
now() + interval '7 day' + (random() * (interval '7 days'))
from generate_series(2000001, 2000006);
And now you can try and compare (now with a little bit more confidence) those EXPLAINs to see which query works better.
Assuming you added the proposed index
CREATE INDEX idx1 ON messages(conversation_id, created_at desc)
with cost=250000 on average and 2 to 3 seconds execution time on my machine
lateral join
will force query planner to choose sequential scan (this is inevitable as the join
is on TRUE
, so no index could be used)with cost ~150000 on average and 1-1.5 seconds execution time on my machine
It may not be obvious at first, but the @ElapsedSoul's answers with NOT EXISTS is quite close to ideal (though it still needs a couple of tweaks):
(1) It lacks the order by and limit:
select
conversations.id,
m.content AS last_message_content,
m.created_at AS last_message_at
from conversations,messages m
where conversations.id = m.conversation_id and not exists (
select 1 from messages n
where m.conversation_id = n.conversation_id and m.created_at < n.created_at
) order by last_message_at desc
limit 15
And (2) Since there is date-to-date comparison inside the NOT EXISTS subquery - we have to add another index on massages table
CREATE INDEX ix2 ON messages(created_at desc);
After that we should get a decent performance gain.
For example on my machine it resulted in 0.036ms
execution time and 20.07 cost
12 answers already. (!) And yet, here is number 13 with a query to outperform all of them.
All you need is an index on messages.created_at
for this to be fast. Like:
CREATE INDEX ON messages (created_at DESC);
Works with ascending order, too, almost at the same speed
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT conversation_id, content, created_at, ARRAY[conversation_id] AS latest_ids
FROM messages
ORDER BY created_at DESC
LIMIT 1
)
UNION ALL
SELECT m.*
FROM cte c
CROSS JOIN LATERAL (
SELECT m.conversation_id, m.content, m.created_at, c.latest_ids || m.conversation_id
FROM messages m
WHERE m.created_at <= c.created_at
AND m.conversation_id <> ALL (c.latest_ids)
ORDER BY m.created_at DESC
LIMIT 1
) m
-- WHERE cardinality(c.latest_ids) < 15 -- not necessary
)
SELECT conversation_id
, content AS last_message_content
, created_at AS last_message_at
FROM cte
LIMIT 15;
db<>fiddle here
Key is your small LIMIT
, taking only the 15 "latest" conversations. (The ones with the latest messages, to be precise.) So very few. There are better query styles to get all or most conversations with their respective latest message.
Your table is big (you mentioned an "increasing number of rows") so the all-essential virtue must be to avoid a sequential scan over the big table - or even an index-scan over the whole table. That's what this query achieves.
The query emulates an index-skip scan with a recursive CTE. The non-recursive part takes the latest message, the recursive part takes the next-latest message from a different conversation. latest_ids
keeps track of the selected conversations to avoid dupes.
Postgres will stop recursion as soon as enough rows have been selected to meet the outer LIMIT
. So the added, but commented, break condition in the recursive term is not needed.
Unless there are too many messages in each conversation, so that a large amount of rows would have to be filtered (seems extremely unlikely), this should be as good as it gets.
See:
Oh, and we don't need to involve the table conversations at all. We get conversation_id
from table messages
and that's all you want.
Have you tried a lateral join instead?
SELECT
conversations.id,
m.content AS last_message_content,
m.created_at AS last_message_at
FROM "conversations"
INNER JOIN LATERAL (
SELECT content, created_at
FROM messages m
WHERE conversations.id = m.conversation_id
ORDER BY created_at DESC
FETCH FIRST 1 ROW ONLY
) m ON TRUE
ORDER BY last_message_at DESC
LIMIT 15 OFFSET 0
Try this one.
WITH cte AS
(
SELECT
conversations.id,
m.content AS last_message_content,
m.created_at AS last_message_at,
MAX(M.created_at) OVER(PARTITION BY conversations.id) = M.created_at AS isLatest
FROM conversations
INNER JOIN messages m
ON conversations.id = m.conversation_id
)
SELECT ID, last_message_content, last_message_at
FROM cte
WHERE isLatest
ORDER BY last_message_at DESC
LIMIT 15 OFFSET 0
I second the answer with lateral view, and I may suggest a Transact-SQL variant with CROSS-APPLY
SELECT
conversations.id,
m.content AS last_message_content,
m.created_at AS last_message_at
FROM "conversations"
outer apply (
SELECT top 1 content, created_at
FROM messages m
WHERE conversations.id = m.conversation_id
ORDER BY created_at DESC
) m
ORDER BY
last_message_at DESC
LIMIT 15 OFFSET 0
Another version worth trying
SELECT *
FROM (
SELECT
cv.id,
ms.content AS last_message_content,
ms.created_at AS last_message_at,
row_number() over (partition by cv.id order by ms.created_at desc) rank
FROM conversations cv
JOIN messages ms on (cv.id = ms.conversation_id)
) t
WHERE t.rank = 1
ORDER BY t.last_message_at DESC
LIMIT 15 OFFSET 0;
Since we only need the first n
(15) conversations, if the database optimizer fails to figure out this (need to test against the actual data to see), it will cause full table scan.
To help the optimizer, we can tell it to select first n
rows from conversations
before joining.
WITH cv as (
SELECT id
, created_at
from conversations
ORDER BY created_at DESC
LIMIT 15 OFFSET 0
)
SELECT *
FROM (
SELECT
cv.id,
ms.content AS last_message_content,
ms.created_at AS last_message_at,
row_number() over (partition by cv.id order by ms.created_at desc) rank
FROM cv
JOIN messages ms on (cv.id = ms.conversation_id)
) t
WHERE t.rank = 1
ORDER BY t.last_message_at DESC
;
But with limitation that you will get the latest conversations
instead of the latest messages
. To fix this, you can add last_updated_at
to the conversations
.
Please note that an index on conversations.created_at
(or last_updated_at
if you want) is a critical factor here, please don't fotget to have one.
Bringing together a few of the different ideas in here together, the select distinct on with a subquery for sort, and using only the messages table since it houses all the required information.
SELECT
conversation_id AS id,
content AS last_message_content,
created_at AS last_message_at
FROM
(SELECT DISTINCT ON (conversation_id)
conversation_id,
content,
created_at
FROM messages
ORDER BY conversation_id, created_at DESC
LIMIT 15 OFFSET 0)
as a
ORDER BY created_at DESC
http://sqlfiddle.com/#!17/2decb/134/0 - cost of 16.9
I'm not sure how big the actual dataset it, and while a partition as used by others may perform well on the sqlfiddle available it could cause some longer run times over a larger set.
You could also set up the above as a CTE, depending on preference -
WITH SUBQUERY AS
(SELECT DISTINCT ON (conversation_id)
conversation_id,
content,
created_at
FROM messages
ORDER BY conversation_id, created_at DESC
LIMIT 15 OFFSET 0)
SELECT
conversation_id AS id,
content AS last_message_content,
created_at AS last_message_at
FROM SUBQUERY
ORDER BY created_at DESC
http://sqlfiddle.com/#!17/2decb/135/0 - cost of 17
If the join to conversations is required, this would work - http://sqlfiddle.com/#!17/2decb/144
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With