alternative to SQL count subquery

Tags: sql, postgresql

I have the following query:

SELECT DISTINCT 
    e.id, 
    folder, 
    subject, 
    in_reply_to, 
    message_id, 
    "references", 
    e.updated_at,
    (
        select count(*)  
        from emails  
        where 
        (
            select "references"[1] 
            from emails 
            where message_id = e.message_id
        ) = ANY ("references") 
        or message_id = 
        (
            select "references"[1] 
            from emails 
            where message_id = e.message_id
        )
    )
FROM "emails" e
INNER JOIN "email_participants" 
    ON ("email_participants"."email_id" = e."id") 
WHERE (("user_id" = 220) 
AND ("folder" = 'INBOX')) 
ORDER BY e."updated_at" DESC 
LIMIT 10 OFFSET 0;

Here is the explain analyze output of the above query.

The query performed fine until I added the count subquery below:

(
    select count(*)  
    from emails  
    where 
    (
        select "references"[1] 
        from emails 
        where message_id = e.message_id
    ) = ANY ("references") 
    or message_id = 
    (
        select "references"[1] 
        from emails 
        where message_id = e.message_id
    )
)

In fact, I have tried simpler subqueries, and it seems to be the aggregate function itself that is taking the time.

Is there an alternative way to append the count to each result row? For example, should I update the results after the initial query has run?

Here is a pastebin that will create the table and also run the badly performing query at the end to display what the output should be.

asked May 09 '14 12:05 by dagda1


1 Answer

Expanding on Paul Guyot's answer, you could move the subquery into a derived table. This should perform faster because it fetches the message counts in one scan (plus a join), as opposed to one scan per row.

SELECT DISTINCT 
    e.id, 
    e.folder, 
    e.subject, 
    in_reply_to, 
    e.message_id, 
    e."references", 
    e.updated_at,
    t1.message_count
FROM "emails" e
INNER JOIN "email_participants" 
    ON ("email_participants"."email_id" = e."id") 
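INNER JOIN (
    -- For each email (e1), count the emails (e2) whose "references" contain
    -- e1's first reference, plus the email whose message_id is that reference.
    SELECT COUNT(e2.id) message_count, e1.message_id
    FROM emails e1
    LEFT JOIN emails e2 ON (ARRAY[e1."references"[1]] <@ e2."references"
    OR e2.message_id = e1."references"[1])
    GROUP BY e1.message_id
) t1 ON t1.message_id = e.message_id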
WHERE (("user_id" = 220) 
AND ("folder" = 'INBOX')) 
ORDER BY e."updated_at" DESC 
LIMIT 10 OFFSET 0;

Fiddle using pastebin data - http://www.sqlfiddle.com/#!15/c6298/7

Below are the query plans PostgreSQL produces for getting the count with a correlated subquery versus joining a derived table. I used one of my own tables, but I think the results should be similar.

Correlated Subquery

"Limit  (cost=0.00..1123641.81 rows=1000 width=8) (actual time=11.237..5395.237 rows=1000 loops=1)"
"  ->  Seq Scan on visit v  (cost=0.00..44996236.24 rows=40045 width=8) (actual time=11.236..5395.014 rows=1000 loops=1)"
"        SubPlan 1"
"          ->  Aggregate  (cost=1123.61..1123.62 rows=1 width=0) (actual time=5.393..5.393 rows=1 loops=1000)"
"                ->  Seq Scan on visit v2  (cost=0.00..1073.56 rows=20018 width=0) (actual time=0.002..4.280 rows=21393 loops=1000)"
"                      Filter: (company_id = v.company_id)"
"                      Rows Removed by Filter: 18653"
"Total runtime: 5395.369 ms"

Joining a Derived Table

"Limit  (cost=1173.74..1211.81 rows=1000 width=12) (actual time=21.819..22.629 rows=1000 loops=1)"
"  ->  Hash Join  (cost=1173.74..2697.72 rows=40036 width=12) (actual time=21.817..22.465 rows=1000 loops=1)"
"        Hash Cond: (v.company_id = visit.company_id)"
"        ->  Seq Scan on visit v  (cost=0.00..973.45 rows=40045 width=8) (actual time=0.010..0.198 rows=1000 loops=1)"
"        ->  Hash  (cost=1173.71..1173.71 rows=2 width=12) (actual time=21.787..21.787 rows=2 loops=1)"
"              Buckets: 1024  Batches: 1  Memory Usage: 1kB"
"              ->  HashAggregate  (cost=1173.67..1173.69 rows=2 width=4) (actual time=21.783..21.784 rows=3 loops=1)"
"                    ->  Seq Scan on visit  (cost=0.00..973.45 rows=40045 width=4) (actual time=0.003..6.695 rows=40046 loops=1)"
"Total runtime: 22.806 ms"
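
The two query shapes compared in the plans above can be sketched on a tiny stand-in for the visit table. This is only an illustration under assumptions: it uses Python's built-in sqlite3 module rather than PostgreSQL so that it runs without a server, the id column and the visit_count alias are hypothetical names, and the six sample rows are made up. It shows that both shapes return the same counts. It says nothing about timing, which depends on the planner and the data.

```python
import sqlite3

# Hypothetical stand-in for the "visit" table from the plans above.
# SQLite is used only so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visit (id INTEGER PRIMARY KEY, company_id INTEGER)")
conn.executemany(
    "INSERT INTO visit (company_id) VALUES (?)",
    [(1,), (1,), (1,), (2,), (2,), (3,)],  # made-up sample data
)

# Shape 1: correlated subquery. The inner COUNT refers to the outer row,
# which is what the "SubPlan 1 ... loops=1000" part of the first plan reflects.
correlated = conn.execute(
    """
    SELECT v.id,
           (SELECT COUNT(*)
              FROM visit v2
             WHERE v2.company_id = v.company_id) AS visit_count
    FROM visit v
    ORDER BY v.id
    """
).fetchall()

# Shape 2: derived table. Aggregate once per company_id, then join the
# counts back onto the rows.
derived = conn.execute(
    """
    SELECT v.id, t1.visit_count
    FROM visit v
    INNER JOIN (
        SELECT company_id, COUNT(*) AS visit_count
        FROM visit
        GROUP BY company_id
    ) t1 ON t1.company_id = v.company_id
    ORDER BY v.id
    """
).fetchall()

print(correlated == derived)
print(derived)
```

Both queries return one row per visit with the number of visits for that row's company, so the rewrite changes how the count is computed, not what it is.
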
answered Oct 12 '22 22:10 by FuzzyTree