Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ActiveRecord OR operator slows down query by factor of 10. Why?

I have an ActiveRecord query that uses the OR operator to chain together 2 queries. The results come back fine, but the speed of executing the combined query is ~10 times as slow as executing either of the 2 queries in on their own.

We have an Event model and an Invitation model. A User can be invited to an Event by being targeted through an invitation filter, or by being individually invited by having an Invitation record.

So when determining how many users are invited to a particular event, we have to look at all those with Invitations and all those matching the filter. We do that here:

@invited_count = @invited_by_individual.or(@invited_by_filter).distinct.count(:id)

It's important to note, both @invited_by_individual and @invited_by_filter relations have references and includes statements within them.

Now, the problem is that when we execute that query, it takes about 1200ms. If we were to do the queries individually, each of them only take about 80ms. So @invited_by_filter.distinct.count and @invited_by_individual.distinct.count both return results in about 80ms, but neither of these is complete on its own.

Is there any way to speed up the query with the OR operator? Why is this happening in the first place?

Here is the SQL generated by the ActiveRecord queries:

Fast, single query:

(79.7ms)  
SELECT COUNT(DISTINCT "users"."id") 
FROM "users" 
LEFT OUTER JOIN "invitations" 
ON "invitations"."user_id" = "users"."id" 
WHERE "invitations"."event_id" = $1  [["event_id", 732]]

Slow, with combined query:

(1220.7ms)  
SELECT COUNT(DISTINCT "users"."id") 
FROM "users" 
LEFT OUTER JOIN "invitations" 
ON "invitations"."user_id" = "users"."id" 
WHERE ("invitations"."event_id" = $1 OR "users"."organization_id" = $2)  [["event_id", 732], ["organization_id", 13]]

Update, here's the EXPLAIN:

(1418.2ms)  SELECT COUNT(DISTINCT "users"."id") FROM "users" LEFT OUTER JOIN "invitations" ON "invitations"."user_id" = "users"."id" WHERE ("users"."root_organization_id" = $1 OR "invitations"."event_id" = $2)  [["root_organization_id", -1], ["event_id", 749]]
 => 
EXPLAIN for: SELECT COUNT(DISTINCT "users"."id") FROM "users" LEFT OUTER JOIN "invitations" ON "invitations"."user_id" = "users"."id" WHERE ("users"."root_organization_id" = $1 OR "invitations"."event_id" = $2) [["root_organization_id", -1], ["event_id", 749]]

 #=> QUERY PLAN
                                                     
 Aggregate  (cost=121781.56..121781.57 rows=1 width=8)
   ->  Hash Right Join  (cost=113248.88..121778.64 rows=1165 width=8)
         Hash Cond: (invitations.user_id = users.id)
         Filter: ((users.root_organization_id = '-1'::integer) OR (invitations.event_id = 749))
         ->  Seq Scan on invitations  (cost=0.00..1299.70 rows=63470 width=8)
         ->  Hash  (cost=93513.28..93513.28 rows=1135328 width=12)
               ->  Seq Scan on users  (cost=0.00..93513.28 rows=1135328 width=12)
(7 rows)

Update 2, EXPLAIN for queries ran individually, does use the indices:

(91.5ms)  SELECT COUNT(*) FROM "users" INNER JOIN "invitations" ON "invitations"."user_id" = "users"."id" WHERE "users"."root_organization_id" = $1  [["root_organization_id", -1]]
 => 
EXPLAIN for: SELECT COUNT(*) FROM "users" INNER JOIN "invitations" ON "invitations"."user_id" = "users"."id" WHERE "users"."root_organization_id" = $1 [["root_organization_id", -1]]

 #=> QUERY PLAN

 Aggregate  (cost=19.05..19.06 rows=1 width=8)
   ->  Nested Loop  (cost=0.72..19.05 rows=1 width=0)
         ->  Index Scan using index_users_on_root_organization_id on users  (cost=0.43..4.45 rows=1 width=8)
               Index Cond: (root_organization_id = '-1'::integer)
         ->  Index Only Scan using index_invitations_on_user_id on invitations  (cost=0.29..14.57 rows=3 width=4)
               Index Cond: (user_id = users.id)
(6 rows)

and

EXPLAIN for: SELECT COUNT(DISTINCT "users"."id") FROM "users" LEFT OUTER JOIN "invitations" ON "invitations"."user_id" = "users"."id" WHERE "invitations"."event_id" = $1 [["event_id", 749]]

 #=> QUERY PLAN

 Aggregate  (cost=536.34..536.35 rows=1 width=8)
   ->  Nested Loop  (cost=0.72..536.19 rows=62 width=8)
         ->  Index Scan using index_invitations_on_event_id on invitations  (cost=0.29..11.98 rows=62 width=4)
               Index Cond: (event_id = 749)
         ->  Index Only Scan using users_pkey on users  (cost=0.43..8.45 rows=1 width=8)
               Index Cond: (id = invitations.user_id)
(6 rows)
like image 832
D-Nice Avatar asked May 22 '21 23:05

D-Nice


2 Answers

UNION enables you to leverage both indexes while still preventing duplicates.

User.from(
"(#{@invited_by_individual.to_sql} 
UNION 
#{@invited_by_filter.to_sql})"
).count
like image 190
Lam Phan Avatar answered Oct 11 '22 18:10

Lam Phan


This is your query that uses OR:

SELECT COUNT(DISTINCT "users"."id") 
FROM "users" 
LEFT OUTER JOIN "invitations" 
ON "invitations"."user_id" = "users"."id" 
WHERE ("invitations"."event_id" = $1 OR "users"."organization_id" = $2)  

If you try the following query in Postgres I expect it to produce the same result, but work faster:

SELECT
    COUNT(DISTINCT id) AS cc
FROM
    (
        SELECT
            "invitations"."user_id" AS id
        FROM
            "invitations"
        WHERE
            ("invitations"."event_id" = $1)

        UNION ALL

        SELECT
            "users"."id"
        FROM
            "users" 
        WHERE
            ("users"."organization_id" = $2)
    ) AS T
;

If you have indexes on "invitations"."event_id" and on "users"."organization_id", engine should use them. If you don't have such indexes, create them.

The query with OR is slow, because the optimizer is not smart enough to perform this translation and split original query into two parts. When you run each part separately the engine sees that it can use an appropriate index. When the query joins two tables and have an OR condition in the WHERE filter no single index can return the rows needed, so the engine doesn't attempt to use any index. It reads all 1135328 rows from the users table and it reads all 63470 rows from the invitations table. Naturally, it is slow.

I have no idea how to translate this query to the ActiveRecord syntax.

like image 27
Vladimir Baranov Avatar answered Oct 11 '22 16:10

Vladimir Baranov