Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Storing 'Rank' for Contests in Postgres

I'm trying to determine if there a "low cost" optimization for the following query. We've implemented a system whereby 'tickets' earn 'points' and thus can be ranked. In order to support analytical type of queries, we store the rank of every ticket (tickets can be tied) with the ticket.

I've found that, at scale, updating this rank is very slow. I'm attempting to run the scenario below on a set of "tickets" that is about 20k tickets big.

I'm hoping that someone can help identify why and offer some help.

We're on postgres 9.3.6

Here's a simplified ticket table schema:

ogs_1=> \d api_ticket
                                             Table "public.api_ticket"
            Column            |           Type           |                        Modifiers                        
------------------------------+--------------------------+---------------------------------------------------------
 id                           | integer                  | not null default nextval('api_ticket_id_seq'::regclass)
 status                       | character varying(3)     | not null
 points_earned                | integer                  | not null
 rank                         | integer                  | not null
 event_id                     | integer                  | not null
 user_id                      | integer                  | not null
Indexes:
    "api_ticket_pkey" PRIMARY KEY, btree (id)
    "api_ticket_4437cfac" btree (event_id)
    "api_ticket_e8701ad4" btree (user_id)
    "api_ticket_points_earned_idx" btree (points_earned)
    "api_ticket_rank_idx" btree ("rank")
Foreign-key constraints:
    "api_ticket_event_id_598c97289edc0e3e_fk_api_event_id" FOREIGN KEY (event_id) REFERENCES api_event(id) DEFERRABLE INITIALLY DEFERRED
(user_id) REFERENCES auth_user(id) DEFERRABLE INITIALLY DEFERRED

Here's the query that I'm executing:

UPDATE api_ticket t SET rank = (
  SELECT rank
  FROM (SELECT Rank() over (
      Partition BY event_id ORDER BY points_earned DESC
    ) as rank, id
    FROM api_ticket tt
    WHERE event_id = t.event_id
      AND tt.status != 'x'
  ) as r
  WHERE r.id = t.id
)
WHERE event_id = <EVENT_ID> AND t.status != 'x';

Here's the explain on a set of about 10k rows:

Update on api_ticket t  (cost=0.00..1852176.70 rows=9646 width=88) (actual time=1254035.623..1254035.623 rows=0 loops=1)
   ->  Seq Scan on api_ticket t  (cost=0.00..1852176.70 rows=9646 width=88) (actual time=121.611..1253148.416 rows=9748 loops=1)
         Filter: (((status)::text <> 'x'::text) AND (event_id = 207))
         Rows Removed by Filter: 10
         SubPlan 1
           ->  Subquery Scan on r  (cost=159.78..191.97 rows=1 width=8) (actual time=87.466..128.537 rows=1 loops=9748)
                 Filter: (r.id = t.id)
                 Rows Removed by Filter: 9747
                 ->  WindowAgg  (cost=159.78..178.55 rows=1073 width=12) (actual time=46.389..108.954 rows=9748 loops=9748)
                       ->  Sort  (cost=159.78..162.46 rows=1073 width=12) (actual time=46.370..66.163 rows=9748 loops=9748)
                             Sort Key: tt.points_earned
                             Sort Method: quicksort  Memory: 799kB
                             ->  Index Scan using api_ticket_4437cfac on api_ticket tt  (cost=0.29..105.77 rows=1073 width=12) (actual time=2.698..26.448 rows=9748 loops=9748)
                                   Index Cond: (event_id = t.event_id)
                                   Filter: ((status)::text <> 'x'::text)
 Total runtime: 1254036.583 ms
like image 812
Brock Haywood Avatar asked Sep 28 '22 06:09

Brock Haywood


1 Answers

The correlated subquery has to be executed for every row (20k times in your example). This only makes sense for a small number of rows or where the computation requires it.

This derived table is computed once before we join to it:

UPDATE api_ticket t
SET    rank = tt.rnk
FROM  (
   SELECT tt.id
        , rank() OVER (PARTITION BY tt.event_id
                       ORDER BY tt.points_earned DESC) AS rnk
   FROM   api_ticket tt
   WHERE  tt.status <> 'x'
   AND    tt.event_id = <EVENT_ID>
   ) tt
WHERE t.id = tt.id
AND   t.rank <> tt.rnk;  -- avoid empty updates

Should be quite a bit faster. :)

Additional improvements

The last predicate rules out empty updates:

  • How do I (or can I) SELECT DISTINCT on multiple columns?

Only makes sense if the new rank can be the old rank at least occasionally. Else remove it.

We don't need to repeat AND t.status != 'x' in the outer query since we join on the PK column id it's the same value on both sides.
And the standard SQL inequality operator is <>, even if Postgres supports !=, too.

Push down the predicate event_id = <EVENT_ID> into the subquery as well. No need to compute numbers for any other event_id. This was handed down from the outer query in your original. In the rewritten query, we best apply it in the subquery altogether. Since we use PARTITION BY tt.event_id, that's not going to mess with ranks.

like image 69
Erwin Brandstetter Avatar answered Oct 19 '22 19:10

Erwin Brandstetter