
Slow GroupAggregate in PostgreSQL

In PostgreSQL 9.2, I have a table of items that are being rated by users:

   id   | userid | itemid |    rating     |      timestamp      |       update_time
--------+--------+--------+---------------+---------------------+------------------------
 522241 | 3991   | 6887   |  0.1111111111 | 2005-06-20 03:13:56 | 2013-10-11 17:50:24.545
 522242 | 3991   | 6934   |  0.1111111111 | 2005-04-05 02:25:21 | 2013-10-11 17:50:24.545
 522243 | 3991   | 6936   | -0.1111111111 | 2005-03-31 03:17:25 | 2013-10-11 17:50:24.545
 522244 | 3991   | 6942   | -0.3333333333 | 2005-03-24 04:38:02 | 2013-10-11 17:50:24.545
 522245 | 3991   | 6951   | -0.5555555556 | 2005-06-20 03:15:35 | 2013-10-11 17:50:24.545
 ...    | ...    | ...    | ...           | ...                 | ...

I want to perform a very simple query: for each user, select the total number of ratings in the database.

I'm using the following straightforward approach:

SELECT userid, COUNT(*) AS rcount
FROM ratings
GROUP BY userid

The table contains 10M records. The query takes about 2 or 3 minutes. Honestly, I'm not satisfied with that, and I believe that 10M is not such a large number that the query should take so long. (Or is it?)
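
For context, a quick sanity check of how much data the query actually has to read is the table's on-disk size (schema-qualify the name if needed):

-- Physical size of the table including its indexes and TOAST data.
SELECT pg_size_pretty(pg_total_relation_size('ratings'));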

So I asked PostgreSQL to show me the execution plan:

EXPLAIN SELECT userid, COUNT(*) AS rcount
FROM ratings
GROUP BY userid

This results in:

GroupAggregate  (cost=1756177.54..1831423.30 rows=24535 width=5)
  ->  Sort  (cost=1756177.54..1781177.68 rows=10000054 width=5)
      Sort Key: userid
      ->  Seq Scan on ratings  (cost=0.00..183334.54 rows=10000054 width=5)

I read this as follows: first, the whole table is read from disk (seq scan). Second, it is sorted by userid in n*log(n) (sort). Finally, the sorted table is read row by row and aggregated in linear time. That doesn't seem like the optimal algorithm to me; if I were to implement it myself, I would use a hash table and build the result in a single pass. Never mind.
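
(Side note: PostgreSQL does have such a hash-based plan, HashAggregate, as EDIT 2 below shows; whether the planner chooses it depends on how many groups it expects and how much memory it is allowed to use. The group estimate comes from the column statistics, which can be inspected like this, assuming the table is on the search_path:)

-- Planner's estimate of distinct userids; a negative value means a fraction of the row count.
SELECT n_distinct
FROM pg_stats
WHERE tablename = 'ratings' AND attname = 'userid';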

It seems that it is the sorting by userid that takes so long, so I added an index:

CREATE INDEX ratings_userid_index ON ratings (userid)

Unfortunately, this didn't help and the performance remained the same. I definitely do not consider myself an advanced user and I believe I'm doing something fundamentally wrong. However, this is where I got stuck. I would appreciate any ideas on how to make the query execute in reasonable time. One more note: the PostgreSQL worker process uses 100% of one of my CPU cores during execution, suggesting that disk access is not the main bottleneck.
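
(A possible reason the index alone does not help: on 9.2 it could only avoid the sort via an index-only scan, which needs a reasonably fresh visibility map, and even then the planner may not find it cheaper. The following is worth trying, though it may well leave the plan unchanged:)

-- Refresh statistics and the visibility map, then re-check the plan.
VACUUM ANALYZE ratings;
EXPLAIN SELECT userid, COUNT(*) AS rcount FROM ratings GROUP BY userid;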

EDIT

As requested by @a_horse_with_no_name (wow, quite advanced for me):

EXPLAIN (analyze on, buffers on, verbose on)
SELECT userid,COUNT(userid) AS rcount
FROM movielens_10m.ratings
GROUP BY userId

Outputs:

GroupAggregate  (cost=1756177.54..1831423.30 rows=24535 width=5) (actual time=110666.899..127168.304 rows=69878 loops=1)
  Output: userid, count(userid)
  Buffers: shared hit=906 read=82433, temp read=19358 written=19358
  ->  Sort  (cost=1756177.54..1781177.68 rows=10000054 width=5) (actual time=110666.838..125180.683 rows=10000054 loops=1)
        Output: userid
        Sort Key: ratings.userid
        Sort Method: external merge  Disk: 154840kB
        Buffers: shared hit=906 read=82433, temp read=19358 written=19358
        ->  Seq Scan on movielens_10m.ratings  (cost=0.00..183334.54 rows=10000054 width=5) (actual time=0.019..2889.583 rows=10000054 loops=1)
              Output: userid
              Buffers: shared hit=901 read=82433
Total runtime: 127193.524 ms
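
The telltale line here is "Sort Method: external merge  Disk: 154840kB": the sort does not fit in memory and spills to disk. The memory budget for a single sort or hash is the work_mem setting, which can be checked with:

SHOW work_mem;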

EDIT 2

@a_horse_with_no_name's comment solved the problem. I'm happy to share my findings:

SET work_mem = '1MB';
EXPLAIN SELECT userid,COUNT(userid) AS rcount
FROM movielens_10m.ratings
GROUP BY userId

produces the same plan as above:

GroupAggregate  (cost=1756177.54..1831423.30 rows=24535 width=5)
  ->  Sort  (cost=1756177.54..1781177.68 rows=10000054 width=5)
      Sort Key: userid
      ->  Seq Scan on ratings  (cost=0.00..183334.54 rows=10000054 width=5)

However,

SET work_mem = '10MB';
EXPLAIN SELECT userid,COUNT(userid) AS rcount
FROM movielens_10m.ratings
GROUP BY userId

gives

HashAggregate  (cost=233334.81..233580.16 rows=24535 width=5)
  ->  Seq Scan on ratings  (cost=0.00..183334.54 rows=10000054 width=5)

The query now takes only about 3.5 seconds to complete.
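
Note that SET work_mem only lasts for the current session. A sketch of making it persistent, with placeholder database and role names; the right value depends on how many sorts and hashes may run concurrently:

-- Per session (as above):
SET work_mem = '10MB';

-- Persistent defaults for one database or one role (names are placeholders):
ALTER DATABASE movielens SET work_mem = '10MB';
ALTER ROLE report_user SET work_mem = '10MB';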

Asked Dec 14 '13 by Tregoreg


1 Answer

Consider how your query could possibly return a result... You could build a hash keyed by userid and increment a counter per key; or you could sort all rows by userid and count them as you go. Computationally, the latter option is cheaper. That is what Postgres does.

Then consider how to sort the data, taking disk IO into account. One option is to open disk pages A, B, C, D, etc., and sort the rows by userid in memory: in other words, a seq scan followed by a sort. The other option, called an index scan, is to pull the rows in order by using the index: visit page B, then D, then A, then B again, A again, C, ad nauseam.

An index scan is efficient when pulling a handful of rows in order; much less so when fetching many rows in order, let alone all of them. As such, the plan you're getting is the optimal one:

  1. Plough through all rows (seq scan)
  2. Sort the rows by the grouping criteria
  3. Count the rows per group

Trouble is, you're sorting roughly 10 million rows in order to count them by userid. Nothing will make things faster short of investing in more RAM and super fast SSDs.

You can, however, avoid this query altogether (rough sketches of the first two options follow the list). Either:

  • Count ratings only for the handful of users you actually need, using a WHERE clause, instead of pulling the entire set;
  • Add a ratings_count field to your users table and use triggers on ratings to maintain the count; or
  • Use a materialized view (built into PostgreSQL from 9.3; on 9.2 it would be a manually refreshed summary table), if an approximate, slightly stale count is good enough.
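
A minimal sketch of the first two options, assuming a users table keyed by userid; all names are illustrative, and the trigger ignores UPDATEs that change userid:

-- Option 1: count only the users you actually need (the ids are placeholders).
SELECT userid, COUNT(*) AS rcount
FROM ratings
WHERE userid IN (3991, 4242)
GROUP BY userid;

-- Option 2: maintain a counter with triggers.
ALTER TABLE users ADD COLUMN ratings_count bigint NOT NULL DEFAULT 0;

CREATE FUNCTION ratings_count_trg() RETURNS trigger AS $$
BEGIN
  IF TG_OP = 'INSERT' THEN
    UPDATE users SET ratings_count = ratings_count + 1 WHERE userid = NEW.userid;
    RETURN NEW;
  ELSE  -- DELETE
    UPDATE users SET ratings_count = ratings_count - 1 WHERE userid = OLD.userid;
    RETURN OLD;
  END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ratings_count_trg
AFTER INSERT OR DELETE ON ratings
FOR EACH ROW EXECUTE PROCEDURE ratings_count_trg();

One backfill UPDATE of users.ratings_count from the existing ratings is needed before relying on the counter.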

Answered by Denis de Bernardy