In PostgreSQL 9.2, I have a table of items that are being rated by users:
   id   | userid | itemid |    rating     |      timestamp      |       update_time
--------+--------+--------+---------------+---------------------+------------------------
 522241 |   3991 |   6887 |  0.1111111111 | 2005-06-20 03:13:56 | 2013-10-11 17:50:24.545
 522242 |   3991 |   6934 |  0.1111111111 | 2005-04-05 02:25:21 | 2013-10-11 17:50:24.545
 522243 |   3991 |   6936 | -0.1111111111 | 2005-03-31 03:17:25 | 2013-10-11 17:50:24.545
 522244 |   3991 |   6942 | -0.3333333333 | 2005-03-24 04:38:02 | 2013-10-11 17:50:24.545
 522245 |   3991 |   6951 | -0.5555555556 | 2005-06-20 03:15:35 | 2013-10-11 17:50:24.545
    ... |    ... |    ... |           ... |                 ... |                     ...
I want to perform a very simple query: for each user, select the total number of ratings in the database.
I'm using the following straightforward approach:
SELECT userid, COUNT(*) AS rcount
FROM ratings
GROUP BY userid
The table contains 10M records. The query takes... well, about 2 or 3 minutes. Honestly, I'm not satisfied with that, and I believe that 10M rows is not such a large number for the query to take that long. (Or is it?)
So I asked PostgreSQL to show me the execution plan:
EXPLAIN SELECT userid, COUNT(*) AS rcount
FROM ratings
GROUP BY userid
This results in:
GroupAggregate (cost=1756177.54..1831423.30 rows=24535 width=5)
-> Sort (cost=1756177.54..1781177.68 rows=10000054 width=5)
Sort Key: userid
-> Seq Scan on ratings (cost=0.00..183334.54 rows=10000054 width=5)
I read this as follows: first, the whole table is read from disk (seq scan). Second, it is sorted by userid in n*log(n) time (sort). Finally, the sorted table is read row by row and aggregated in linear time. That is not exactly the optimal algorithm, I think; if I were to implement it myself, I would use a hash table and build the result in a single pass. Never mind.
It seems that it is the sort by userid which takes so long, so I added an index:
CREATE INDEX ratings_userid_index ON ratings (userid)
Unfortunately, this didn't help and the performance remained the same. I definitely do not consider myself an advanced user, and I believe I'm doing something fundamentally wrong, but this is where I got stuck. I would appreciate any ideas on how to make the query execute in reasonable time. One more note: the PostgreSQL worker process uses 100% of one of my CPU cores during execution, which suggests that disk access is not the main bottleneck.
EDIT
As requested by @a_horse_with_no_name. Wow, quite advanced for me:
EXPLAIN (analyze on, buffers on, verbose on)
SELECT userid,COUNT(userid) AS rcount
FROM movielens_10m.ratings
GROUP BY userId
Outputs:
GroupAggregate (cost=1756177.54..1831423.30 rows=24535 width=5) (actual time=110666.899..127168.304 rows=69878 loops=1)
Output: userid, count(userid)
Buffers: shared hit=906 read=82433, temp read=19358 written=19358
-> Sort (cost=1756177.54..1781177.68 rows=10000054 width=5) (actual time=110666.838..125180.683 rows=10000054 loops=1)
Output: userid
Sort Key: ratings.userid
Sort Method: external merge Disk: 154840kB
Buffers: shared hit=906 read=82433, temp read=19358 written=19358
-> Seq Scan on movielens_10m.ratings (cost=0.00..183334.54 rows=10000054 width=5) (actual time=0.019..2889.583 rows=10000054 loops=1)
Output: userid
Buffers: shared hit=901 read=82433
Total runtime: 127193.524 ms
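The "Sort Method: external merge Disk: 154840kB" line means the sort spilled to disk. The memory budget that controls this is the work_mem setting, which can be checked with a standard command (a sketch, not part of the output above):
SHOW work_mem;  -- per-sort/per-hash memory budget; the default in 9.2 is 1MB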
EDIT 2
@a_horse_with_no_name's comment solved the problem. I am happy to share my findings:
SET work_mem = '1MB';
EXPLAIN SELECT userid,COUNT(userid) AS rcount
FROM movielens_10m.ratings
GROUP BY userId
produces the same plan as above:
GroupAggregate (cost=1756177.54..1831423.30 rows=24535 width=5)
-> Sort (cost=1756177.54..1781177.68 rows=10000054 width=5)
Sort Key: userid
-> Seq Scan on ratings (cost=0.00..183334.54 rows=10000054 width=5)
However,
SET work_mem = '10MB';
EXPLAIN SELECT userid,COUNT(userid) AS rcount
FROM movielens_10m.ratings
GROUP BY userId
gives
HashAggregate (cost=233334.81..233580.16 rows=24535 width=5)
-> Seq Scan on ratings (cost=0.00..183334.54 rows=10000054 width=5)
The query now only takes about 3.5 seconds to complete.
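For reference, work_mem does not have to be raised globally; a sketch of the usual ways to scope the change (the role and database names below are placeholders, not something from the original setup):
-- For the current session only (what EDIT 2 above does):
SET work_mem = '10MB';

-- For the current transaction only:
BEGIN;
SET LOCAL work_mem = '10MB';
SELECT userid, COUNT(userid) AS rcount
FROM movielens_10m.ratings
GROUP BY userid;
COMMIT;

-- Persistently for a given role or database (placeholder names),
-- taking effect in new sessions:
ALTER ROLE report_user SET work_mem = '10MB';
ALTER DATABASE movielens SET work_mem = '10MB';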
Consider how your query could possibly return a result... You could build a variable-length hash and create/increment its values, or you could sort all rows by userid and count them as you go. Computationally, the latter option is cheaper. That is what Postgres does.
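To see both strategies on the same data, you can nudge the planner with its enable_hashagg switch; a quick experiment (the setting is a real planner GUC, but this session is only an illustration, and the planner will still fall back to the sort if work_mem is too small for the hash):
-- Disallow hash aggregation: the planner falls back to Sort + GroupAggregate.
SET enable_hashagg = off;
EXPLAIN SELECT userid, COUNT(*) AS rcount FROM ratings GROUP BY userid;

-- Allow it again: with enough work_mem the planner picks HashAggregate.
SET enable_hashagg = on;
EXPLAIN SELECT userid, COUNT(*) AS rcount FROM ratings GROUP BY userid;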
Then consider how to sort the data, taking disk IO into account. One option is to open disk pages A, B, C, D, etc. and then sort the rows by userid in memory; in other words, a seq scan followed by a sort. The other option, called an index scan, would be to pull rows in order by using an index: visit page B, then D, then A, then B again, A again, C, and so on ad nauseam.
An index scan is efficient when pulling a handful of rows in order; much less so when fetching many rows in order, let alone all rows in order. As such, the plan you're getting (the GroupAggregate above) is the optimal one.
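In contrast, a query that only needs a handful of one user's rows can use the index created earlier; a sketch using a userid from the sample data (the exact plan you get depends on your statistics):
-- A selective query like this can use ratings_userid_index instead of a full sort.
EXPLAIN SELECT COUNT(*) FROM ratings WHERE userid = 3991;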
Trouble is, you're sorting roughly 10 million rows in order to count them by userid. Nothing will make things faster short of investing in more RAM and super fast SSDs.
You can, however, avoid this query altogether. Either:
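The list of alternatives is cut off above, so purely as an illustration (an assumption on my part, not necessarily what the answer goes on to suggest): one common way to avoid re-aggregating 10 million rows is a per-user counter table kept current by a trigger.
-- Hypothetical names throughout; none of this is from the original post.
CREATE TABLE user_rating_counts (
    userid integer PRIMARY KEY,
    rcount bigint NOT NULL DEFAULT 0
);

-- Seed it once from the existing data.
INSERT INTO user_rating_counts (userid, rcount)
SELECT userid, COUNT(*)
FROM ratings
GROUP BY userid;

-- Keep it current on inserts and deletes.
-- (Not safe against two sessions inserting a brand-new user's first rating
--  at the same time; good enough as a sketch.)
CREATE OR REPLACE FUNCTION bump_user_rating_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        UPDATE user_rating_counts SET rcount = rcount + 1 WHERE userid = NEW.userid;
        IF NOT FOUND THEN
            INSERT INTO user_rating_counts (userid, rcount) VALUES (NEW.userid, 1);
        END IF;
        RETURN NEW;
    ELSIF TG_OP = 'DELETE' THEN
        UPDATE user_rating_counts SET rcount = rcount - 1 WHERE userid = OLD.userid;
        RETURN OLD;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ratings_count_trigger
AFTER INSERT OR DELETE ON ratings
FOR EACH ROW EXECUTE PROCEDURE bump_user_rating_count();

-- The original query then becomes a trivial lookup:
SELECT userid, rcount FROM user_rating_counts;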