Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to rank in postgres query

Tags:

postgresql

I'm trying to rank a subset of data within a table but I think I am doing something wrong. I cannot find much information about the rank() feature for postgres, maybe I'm looking in the wrong place. Either way:

I'd like to know the rank of an id that falls within a cluster of a table based on a date. My query is as follows:

select cluster_id,feed_id,pub_date,rank  from (select feed_id,pub_date,cluster_id,rank()      over (order by pub_date asc) from url_info)  as bar where cluster_id = 9876 and feed_id = 1234; 

I'm modeling this after the following stackoverflow post: postgres rank

The reason I think I am doing something wrong is that there are only 39 rows in url_info that are in cluster_id 9876 and this query ran for 10 minutes and never came back. (actually re-ran it for quite a while and it returned no results, yet there is a row in cluster 9876 for id 1234) I'm expecting this will tell me something like "id 1234 was 5th for the criteria given). It will return a relative rank according to my query constraints, correct?

This is postgres 8.4 btw.

like image 992
WildBill Avatar asked Apr 22 '12 04:04

WildBill


People also ask

What is over () in PostgreSQL?

The OVER clause determines exactly how the rows of the query are split up for processing by the window function. The PARTITION BY clause within OVER divides the rows into groups, or partitions, that share the same values of the PARTITION BY expression(s).

What is dense rank in PostgreSQL?

In PostgreSQL, the DENSE_RANK() function is used to assign a rank to each row within a partition of a result set, with no gaps in ranking values. The DENSE_RANK() assigns a rank to every row in each partition of a result set.

What is rank query?

The RANK() function is a window function that assigns a rank to each row within a partition of a result set. The rows within a partition that have the same values will receive the same rank. The rank of the first row within a partition is one.


1 Answers

By placing the rank() function in the subselect and not specifying a PARTITION BY in the over clause or any predicate in that subselect, your query is asking to produce a rank over the entire url_info table ordered by pub_date. This is likely why it ran so long as to rank over all of url_info, Pg must sort the entire table by pub_date, which will take a while if the table is very large.

It appears you want to generate a rank for just the set of records selected by the where clause, in which case, all you need do is eliminate the subselect and the rank function is implicitly over the set of records matching that predicate.

select    cluster_id  ,feed_id  ,pub_date  ,rank() over (order by pub_date asc) as rank from url_info where cluster_id = 9876 and feed_id = 1234; 

If what you really wanted was the rank within the cluster, regardless of the feed_id, you can rank in a subselect which filters to that cluster:

select ranked.* from (   select      cluster_id    ,feed_id    ,pub_date    ,rank() over (order by pub_date asc) as rank   from url_info   where cluster_id = 9876 ) as ranked where feed_id = 1234; 
like image 57
dbenhur Avatar answered Oct 04 '22 08:10

dbenhur