Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to emulate SQLs rank functions in R?

Tags:

What is the R equivalent of rank functions like the Oracle ROW_NUMBER(), RANK(), or DENSE_RANK() ("assign integer values to the rows depending on their order"; see http://www.orafaq.com/node/55)?

I agree that the functionality of each function can potentially be achieved in an ad-hoc fashion. But my main concern is the performance. It would be good to avoid using join or indexing access, for the sake of memory and speed.

like image 331
Wei Avatar asked Jul 12 '12 06:07

Wei


People also ask

How do you write a rank function in SQL?

The rank of a row is one plus the number of ranks that come before the row in question. ROW_NUMBER and RANK are similar. ROW_NUMBER numbers all rows sequentially (for example 1, 2, 3, 4, 5). RANK provides the same numeric value for ties (for example 1, 2, 2, 4, 5).

What is the difference between the rank () and Dense_rank () functions SQL?

rank and dense_rank are similar to row_number , but when there are ties, they will give the same value to the tied values. rank will keep the ranking, so the numbering may go 1, 2, 2, 4 etc, whereas dense_rank will never give any gaps.

What is the difference between ROW_NUMBER () and Rank ()?

The row_number gives continuous numbers, while rank and dense_rank give the same rank for duplicates, but the next number in rank is as per continuous order so you will see a jump but in dense_rank doesn't have any gap in rankings.

What is Dense_rank in R?

dense_rank: dense_rank Window function: returns the rank of rows within a window partition, without any gaps. The difference between rank and dense_rank is that dense_rank leaves no gaps in ranking sequence when there are ties.


2 Answers

The data.table package, especially starting with version 1.8.1, offers much of the functionality of partition in SQL terms. rank(x, ties.method = "min") in R is similar to Oracle RANK(), and there's a way using factors (described below) to mimic the DENSE_RANK() function. A way to mimic ROW_NUMBER should be obvious by the end.

Here's an example: Load the latest version of data.table from R-Forge:

install.packages("data.table",   repos= c("http://R-Forge.R-project.org", getOption("repos")))  library(data.table) 

Create some example data:

set.seed(10)  DT<-data.table(ID=seq_len(4*3),group=rep(1:4,each=3),value=rnorm(4*3),   info=c(sample(c("a","b"),4*2,replace=TRUE),   sample(c("c","d"),4,replace=TRUE)),key="ID")  > DT     ID group       value info  1:  1     1  0.01874617    a  2:  2     1 -0.18425254    b  3:  3     1 -1.37133055    b  4:  4     2 -0.59916772    a  5:  5     2  0.29454513    b  6:  6     2  0.38979430    a  7:  7     3 -1.20807618    b  8:  8     3 -0.36367602    a  9:  9     3 -1.62667268    c 10: 10     4 -0.25647839    d 11: 11     4  1.10177950    c 12: 12     4  0.75578151    d 

Rank each ID by decreasing value within group (note the - in front of value to denote decreasing order):

> DT[,valRank:=rank(-value),by="group"]     ID group       value info valRank  1:  1     1  0.01874617    a       1  2:  2     1 -0.18425254    b       2  3:  3     1 -1.37133055    b       3  4:  4     2 -0.59916772    a       3  5:  5     2  0.29454513    b       2  6:  6     2  0.38979430    a       1  7:  7     3 -1.20807618    b       2  8:  8     3 -0.36367602    a       1  9:  9     3 -1.62667268    c       3 10: 10     4 -0.25647839    d       3 11: 11     4  1.10177950    c       1 12: 12     4  0.75578151    d       2 

For DENSE_RANK() with ties in the value being ranked, you could convert the value to a factor and then return the underlying integer values. For example, ranking each ID based on info within group (compare infoRank with infoRankDense):

DT[,infoRank:=rank(info,ties.method="min"),by="group"] DT[,infoRankDense:=as.integer(factor(info)),by="group"]  R> DT     ID group       value info valRank infoRank infoRankDense  1:  1     1  0.01874617    a       1        1             1  2:  2     1 -0.18425254    b       2        2             2  3:  3     1 -1.37133055    b       3        2             2  4:  4     2 -0.59916772    a       3        1             1  5:  5     2  0.29454513    b       2        3             2  6:  6     2  0.38979430    a       1        1             1  7:  7     3 -1.20807618    b       2        2             2  8:  8     3 -0.36367602    a       1        1             1  9:  9     3 -1.62667268    c       3        3             3 10: 10     4 -0.25647839    d       3        2             2 11: 11     4  1.10177950    c       1        1             1 12: 12     4  0.75578151    d       2        2             2 

p.s. Hi Matthew Dowle.


LEAD and LAG

For imitating LEAD and LAG, start with the answer provided here. I would create a rank variable based on the order of IDs within groups. This wouldn't be necessary with the fake data as above, but if the IDs are not in sequential order within groups, then this would make life a bit more difficult. So here's some new fake data with non-sequential IDs:

set.seed(10)  DT<-data.table(ID=sample(seq_len(4*3)),group=rep(1:4,each=3),value=rnorm(4*3),   info=c(sample(c("a","b"),4*2,replace=TRUE),   sample(c("c","d"),4,replace=TRUE)),key="ID")  DT[,idRank:=rank(ID),by="group"] setkey(DT,group, idRank)  > DT     ID group       value info idRank  1:  4     1 -0.36367602    b      1  2:  5     1 -1.62667268    b      2  3:  7     1 -1.20807618    b      3  4:  1     2  1.10177950    a      1  5:  2     2  0.75578151    a      2  6: 12     2 -0.25647839    b      3  7:  3     3  0.74139013    c      1  8:  6     3  0.98744470    b      2  9:  9     3 -0.23823356    a      3 10:  8     4 -0.19515038    c      1 11: 10     4  0.08934727    c      2 12: 11     4 -0.95494386    c      3 

Then to get the values of the previous 1 record, use the group and idRank variables and subtract 1 from the idRank and use the multi = 'last' argument. To get the value from the record two entries above, subtract 2.

DT[,prev:=DT[J(group,idRank-1), value, mult='last']] DT[,prev2:=DT[J(group,idRank-2), value, mult='last']]      ID group       value info idRank        prev      prev2  1:  4     1 -0.36367602    b      1          NA         NA  2:  5     1 -1.62667268    b      2 -0.36367602         NA  3:  7     1 -1.20807618    b      3 -1.62667268 -0.3636760  4:  1     2  1.10177950    a      1          NA         NA  5:  2     2  0.75578151    a      2  1.10177950         NA  6: 12     2 -0.25647839    b      3  0.75578151  1.1017795  7:  3     3  0.74139013    c      1          NA         NA  8:  6     3  0.98744470    b      2  0.74139013         NA  9:  9     3 -0.23823356    a      3  0.98744470  0.7413901 10:  8     4 -0.19515038    c      1          NA         NA 11: 10     4  0.08934727    c      2 -0.19515038         NA 12: 11     4 -0.95494386    c      3  0.08934727 -0.1951504 

For LEAD, add the appropriate offset to the idRank variable and switch to multi = 'first':

DT[,nex:=DT[J(group,idRank+1), value, mult='first']] DT[,nex2:=DT[J(group,idRank+2), value, mult='first']]      ID group       value info idRank        prev      prev2         nex       nex2  1:  4     1 -0.36367602    b      1          NA         NA -1.62667268 -1.2080762  2:  5     1 -1.62667268    b      2 -0.36367602         NA -1.20807618         NA  3:  7     1 -1.20807618    b      3 -1.62667268 -0.3636760          NA         NA  4:  1     2  1.10177950    a      1          NA         NA  0.75578151 -0.2564784  5:  2     2  0.75578151    a      2  1.10177950         NA -0.25647839         NA  6: 12     2 -0.25647839    b      3  0.75578151  1.1017795          NA         NA  7:  3     3  0.74139013    c      1          NA         NA  0.98744470 -0.2382336  8:  6     3  0.98744470    b      2  0.74139013         NA -0.23823356         NA  9:  9     3 -0.23823356    a      3  0.98744470  0.7413901          NA         NA 10:  8     4 -0.19515038    c      1          NA         NA  0.08934727 -0.9549439 11: 10     4  0.08934727    c      2 -0.19515038         NA -0.95494386         NA 12: 11     4 -0.95494386    c      3  0.08934727 -0.1951504          NA         NA 
like image 191
BenBarnes Avatar answered Sep 21 '22 08:09

BenBarnes


From data.table v1.9.5+, function frank() (for fast rank) has been implemented. frank() is useful in interactive scenarios, where as frankv() allows to easily program with.

It implements every operation available in base::rank. In addition, the advantages are:

  • frank() operates on list, data.frames and data.tables in addition to atomic vectors.

  • We can specify, for each column, whether rank should be computed on increasing or decreasing order.

  • It also implements rank type dense in addition to other types in base.

  • You can use - on a character column as well to rank by decreasing order.

Here's an illustration of all the above points using the same data.table DT from @BenBarnes' (excellent) post.

data:

require(data.table) set.seed(10) sample_n <- function(x, n) sample(x, n, replace=TRUE) DT <- data.table(         ID = seq_len(4*3),         group = rep(1:4,each=3),         value = rnorm(4*3),         info = c(sample_n(letters[1:2], 8), sample_n(letters[3:4], 4))) 

On single columns:

  • Compute dense rank:

    DT[, rank := frank(value, ties.method="dense"), by=group] 

You can also use the other methods min, max, random, average and first.

  • In decreasing order:

    DT[, rank := frank(-value, ties.method="dense"), by=group] 
  • Using frankv, similar to frank:

    # increasing order frankv(DT, "value", ties.method="dense")  # decreasing order frankv(DT, "value", order=-1L, ties.method="dense") 

On multiple columns

You can use .SD, which stands for Subset of Data and contains the data corresponding to that group. See the Introduction to data.table HTML vignette for more on .SD.

  • Rank by info, value columns while grouping by group:

    DT[, rank := frank(.SD,  info, value, ties.method="dense"), by=group] 
  • Use - to specify decreasing order:

    DT[, rank := frank(.SD,  info, -value, ties.method="dense"), by=group] 
  • You can also use - directly on character columns

    DT[, rank := frank(.SD, -info, -value, ties.method="dense"), by=group] 

You can use frankv similarly and provide the columns to cols argument and the order by which the columns should be ranked using the order argument.


Small benchmark to compare with base::rank:

set.seed(45L) x = sample(1e4, 1e7, TRUE) system.time(ans1 <- base::rank(x, ties.method="first")) #    user  system elapsed  #  22.200   0.255  22.536  system.time(ans2 <- frank(x, ties.method="first")) #    user  system elapsed  #   0.745   0.014   0.762  identical(ans1, ans2) # [1] TRUE 
like image 30
Arun Avatar answered Sep 18 '22 08:09

Arun