What is the R equivalent of rank functions like the Oracle <code>ROW_NUMBER()</code>, <code>RANK()</code>, or <code>DENSE_RANK()</code> ("assign integer values to the rows depending on their order"; see http://www.orafaq.com/node/55)? I agree that the functionality of each function can potentially be achieved in an ad-hoc fashion. But my main concern is the performance. It would be good to avoid using join or indexing access, for the sake of memory and speed.

From <code>data.table v1.9.5+</code>, function <code>frank()</code> (for fast rank) has been implemented. <code>frank()</code> is useful in interactive scenarios, where as <code>frankv()</code> allows to easily program with. It implements every operation available in <code>base::rank</code>. In addition, the advantages are: <ul> <li><code>frank()</code> operates on list, data.frames and data.tables in addition to atomic vectors.</li> <li>We can specify, for each column, whether rank should be computed on increasing or decreasing order. </li> <li>It also implements rank type <code>dense</code> in addition to other types in <code>base</code>.</li> <li>You can use <code>-</code> on a character column as well to rank by decreasing order.</li> </ul> Here's an illustration of all the above points using the same data.table <code>DT</code> from @BenBarnes' (excellent) post. <h3>data:</h3> <pre class="prettyprint"><code>require(data.table) set.seed(10) sample_n <- function(x, n) sample(x, n, replace=TRUE) DT <- data.table( ID = seq_len(4*3), group = rep(1:4,each=3), value = rnorm(4*3), info = c(sample_n(letters[1:2], 8), sample_n(letters[3:4], 4))) </code></pre> <h3>On single columns:</h3> <ul> <li> Compute <code>dense</code> rank: <pre class="prettyprint"><code>DT[, rank := frank(value, ties.method="dense"), by=group] </code></pre> </li> </ul> You can also use the other methods <code>min</code>, <code>max</code>, <code>random</code>, <code>average</code> and <code>first</code>. <ul> <li> In decreasing order: <pre class="prettyprint"><code>DT[, rank := frank(-value, ties.method="dense"), by=group] </code></pre> </li> <li> Using <code>frankv</code>, similar to <code>frank</code>: <pre class="prettyprint"><code># increasing order frankv(DT, "value", ties.method="dense") # decreasing order frankv(DT, "value", order=-1L, ties.method="dense") </code></pre> </li> </ul> <h3>On multiple columns</h3> You can use <code>.SD</code>, which stands for Subset of Data and contains the data corresponding to that group. See the Introduction to data.table HTML vignette for more on <code>.SD</code>. <ul> <li> Rank by <code>info, value</code> columns while grouping by <code>group</code>: <pre class="prettyprint"><code>DT[, rank := frank(.SD, info, value, ties.method="dense"), by=group] </code></pre> </li> <li> Use <code>-</code> to specify decreasing order: <pre class="prettyprint"><code>DT[, rank := frank(.SD, info, -value, ties.method="dense"), by=group] </code></pre> </li> <li> You can also use <code>-</code> directly on character columns <pre class="prettyprint"><code>DT[, rank := frank(.SD, -info, -value, ties.method="dense"), by=group] </code></pre> </li> </ul> You can use <code>frankv</code> similarly and provide the columns to <code>cols</code> argument and the order by which the columns should be ranked using the <code>order</code> argument. <hr> Small benchmark to compare with <code>base::rank</code>: <pre class="prettyprint"><code>set.seed(45L) x = sample(1e4, 1e7, TRUE) system.time(ans1 <- base::rank(x, ties.method="first")) # user system elapsed # 22.200 0.255 22.536 system.time(ans2 <- frank(x, ties.method="first")) # user system elapsed # 0.745 0.014 0.762 identical(ans1, ans2) # [1] TRUE </code></pre>

How to emulate SQLs rank functions in R?

Tags:

What is the R equivalent of rank functions like the Oracle ROW_NUMBER(), RANK(), or DENSE_RANK() ("assign integer values to the rows depending on their order"; see http://www.orafaq.com/node/55)?

I agree that the functionality of each function can potentially be achieved in an ad-hoc fashion. But my main concern is the performance. It would be good to avoid using join or indexing access, for the sake of memory and speed.

331

asked Jul 12 '12 06:07

Wei

2 Answers

The data.table package, especially starting with version 1.8.1, offers much of the functionality of partition in SQL terms. rank(x, ties.method = "min") in R is similar to Oracle RANK(), and there's a way using factors (described below) to mimic the DENSE_RANK() function. A way to mimic ROW_NUMBER should be obvious by the end.

Here's an example: Load the latest version of data.table from R-Forge:

install.packages("data.table",   repos= c("http://R-Forge.R-project.org", getOption("repos")))  library(data.table)

Create some example data:

set.seed(10)  DT<-data.table(ID=seq_len(4*3),group=rep(1:4,each=3),value=rnorm(4*3),   info=c(sample(c("a","b"),4*2,replace=TRUE),   sample(c("c","d"),4,replace=TRUE)),key="ID")  > DT     ID group       value info  1:  1     1  0.01874617    a  2:  2     1 -0.18425254    b  3:  3     1 -1.37133055    b  4:  4     2 -0.59916772    a  5:  5     2  0.29454513    b  6:  6     2  0.38979430    a  7:  7     3 -1.20807618    b  8:  8     3 -0.36367602    a  9:  9     3 -1.62667268    c 10: 10     4 -0.25647839    d 11: 11     4  1.10177950    c 12: 12     4  0.75578151    d

Rank each ID by decreasing value within group (note the - in front of value to denote decreasing order):

> DT[,valRank:=rank(-value),by="group"]     ID group       value info valRank  1:  1     1  0.01874617    a       1  2:  2     1 -0.18425254    b       2  3:  3     1 -1.37133055    b       3  4:  4     2 -0.59916772    a       3  5:  5     2  0.29454513    b       2  6:  6     2  0.38979430    a       1  7:  7     3 -1.20807618    b       2  8:  8     3 -0.36367602    a       1  9:  9     3 -1.62667268    c       3 10: 10     4 -0.25647839    d       3 11: 11     4  1.10177950    c       1 12: 12     4  0.75578151    d       2

For DENSE_RANK() with ties in the value being ranked, you could convert the value to a factor and then return the underlying integer values. For example, ranking each ID based on info within group (compare infoRank with infoRankDense):

DT[,infoRank:=rank(info,ties.method="min"),by="group"] DT[,infoRankDense:=as.integer(factor(info)),by="group"]  R> DT     ID group       value info valRank infoRank infoRankDense  1:  1     1  0.01874617    a       1        1             1  2:  2     1 -0.18425254    b       2        2             2  3:  3     1 -1.37133055    b       3        2             2  4:  4     2 -0.59916772    a       3        1             1  5:  5     2  0.29454513    b       2        3             2  6:  6     2  0.38979430    a       1        1             1  7:  7     3 -1.20807618    b       2        2             2  8:  8     3 -0.36367602    a       1        1             1  9:  9     3 -1.62667268    c       3        3             3 10: 10     4 -0.25647839    d       3        2             2 11: 11     4  1.10177950    c       1        1             1 12: 12     4  0.75578151    d       2        2             2

p.s. Hi Matthew Dowle.

LEAD and LAG

For imitating LEAD and LAG, start with the answer provided here. I would create a rank variable based on the order of IDs within groups. This wouldn't be necessary with the fake data as above, but if the IDs are not in sequential order within groups, then this would make life a bit more difficult. So here's some new fake data with non-sequential IDs:

set.seed(10)  DT<-data.table(ID=sample(seq_len(4*3)),group=rep(1:4,each=3),value=rnorm(4*3),   info=c(sample(c("a","b"),4*2,replace=TRUE),   sample(c("c","d"),4,replace=TRUE)),key="ID")  DT[,idRank:=rank(ID),by="group"] setkey(DT,group, idRank)  > DT     ID group       value info idRank  1:  4     1 -0.36367602    b      1  2:  5     1 -1.62667268    b      2  3:  7     1 -1.20807618    b      3  4:  1     2  1.10177950    a      1  5:  2     2  0.75578151    a      2  6: 12     2 -0.25647839    b      3  7:  3     3  0.74139013    c      1  8:  6     3  0.98744470    b      2  9:  9     3 -0.23823356    a      3 10:  8     4 -0.19515038    c      1 11: 10     4  0.08934727    c      2 12: 11     4 -0.95494386    c      3

Then to get the values of the previous 1 record, use the group and idRank variables and subtract 1 from the idRank and use the multi = 'last' argument. To get the value from the record two entries above, subtract 2.

DT[,prev:=DT[J(group,idRank-1), value, mult='last']] DT[,prev2:=DT[J(group,idRank-2), value, mult='last']]      ID group       value info idRank        prev      prev2  1:  4     1 -0.36367602    b      1          NA         NA  2:  5     1 -1.62667268    b      2 -0.36367602         NA  3:  7     1 -1.20807618    b      3 -1.62667268 -0.3636760  4:  1     2  1.10177950    a      1          NA         NA  5:  2     2  0.75578151    a      2  1.10177950         NA  6: 12     2 -0.25647839    b      3  0.75578151  1.1017795  7:  3     3  0.74139013    c      1          NA         NA  8:  6     3  0.98744470    b      2  0.74139013         NA  9:  9     3 -0.23823356    a      3  0.98744470  0.7413901 10:  8     4 -0.19515038    c      1          NA         NA 11: 10     4  0.08934727    c      2 -0.19515038         NA 12: 11     4 -0.95494386    c      3  0.08934727 -0.1951504

For LEAD, add the appropriate offset to the idRank variable and switch to multi = 'first':

DT[,nex:=DT[J(group,idRank+1), value, mult='first']] DT[,nex2:=DT[J(group,idRank+2), value, mult='first']]      ID group       value info idRank        prev      prev2         nex       nex2  1:  4     1 -0.36367602    b      1          NA         NA -1.62667268 -1.2080762  2:  5     1 -1.62667268    b      2 -0.36367602         NA -1.20807618         NA  3:  7     1 -1.20807618    b      3 -1.62667268 -0.3636760          NA         NA  4:  1     2  1.10177950    a      1          NA         NA  0.75578151 -0.2564784  5:  2     2  0.75578151    a      2  1.10177950         NA -0.25647839         NA  6: 12     2 -0.25647839    b      3  0.75578151  1.1017795          NA         NA  7:  3     3  0.74139013    c      1          NA         NA  0.98744470 -0.2382336  8:  6     3  0.98744470    b      2  0.74139013         NA -0.23823356         NA  9:  9     3 -0.23823356    a      3  0.98744470  0.7413901          NA         NA 10:  8     4 -0.19515038    c      1          NA         NA  0.08934727 -0.9549439 11: 10     4  0.08934727    c      2 -0.19515038         NA -0.95494386         NA 12: 11     4 -0.95494386    c      3  0.08934727 -0.1951504          NA         NA

191

answered Sep 21 '22 08:09

BenBarnes

From data.table v1.9.5+, function frank() (for fast rank) has been implemented. frank() is useful in interactive scenarios, where as frankv() allows to easily program with.

It implements every operation available in base::rank. In addition, the advantages are:

frank() operates on list, data.frames and data.tables in addition to atomic vectors.
We can specify, for each column, whether rank should be computed on increasing or decreasing order.
It also implements rank type dense in addition to other types in base.
You can use - on a character column as well to rank by decreasing order.

Here's an illustration of all the above points using the same data.table DT from @BenBarnes' (excellent) post.

data:

require(data.table) set.seed(10) sample_n <- function(x, n) sample(x, n, replace=TRUE) DT <- data.table(         ID = seq_len(4*3),         group = rep(1:4,each=3),         value = rnorm(4*3),         info = c(sample_n(letters[1:2], 8), sample_n(letters[3:4], 4)))

On single columns:

Compute dense rank:

DT[, rank := frank(value, ties.method="dense"), by=group]

You can also use the other methods min, max, random, average and first.

In decreasing order:

DT[, rank := frank(-value, ties.method="dense"), by=group]

Using frankv, similar to frank:

# increasing order frankv(DT, "value", ties.method="dense")  # decreasing order frankv(DT, "value", order=-1L, ties.method="dense")

On multiple columns

You can use .SD, which stands for Subset of Data and contains the data corresponding to that group. See the Introduction to data.table HTML vignette for more on .SD.

Rank by info, value columns while grouping by group:

DT[, rank := frank(.SD,  info, value, ties.method="dense"), by=group]

Use - to specify decreasing order:

DT[, rank := frank(.SD,  info, -value, ties.method="dense"), by=group]

You can also use - directly on character columns

DT[, rank := frank(.SD, -info, -value, ties.method="dense"), by=group]

You can use frankv similarly and provide the columns to cols argument and the order by which the columns should be ranked using the order argument.

Small benchmark to compare with base::rank:

set.seed(45L) x = sample(1e4, 1e7, TRUE) system.time(ans1 <- base::rank(x, ties.method="first")) #    user  system elapsed  #  22.200   0.255  22.536  system.time(ans2 <- frank(x, ties.method="first")) #    user  system elapsed  #   0.745   0.014   0.762  identical(ans1, ans2) # [1] TRUE

answered Sep 18 '22 08:09

Arun

Related questions
                            
                                Does linux allow any system call to be made from signal handlers?
                            
                                How to add Stretchable UIImage in the CALayer.contents?
                            
                                Expandable list view in Android doesn't have reaction to click, it doesn't expand
                            
                                Is there a ternary operator in handlebars.js?
                            
                                Generating Clojure code with macro containing type hints
                            
                                .NET 4.5 RTM on Azure?
                            
                                Backbone.js best practice for event handlers naming
                            
                                Different result when using GET/POST in elastic search
                            
                                CSS display vertical scrollbar on pre tag
                            
                                How can I import a CSV file via a rake task?
                            
                                get raw decimal value from mysqldb query
                            
                                Jquery autocomplete on select event

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With