Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Extracting indices for data frame rows that have MAX value for named field

Tags:

r

I have a data frame that is rather large and I need a good way (explained bellow) to extract indices for rows that have maximum values for a given field, within a certain set of labels. To explain this a bit better, here is an example 10 row data frame:

      value label
1  5.531637     D
2  5.826498     A
3  8.866210     A
4  1.387978     C
5  8.128505     C
6  7.391311     B
7  1.829392     A
8  4.373273     D
9  7.380244     A
10 6.157304     D

To generate:

structure(list(value = c(5.531637, 5.826498, 8.86621, 1.387978, 8.128505, 
7.391311, 1.829392, 4.373273, 7.380244, 6.157304), 
label = c("D", "A", "A", "C", "C", "B", "A", "D", "A", "D")), 
.Names = c("value", "label"), class = "data.frame", row.names = c(NA, -10L))

If I want to know what the index is for rows that have the maximum value per label, I currently use the following code:

idx <- sapply(split(1:nrow(d), d$label), function(x) {
  x[which.max(d[x,"value"])]
})

Generating this answer:

A  B  C  D 
3  6  5 10

I have also played around with ddply but have yet to find a better way to do this. By "better" in this case I mean faster (ddply is pretty slow and what I currently use is not far behind) as well as more elegant since the above solution seems way to wordy too me.

like image 685
diliop Avatar asked Feb 03 '23 18:02

diliop


1 Answers

First of all: you can get the speed up using:

idx <- sapply(split(seq_len(nrow(d)), d$label), function(x) {
      x[which.max(d$value[x])]})

For a 100k data.frame, on my machine it is 5x faster than d[x,"value"] version.

For a large data.frame and many labels you could use a similar method that I posted in earlier question:

dd <- d[i<-order(d$label, d$value),] # dd is sorted by label and value
ind <- c(dd$label[-1] != dd$label[-n], TRUE)
idx <- setNames(seq_len(nrow(d))[i][ind], dd$label[ind])

edit: A more efficient solution with the use of a trick from Martin Morgan answer:

v <- d$label[i<-order(d$value)] # we need only label, and with Martin
                                # trick sorting over label is not needed
ind <- !duplicated(v, fromLast=TRUE) # it finds last (max) occurrence of label
idx <- setNames(seq_len(nrow(d))[i][ind], v[ind])

NOTE: order of final vector is different.

It depends on your actual data structure but you should gain a nice speed-up:

Timings:

# NOTE: different machine, so timing differ from previous
set.seed(6025051)
n <- 100000; k <- 20000
d <- data.frame(value=rnorm(n), 
    label=sample(paste("A",seq_len(k),sep="_"), n, replace=TRUE))

system.time(
    idx_1 <- sapply(split(1:nrow(d), d$label), function(x) {
        x[which.max(d[x,"value"])]})
)
# user  system elapsed 
# 1.30    0.02    1.31 
system.time(
    idx_1b <- sapply(split(seq_len(nrow(d)), d$label), function(x) {
        x[which.max(d$value[x])]})
)
# user  system elapsed 
# 0.23    0.00    0.23
all.equal(idx_1, idx_1b)
# [1] TRUE
system.time({
    dd <- d[i<-order(d$label, d$value),]
    ind <- c(dd$label[-1] != dd$label[-n], TRUE)
    idx_2 <- setNames(seq_len(nrow(d))[i][ind],dd$label[ind])
})
# user  system elapsed 
# 0.19    0.00    0.19 
all.equal(idx_1, idx_2)
# [1] TRUE

new solution

system.time({
    v <- d$label[i<-order(d$value)]
            ind <- !duplicated(v, fromLast=TRUE)
            idx_3 <- setNames(seq_len(nrow(d))[i][ind], v[ind])
})
# user  system elapsed 
# 0.05    0.00    0.04 
all.equal(sort(idx_1), sort(idx_3))
# [1] TRUE
like image 134
Marek Avatar answered Feb 05 '23 14:02

Marek