I want to remove duplicated rows from a dataframe, stratified by a factor and selected by different conditions, such as highest mean or sd.
Some data, where a is the factor and the id for the rows:
set.seed(13654)
a<- sort(c(1,1,4,1,2,3,2,3,1,5))
b<- matrix(runif(100,min = 6,max = 14),nrow = 10)
c<- data.frame(a,b)
For example, I want to reduce the final dataset to the rows with the highest mean value within each level of a.
# calculate means per row
gr <- cbind(a,M=rowMeans(c[,-1]))
# get rows stratified by a with highest mean:
gr1 <- aggregate(M~a,gr,which.max)
gr1
a M
1 1 3
2 2 2
3 3 1
4 4 1
5 5 1
Thus, the third row of factor level 1, the second row of factor level 2, ... should be included in the new dataframe. I want to avoid loops. What I tried is to split the data and then use lapply, but it has not worked so far.
cl <- split(c,a)
# this call does not work: it does not select the correct rows.
lapply(cl, "[", gr1, )
My final goal is a function like this:
remove.dupl <- function(data,factor,method=c(highest.mean,highest.sd,lowest.sd,...))
Can you provide some tips or a solution for my problem? Following my workflow, I need a "how-to" for using "[" correctly with lapply to select different rows from a list of dataframes.
Try the by() function:
set.seed(13654)
a <- sort(c(1,1,4,1,2,3,2,3,1,5))
b <- matrix(runif(100,min = 6,max = 14),nrow = 10)
c <- data.frame(a,b)
myfun <- function(x) which.max(rowMeans(x)) # just replicating your example, you could define other functions here
d <- by(data = c, INDICES = c$a, function(x) x[myfun(x), ]) # use by() to select rows, based on myfun()
d <- do.call(rbind, d) # turn result of by() function into a data frame
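If you would rather stay with the split() workflow from the question, the following is a minimal sketch of one way to pair each data frame in the list with its own row index. It assumes the same data as above; the names dat, idx, picked and res are hypothetical and only used for illustration (dat is used instead of c to avoid masking the base function). The point is that lapply() hands "[" the same index for every list element, whereas Map() walks over the list of data frames and the list of indices in parallel.

```r
set.seed(13654)
a <- sort(c(1, 1, 4, 1, 2, 3, 2, 3, 1, 5))
b <- matrix(runif(100, min = 6, max = 14), nrow = 10)
dat <- data.frame(a, b)  # hypothetical name, replaces 'c' from the question

# one data frame per level of a
cl <- split(dat, dat$a)

# position (within each group) of the row with the highest mean;
# column a is dropped before averaging
idx <- lapply(cl, function(x) which.max(rowMeans(x[, -1])))

# Map() applies the function to cl[[1]] and idx[[1]], cl[[2]] and idx[[2]], ...
picked <- Map(function(x, i) x[i, , drop = FALSE], cl, idx)

# bind the selected rows back into a single data frame
res <- do.call(rbind, picked)
res
```

If the index does not need to be stored separately, both steps can be folded into a single call such as lapply(cl, function(x) x[which.max(rowMeans(x[, -1])), ]).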
Using the data.table package, I would approach it as follows (cc is the dataframe c from the question under a new name, see the note at the end):
library(data.table)
# method 1:
setDT(cc)[, `:=` (rn = 1:.N, wm = which.max(rowMeans(.SD))), a][rn==wm]
# method 2:
setDT(cc)[, wm := frank(1/rowMeans(.SD), ties.method="first"), a][wm==1]
which gives:
a X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 wm rn
1: 1 13.946254 7.302729 9.406389 8.924367 8.129423 10.174735 6.547805 11.618872 12.84100 9.494790 3 3
2: 2 13.606555 12.798149 11.261258 12.991822 12.875935 11.199411 8.551149 10.377451 13.63219 13.643163 2 2
3: 3 6.820769 13.748507 11.630297 11.559873 6.196406 8.925419 11.230415 10.584249 10.41442 6.821673 1 1
4: 4 8.418767 10.673998 6.693021 11.101287 7.855519 9.106210 12.279536 6.925023 6.92334 10.279204 1 1
5: 5 11.529072 7.940031 10.746172 8.535466 13.703122 12.294424 11.362498 11.256843 13.95535 13.264835 1 1
In base R you could do:
cc$rm <- apply(cc[,-1], 1, mean)
cc$wm <- ave(cc$rm, cc$a, FUN = function(x) max(x)==x)
cc[cc$wm == 1,]
which gives:
a X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 rm wm
3 1 13.946254 7.302729 9.406389 8.924367 8.129423 10.174735 6.547805 11.618872 12.84100 9.494790 9.838637 1
6 2 13.606555 12.798149 11.261258 12.991822 12.875935 11.199411 8.551149 10.377451 13.63219 13.643163 12.093708 1
7 3 6.820769 13.748507 11.630297 11.559873 6.196406 8.925419 11.230415 10.584249 10.41442 6.821673 9.793203 1
9 4 8.418767 10.673998 6.693021 11.101287 7.855519 9.106210 12.279536 6.925023 6.92334 10.279204 9.025591 1
10 5 11.529072 7.940031 10.746172 8.535466 13.703122 12.294424 11.362498 11.256843 13.95535 13.264835 11.458781 1
In response to your comment: as an alternative, you can use the rank function inside ave:
# duplicate the row for which 'max(x)==x' for the first group
cc <- rbind(cc,cc[3,])
cc$wm2 <- ave(cc$rm, cc$a, FUN = function(x) rank(-x, ties.method = "first"))
cc[cc$wm2 == 1,]
which gives:
a X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 rm wm wm2
3 1 13.946254 7.302729 9.406389 8.924367 8.129423 10.174735 6.547805 11.618872 12.84100 9.494790 9.838637 1 1
6 2 13.606555 12.798149 11.261258 12.991822 12.875935 11.199411 8.551149 10.377451 13.63219 13.643163 12.093708 1 1
7 3 6.820769 13.748507 11.630297 11.559873 6.196406 8.925419 11.230415 10.584249 10.41442 6.821673 9.793203 1 1
9 4 8.418767 10.673998 6.693021 11.101287 7.855519 9.106210 12.279536 6.925023 6.92334 10.279204 9.025591 1 1
10 5 11.529072 7.940031 10.746172 8.535466 13.703122 12.294424 11.362498 11.256843 13.95535 13.264835 11.458781 1 1
NOTE: I renamed the dataframe to cc, as it is better not to use a function name as the name of your dataframe.
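The rank-inside-ave idea can also be wrapped into the remove.dupl function sketched in the question. The version below is only an illustration under some assumptions: the method names are passed as strings, factor is the name of the grouping column, all other columns are numeric, and "lowest.mean" is an extra option that the question does not list.

```r
# Hypothetical sketch of remove.dupl(); argument names follow the question.
#   data   : data frame with one grouping column and numeric value columns
#   factor : name of the grouping column, as a string
#   method : rule that decides which row is kept per group
remove.dupl <- function(data, factor,
                        method = c("highest.mean", "lowest.mean",
                                   "highest.sd", "lowest.sd")) {
  method <- match.arg(method)
  vals <- as.matrix(data[, setdiff(names(data), factor), drop = FALSE])
  # per-row statistic: mean or standard deviation across the value columns
  stat <- if (grepl("mean$", method)) rowMeans(vals) else apply(vals, 1, sd)
  # negate for the "highest" methods so that rank 1 is always the row to keep
  if (startsWith(method, "highest")) stat <- -stat
  # rank within each group; ties.method = "first" keeps exactly one row on ties
  keep <- ave(stat, data[[factor]],
              FUN = function(x) rank(x, ties.method = "first")) == 1
  data[keep, , drop = FALSE]
}

# usage with the data from the question
set.seed(13654)
cc <- data.frame(a = sort(c(1, 1, 4, 1, 2, 3, 2, 3, 1, 5)),
                 matrix(runif(100, min = 6, max = 14), nrow = 10))
remove.dupl(cc, "a", "highest.mean")
remove.dupl(cc, "a", "lowest.sd")
```

Because the selection happens through a logical index on the original dataframe, the original row names are preserved, and no helper columns such as rm or wm are added to the result.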