Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Select rows based on whether value of a columns is in top X of columns

Tags:

r

subset

I have a dataframe which I want to subset depending on whether a value in a certain column is among the top 5 values of all the columns in that row. This is a simplified version of my dataframe:

> my.df <- data.frame(a = rnorm(10,5), b= rnorm(10,5), c=rnorm(10,5), d=rnorm(10,5), e=rnorm(10,5))
> my.df
          a        b        c        d        e
1  6.401462 5.318849 5.373496 5.101140 3.710973
2  6.715845 4.786936 3.521965 4.264029 4.525138
3  6.076211 5.356114 5.605134 5.443002 5.296778
4  7.009623 5.275595 4.801874 4.355892 6.752737
5  5.002059 6.163398 6.063694 2.409702 6.172111
6  6.298305 3.291884 5.737053 4.701320 4.752406
7  4.856246 4.674743 5.550828 7.501786 5.466611
8  5.037990 4.129333 4.797334 5.143915 5.558161
9  4.903592 3.135622 5.879798 5.639893 4.368915
10 5.500374 4.400130 3.980433 6.203259 4.498614

Now I want only the rows where the values of column a or column b are among the top 2 values in their row. So in this example deleting row 7-9, giving:

          a        b        c        d        e
1  6.401462 5.318849 5.373496 5.101140 3.710973
2  6.715845 4.786936 3.521965 4.264029 4.525138
3  6.076211 5.356114 5.605134 5.443002 5.296778
4  7.009623 5.275595 4.801874 4.355892 6.752737
5  5.002059 6.163398 6.063694 2.409702 6.172111
6  6.298305 3.291884 5.737053 4.701320 4.752406
10 5.500374 4.400130 3.980433 6.203259 4.498614

Any ideas?

like image 472
joffie Avatar asked Mar 04 '23 09:03

joffie


1 Answers

We can loop through the rows with apply (from base R) check whether any of the elements in 'a' or 'b' are %in% the sorted group to create a logical index and subset the rows based on that

i1 <- apply(my.df, 1, function(x) any(x[1:2] %in% sort(x, decreasing = TRUE)[1:2]))
my.df[i1,]
#         a        b        c        d        e
#1  6.401462 5.318849 5.373496 5.101140 3.710973
#2  6.715845 4.786936 3.521965 4.264029 4.525138
#3  6.076211 5.356114 5.605134 5.443002 5.296778
#4  7.009623 5.275595 4.801874 4.355892 6.752737
#5  5.002059 6.163398 6.063694 2.409702 6.172111
#6  6.298305 3.291884 5.737053 4.701320 4.752406
#10 5.500374 4.400130 3.980433 6.203259 4.498614

Or use max.col from base R to create the logical index and that would be much faster and avoid any transformation

i1 <- max.col(my.df, "first")
i2 <- max.col(replace(my.df, cbind(seq_len(nrow(my.df)), i1), -Inf), "first")
my.df[(i1 %in% 1:2) | (i2 %in% 1:2), ]

data

my.df <- structure(list(a = c(6.401462, 6.715845, 6.076211, 7.009623, 
5.002059, 6.298305, 4.856246, 5.03799, 4.903592, 5.500374), b = c(5.318849, 
4.786936, 5.356114, 5.275595, 6.163398, 3.291884, 4.674743, 4.129333, 
3.135622, 4.40013), c = c(5.373496, 3.521965, 5.605134, 4.801874, 
6.063694, 5.737053, 5.550828, 4.797334, 5.879798, 3.980433), 
    d = c(5.10114, 4.264029, 5.443002, 4.355892, 2.409702, 4.70132, 
    7.501786, 5.143915, 5.639893, 6.203259), e = c(3.710973, 
    4.525138, 5.296778, 6.752737, 6.172111, 4.752406, 5.466611, 
    5.558161, 4.368915, 4.498614)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6", "7", "8", "9", "10"))
like image 166
akrun Avatar answered Mar 07 '23 01:03

akrun