I have two columns where each row sums to 1 (the columns hold the probabilities of two classes). For each row, I need to find the number of the column where a condition is met.
C1 C2
0.4 0.6
0.3 0.7
1 0
0.7 0.3
0.1 0.9
For example, if I need to find the column where the number is >= 0.6, in the table above it should result in:
2
2
1
1
2
Thanks for this interesting question. Here is an idea using apply.
apply(dat, 1, function(x) which(x >= 0.6))
# [1] 2 2 1 1 2
DATA
dat <- read.table(textConnection("C1 C2
0.4 0.6
0.3 0.7
1 0
0.7 0.3
0.1 0.9"), header = TRUE)
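One caveat with the apply approach: when a row has no column (or more than one column) meeting the condition, which() returns a vector whose length is not 1, and apply then returns a list instead of a plain vector. A minimal sketch of one way to guard against that is shown here; the helper name first_match is hypothetical and not part of the original answer, and it assumes you want the first matching column, or NA when there is none.

```r
# Same example data as above, rebuilt so the snippet runs on its own.
dat <- data.frame(C1 = c(0.4, 0.3, 1, 0.7, 0.1),
                  C2 = c(0.6, 0.7, 0, 0.3, 0.9))

# Hypothetical helper: index of the first column meeting the threshold,
# or NA when no column does.
first_match <- function(x, threshold = 0.6) {
  hits <- which(x >= threshold)
  if (length(hits) == 0L) NA_integer_ else unname(hits[1L])
}

# Always yields one integer per row, so the result stays a vector.
res <- unname(apply(as.matrix(dat), 1, first_match))
res
```

With the example data this gives the same column numbers as the plain which() call, since every row has exactly one match.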
Benchmarking
I ran the benchmark on the original data frame dat and on a data frame with 5000 rows, dat2. The results are as follows. I feel a little bit embarrassed that my apply method is the slowest.
If anyone has any idea how to improve the way I conducted the benchmark, please let me know.
library(microbenchmark)
# Benchmark 1
perf <- microbenchmark(
  m1 = {apply(dat, 1, function(x) which(x >= 0.6))},
  m2 = {ifelse(dat$C1 <= 0.4, 2, 1)},
  m3 = {(dat$C2 >= 0.6) + 1},
  m4 = {(which(t(dat) >= 0.6) + 1) %% ncol(dat) + 1},
  m5 = {((dat >= 0.6) %*% c(1, 2))[, 1]},
  m6 = {m <- which(dat >= 0.6, arr.ind = TRUE)
        m[order(m[, 1]), ][, 2]},
  m7 = {max.col(dat >= 0.6)})
perf
# Unit: microseconds
# expr min lq mean median uq max neval
# m1 58.602 65.0280 88.34563 67.5985 70.6825 1746.246 100
# m2 9.253 12.8515 15.45772 13.8790 14.9080 49.349 100
# m3 4.112 5.6540 6.59015 6.1690 7.1970 23.132 100
# m4 30.844 35.7270 40.29682 38.0405 40.8670 134.683 100
# m5 23.647 26.7310 30.13404 27.7590 29.8160 77.109 100
# m6 49.863 53.4620 61.31148 56.5460 59.8875 168.610 100
# m7 37.012 40.0960 45.36537 42.1530 45.2370 97.671 100
# Benchmark 2
dat2 <- dat[rep(1:5, 1000), ]
perf2 <- microbenchmark(
  m1 = {apply(dat2, 1, function(x) which(x >= 0.6))},
  m2 = {ifelse(dat2$C1 <= 0.4, 2, 1)},
  m3 = {(dat2$C2 >= 0.6) + 1},
  m4 = {(which(t(dat2) >= 0.6) + 1) %% ncol(dat2) + 1},
  m5 = {((dat2 >= 0.6) %*% c(1, 2))[, 1]},
  m6 = {m <- which(dat2 >= 0.6, arr.ind = TRUE)
        m[order(m[, 1]), ][, 2]},
  m7 = {max.col(dat2 >= 0.6)})
perf2
# Unit: microseconds
# expr min lq mean median uq max neval
# m1 13842.995 14830.2380 17173.18941 15716.2125 16551.8095 165431.735 100
# m2 133.140 146.7630 168.86722 160.6420 179.9195 314.602 100
# m3 22.104 25.7030 31.93827 28.0160 33.9280 67.341 100
# m4 156.787 179.6620 212.97310 210.5055 234.6665 320.257 100
# m5 131.598 148.8195 173.42179 164.2410 189.9440 286.843 100
# m6 403.019 439.2600 496.25370 472.6735 549.0110 791.646 100
# m7 140.337 156.7870 270.48048 179.4055 208.9635 8631.503 100
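Timings are only comparable if all methods return the same answer, so it may be worth checking that first. The sketch below is not part of the original benchmark; it reruns the seven expressions on the example data and compares each result with m3 after dropping names and coercing to integer. The names results and agree are illustrative.

```r
# Example data, rebuilt so the snippet runs on its own.
dat <- data.frame(C1 = c(0.4, 0.3, 1, 0.7, 0.1),
                  C2 = c(0.6, 0.7, 0, 0.3, 0.9))

# One result per benchmarked method.
results <- list(
  m1 = apply(dat, 1, function(x) which(x >= 0.6)),
  m2 = ifelse(dat$C1 <= 0.4, 2, 1),
  m3 = (dat$C2 >= 0.6) + 1,
  m4 = (which(t(dat) >= 0.6) + 1) %% ncol(dat) + 1,
  m5 = ((dat >= 0.6) %*% c(1, 2))[, 1],
  m6 = local({
    m <- which(dat >= 0.6, arr.ind = TRUE)
    m[order(m[, 1]), ][, 2]
  }),
  m7 = max.col(dat >= 0.6)
)

# Compare every method against m3, ignoring names and storage type.
reference <- as.integer(results$m3)
agree <- vapply(results,
                function(r) identical(as.integer(unname(r)), reference),
                logical(1))
agree
```

Note that this only confirms agreement on data where every row has exactly one qualifying column; the methods can differ on rows with no match or several matches.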
You can make use of the fact that TRUE = 1 and FALSE = 0:
> df <- read.table(textConnection("C1 C2
+ 0.4 0.6
+ 0.3 0.7
+ 1 0
+ 0.7 0.3
+ 0.1 0.9"), header = T)
> (df$C2 >= 0.6) + 1
[1] 2 2 1 1 2
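This shortcut assumes that whenever C2 fails the test, C1 passes it. That holds for the example data, but a row such as 0.5 / 0.5 would be labeled 1 even though neither column reaches 0.6. A hedged sketch of a variant that returns NA in that case follows; the data frame p and the variable names are made up for illustration.

```r
# Illustrative data: row 2 has no column >= 0.6.
p <- data.frame(C1 = c(0.4, 0.5, 1),
                C2 = c(0.6, 0.5, 0))

# The shortcut: FALSE + 1 is 1, TRUE + 1 is 2.
naive <- (p$C2 >= 0.6) + 1

# Variant that checks both columns and marks "no match" as NA.
checked <- ifelse(p$C2 >= 0.6, 2L,
                  ifelse(p$C1 >= 0.6, 1L, NA_integer_))

naive
checked
```

If every row is guaranteed to have one probability at or above the threshold, the shorter form is enough.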
This uses matrix multiplication (dat is the data frame holding the two columns):
(dat >= 0.6) %*% c(1, 2)
[,1]
[1,] 2
[2,] 2
[3,] 1
[4,] 1
[5,] 2
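The multiplication works because each row of the logical matrix is weighted by the column numbers c(1, 2), so a row with a single TRUE yields that column's number. As a small illustration (the matrix m below is made up, not from the question), a row with no match yields 0 and a row where both columns match yields 3, so those values can serve as flags:

```r
# Illustrative rows: only column 2 matches, both match, none match.
m <- rbind(c(0.4, 0.6),
           c(1,   1),
           c(0.5, 0.5))

# Weight each TRUE by its column number and sum across the row.
codes <- drop((m >= 0.6) %*% c(1, 2))
codes
```

Here drop() simply turns the one-column result into a plain vector.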
If more than one column can satisfy the condition, then which() is a better option.
I have modified the data so that both column 1 and column 2 satisfy the condition in row 3.
# Data
df <- read.table(text = "C1 C2
0.4 0.6
0.3 0.7
1 1
0.7 0.3
0.1 0.9", header = TRUE, stringsAsFactors = FALSE)
# Use of which with arr.ind = TRUE
which(df >= 0.6, arr.ind = TRUE)
# Result shows row number 3 twice
# row col
#[1,] 3 1
#[2,] 4 1
#[3,] 1 2
#[4,] 2 2
#[5,] 3 2
#[6,] 5 2
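The row/col matrix lists matches in column order, so it can help to regroup it by row. One possible way, sketched here with base R's split(), gives a list with one element per row that has at least one match; the names hits and by_row are illustrative, and rows with no match would simply be absent from the list.

```r
# Modified data from above: row 3 satisfies the condition in both columns.
df <- data.frame(C1 = c(0.4, 0.3, 1, 0.7, 0.1),
                 C2 = c(0.6, 0.7, 1, 0.3, 0.9))

# Matrix of (row, col) positions where the condition holds.
hits <- which(as.matrix(df) >= 0.6, arr.ind = TRUE)

# Group the matching column numbers by row number.
by_row <- split(unname(hits[, "col"]), hits[, "row"])
by_row
```

Element "3" of by_row then holds both column numbers, while the other rows hold one each.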