I'm working on a test coverage analysis and I would like to rearrange a matrix so that the columns are ordered by number of "additional" test failures.
As an example, I have a matrix of TRUE and FALSE values, where TRUE indicates a failure.
df <- structure(c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE), .Dim = c(10L, 3L), .Dimnames = list(NULL, c("t1", "t2", "t3")))
t2 has the highest number of failures and should be the first column. t1 has the next highest, but all of its failing rows are already covered by t2. t3 has fewer failures overall, but its last two failures are not covered by t2, so it should be the second column.
Desired column order based on fail coverage:
df <- structure(c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE), .Dim = c(10L, 3L), .Dimnames = list(NULL, c("t2", "t3", "t1")))
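The "additional failure" counts behind this ordering can be checked directly on the example matrix (a quick sanity check, using only `colSums` and logical indexing):

```r
# The example matrix from above
df <- structure(c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE,
                  FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
                  FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
                  FALSE, TRUE, TRUE, TRUE),
                .Dim = c(10L, 3L),
                .Dimnames = list(NULL, c("t1", "t2", "t3")))

colSums(df)
# t1 t2 t3
#  6  8  3        -> t2 has the most failures, so it goes first

covered <- df[, "t2"]                               # rows t2 already fails on
colSums(df[!covered, c("t1", "t3"), drop = FALSE])
# t1 t3
#  0  2           -> t1 adds nothing, t3 adds 2 new rows, so t3 goes second
```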
I was able to get a count of "additional" fails per test using a for loop together with the apply function, but performance is really bad when there are many columns and rows in the data set. I would, however, prefer to rearrange the columns themselves for further processing.
col.list <- character(0)  # selected column names, in order
val.list <- integer(0)    # "additional" failures contributed by each column
out <- df
for (n in seq_len(ncol(out) - 1)) {
  idx <- which.max(apply(out, 2, sum, na.rm = TRUE))
  col.list <- c(col.list, names(idx))
  val.list <- c(val.list, sum(out[, idx], na.rm = TRUE))
  out[out[, idx] == TRUE, ] <- FALSE  # rows covered by the chosen column no longer count
  out <- out[, -idx, drop = FALSE]
}
col.list <- c(col.list, colnames(out))  # append the last remaining column
Can anyone suggest a better approach to do this? Maybe one that avoids the for loop?
Thanks.
Here's a somewhat similar approach to the OP's, but I hope it will perform slightly better (not benchmarked, though):
select_cols <- names(tail(sort(colSums(df)), 1))  # column with the most failures goes first
for (i in seq_len(ncol(df) - 1)) {
  remaining_cols <- setdiff(colnames(df), select_cols)
  idx <- rowSums(df[, select_cols, drop = FALSE]) > 0  # rows already covered
  select_cols <- c(select_cols,
                   names(tail(sort(colSums(df[!idx, remaining_cols, drop = FALSE])), 1)))
}
df <- df[, select_cols]
df
# t2 t3 t1
# [1,] TRUE FALSE TRUE
# [2,] TRUE FALSE TRUE
# [3,] TRUE FALSE TRUE
# [4,] TRUE FALSE TRUE
# [5,] TRUE FALSE TRUE
# [6,] TRUE FALSE TRUE
# [7,] TRUE FALSE FALSE
# [8,] TRUE TRUE FALSE
# [9,] FALSE TRUE FALSE
# [10,] FALSE TRUE FALSE
Update: try this slightly modified version. It is a lot faster, and I think it still produces correct results:
m <- df  # same logical failure matrix as above
select_cols <- names(tail(sort(colSums(m)), 1))  # first col
idx <- rowSums(m[, select_cols, drop = FALSE]) > 0
for (i in seq_len(ncol(m) - 1)) {
  remaining_cols <- setdiff(colnames(m), select_cols)
  idx[!idx] <- rowSums(m[!idx, select_cols, drop = FALSE]) > 0  # only update uncovered rows
  select_cols <- c(select_cols,
                   names(tail(sort(colSums(m[!idx, remaining_cols, drop = FALSE])), 1)))
}
m <- m[, select_cols]
m
The main difference between the two is this line:
idx[!idx] <- rowSums(m[!idx, select_cols, drop=FALSE]) > 0
which means we no longer compute rowSums for rows where some previously selected column is already TRUE; once a row is covered it stays covered and is skipped.
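For convenience, the updated version can be wrapped in a function (a sketch; `order_by_coverage` is my own name, and `which.max` stands in for `tail(sort(...), 1)` to pick the column with the most uncovered failures):

```r
order_by_coverage <- function(m) {
  select_cols <- names(which.max(colSums(m)))  # start with the most failures overall
  idx <- m[, select_cols[1]]                   # rows already covered
  for (i in seq_len(ncol(m) - 1L)) {
    remaining_cols <- setdiff(colnames(m), select_cols)
    idx[!idx] <- rowSums(m[!idx, select_cols, drop = FALSE]) > 0
    select_cols <- c(select_cols,              # next: most failures among uncovered rows
                     names(which.max(colSums(m[!idx, remaining_cols, drop = FALSE]))))
  }
  select_cols
}

# With the question's example matrix:
m <- structure(c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE,
                 FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE,
                 FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE,
                 FALSE, TRUE, TRUE, TRUE),
               .Dim = c(10L, 3L),
               .Dimnames = list(NULL, c("t1", "t2", "t3")))
order_by_coverage(m)
# [1] "t2" "t3" "t1"
```

Note `which.max` breaks ties by taking the first column, whereas `tail(sort(...), 1)` may pick a different one; the coverage counts are the same either way.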