
Rearrange columns based on coverage of previous columns

Tags: r, apply

I'm working on a test coverage analysis and I would like to rearrange a matrix so that the columns are ordered by number of "additional" test failures.

As an example I have a matrix with TRUE and FALSE where TRUE indicates a failure.

df <- structure(c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE), .Dim = c(10L, 3L), .Dimnames = list(NULL, c("t1", "t2", "t3")))

t2 has the highest number of failures and should be the first column. t1 has the next highest, but all of its failures (per row) are already covered by t2. t3 has fewer failures, but its last two failures are not covered by t2, so it should be the second column.
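
To make the idea of "additional" failures concrete, here is a minimal sketch that counts, for t1 and t3, the failures on rows where t2 did not fail. The names `totals`, `extra_t1` and `extra_t3` are only illustrative, and the matrix is rebuilt compactly with `cbind` so the snippet runs on its own.

```r
# Same values as the example matrix, rebuilt compactly.
df <- cbind(t1 = rep(c(TRUE, FALSE), c(6, 4)),
            t2 = rep(c(TRUE, FALSE), c(8, 2)),
            t3 = rep(c(FALSE, TRUE), c(7, 3)))

totals   <- colSums(df)                      # total failures per test
extra_t1 <- sum(df[, "t1"] & !df[, "t2"])    # t1 failures on rows t2 misses
extra_t3 <- sum(df[, "t3"] & !df[, "t2"])    # t3 failures on rows t2 misses
```

Here t1 adds nothing on top of t2 while t3 adds two failures, which is why t3 should come before t1 even though t1 has more failures in total.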

Desired column order based on fail coverage:

df <- structure(c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE), .Dim = c(10L, 3L), .Dimnames = list(NULL, c("t2", "t3", "t1")))

I was able to get a count of "additional" fails per test using a for loop in conjunction with the apply function, but performance is really bad when there are a lot of columns and rows in the data set. I would, however, prefer to rearrange the columns for further processing.

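out <- df                      # assumption: working copy of the matrix above
col.list <- character(0)       # selected column names, in order
val.list <- numeric(0)         # "additional" failures per selected column
for (n in seq_len(ncol(df))) {
  idx <- which.max(apply(out, 2, sum, na.rm = TRUE))
  col.list <- c(col.list, names(idx))
  val.list <- c(val.list, sum(out[, idx], na.rm = TRUE))
  out[which(out[, idx]), ] <- FALSE   # rows covered by this column no longer count
  out <- out[, -idx, drop = FALSE]    # keep matrix shape when one column is left
}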

Can anyone suggest a better approach to do this? Maybe not using a for loop?

Thanks.

alaj asked Nov 07 '22 00:11


1 Answer

Here's a somewhat similar approach to OP's but I hope it will perform slightly better (not tested though):

select_cols <- names(tail(sort(colSums(df)), 1)) # first col
for(i in seq_len(ncol(df)-1)) {
  remaining_cols <- setdiff(colnames(df), select_cols)
  idx <- rowSums(df[, select_cols, drop=FALSE]) > 0
  select_cols <- c(select_cols, 
                   names(tail(sort(colSums(df[!idx, remaining_cols, drop=FALSE])), 1)))
}
df <- df[, select_cols]
df

#          t2    t3    t1
#  [1,]  TRUE FALSE  TRUE
#  [2,]  TRUE FALSE  TRUE
#  [3,]  TRUE FALSE  TRUE
#  [4,]  TRUE FALSE  TRUE
#  [5,]  TRUE FALSE  TRUE
#  [6,]  TRUE FALSE  TRUE
#  [7,]  TRUE FALSE FALSE
#  [8,]  TRUE  TRUE FALSE
#  [9,] FALSE  TRUE FALSE
# [10,] FALSE  TRUE FALSE

Update: try this slightly modified version (here m is the input matrix, i.e. df above). It is a lot faster and I think it will produce correct results:

  select_cols <- names(tail(sort(colSums(m)), 1)) # first col
  idx <- rowSums(m[, select_cols, drop = FALSE]) > 0
  for(i in seq_len(ncol(m)-1)) {
    remaining_cols <- setdiff(colnames(m), select_cols)
    idx[!idx] <- rowSums(m[!idx, select_cols, drop=FALSE]) > 0
    select_cols <- c(select_cols, 
                     names(tail(sort(colSums(m[!idx, remaining_cols, drop=FALSE])), 1)))
  }
  m <- m[, select_cols]
  m

The main difference between the two is this line:

idx[!idx] <- rowSums(m[!idx, select_cols, drop=FALSE]) > 0

which means rowSums is only recomputed for rows that are not yet covered; rows where any previously selected column is already TRUE are skipped.
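
As a possible further step in the same direction, here is a minimal sketch that wraps the greedy selection in a function and updates the coverage vector using only the newly picked column, so rowSums is not needed at all. It also returns the "additional" failure count for each column. The function name `order_by_new_failures`, its return structure, and the choice to treat NA as "no failure" are assumptions made for illustration, not part of the code above, and it has not been benchmarked.

```r
# Hypothetical helper: greedy column ordering by failures not yet covered.
order_by_new_failures <- function(m) {
  stopifnot(is.matrix(m), is.logical(m), !is.null(colnames(m)))
  m[is.na(m)] <- FALSE               # assumption: NA counts as "no failure"
  covered   <- logical(nrow(m))      # rows already failing in a picked column
  remaining <- colnames(m)
  picked    <- character(0)
  gain      <- numeric(0)
  while (length(remaining) > 0) {
    # failures each remaining column adds on rows that are still uncovered
    extra <- colSums(m[!covered, remaining, drop = FALSE])
    best  <- names(extra)[which.max(extra)]   # ties: first column wins
    picked    <- c(picked, best)
    gain      <- c(gain, extra[[best]])
    covered   <- covered | m[, best]          # only the new column adds coverage
    remaining <- setdiff(remaining, best)
  }
  list(order = picked, additional = setNames(gain, picked))
}

# Example data with the same values as the question's matrix.
df <- cbind(t1 = rep(c(TRUE, FALSE), c(6, 4)),
            t2 = rep(c(TRUE, FALSE), c(8, 2)),
            t3 = rep(c(FALSE, TRUE), c(7, 3)))
res <- order_by_new_failures(df)
df_ordered <- df[, res$order, drop = FALSE]
```

For the example data, res$order is t2, t3, t1 with 8, 2 and 0 additional failures. Note that ties are broken differently here (which.max takes the first maximum, tail(sort(...), 1) the last), so the order of columns with equal gain may differ from the loops above.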

talat answered Nov 15 '22 06:11