Rearrange columns based on coverage of previous columns

Question

I'm working on a test coverage analysis and I would like to rearrange a matrix so that the columns are ordered by number of "additional" test failures.

As an example I have a matrix with TRUE and FALSE where TRUE indicates a failure.

df <- structure(c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE), .Dim = c(10L, 3L), .Dimnames = list(NULL, c("t1", "t2", "t3")))

t2 has the highest number of failures and should be the first column. t1 has the next highest, but all of its failures (per row) are already covered by t2. t3 has fewer failures, but its last two failures are not covered by t2, so it should be the second column.
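To make the notion of "additional" failures concrete, here is a minimal sketch that rebuilds the example matrix and counts, for t1 and t3, only the failures in rows that t2 does not already cover. The names covered and extra are illustrative and not part of the original post.

```r
# Rebuild the example matrix from the question (TRUE = failure).
df <- cbind(
  t1 = rep(c(TRUE, FALSE), times = c(6, 4)),
  t2 = rep(c(TRUE, FALSE), times = c(8, 2)),
  t3 = rep(c(FALSE, TRUE), times = c(7, 3))
)

colSums(df)            # total failures: t1 = 6, t2 = 8, t3 = 3

# Rows already failed by the first pick (t2).
covered <- df[, "t2"]

# Failures each remaining column adds in the rows t2 does not cover.
extra <- colSums(df[!covered, c("t1", "t3"), drop = FALSE])
extra                  # additional failures: t1 = 0, t3 = 2
```

So although t1 has more failures in total, t3 is the only column that adds anything once t2 has been chosen.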

Desired column order based on fail coverage:

df <- structure(c(TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE), .Dim = c(10L, 3L), .Dimnames = list(NULL, c("t2", "t3", "t1")))

I was able to get a count of "additional" failures per test using a for loop together with apply, but performance is really bad when the data set has many rows and columns. What I would actually prefer is to rearrange the columns themselves for further processing.

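# Assumption: `out` is a working copy of df; the original post did not show
# how out, out.2, col.list and val.list were initialised.
out <- df
col.list <- c()   # column names in selection order
val.list <- c()   # additional failures contributed by each selected column
for (n in seq_len(ncol(df))) {
  counts <- colSums(out, na.rm = TRUE)
  idx <- which.max(counts)
  col.list <- c(col.list, names(idx))
  val.list <- c(val.list, counts[[idx]])
  out[which(out[, idx]), ] <- FALSE   # blank rows the chosen column covers
  out <- out[, -idx, drop = FALSE]    # keep matrix shape when one column is left
}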

Can anyone suggest a better approach to do this? Maybe not using a for loop?

Thanks.

talat · Accepted Answer

Here's a somewhat similar approach to OP's but I hope it will perform slightly better (not tested though):

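# Start with the column that has the most failures overall.
select_cols <- names(tail(sort(colSums(df)), 1))
for (i in seq_len(ncol(df) - 1)) {
  remaining_cols <- setdiff(colnames(df), select_cols)
  # Rows already covered by at least one selected column.
  idx <- rowSums(df[, select_cols, drop = FALSE]) > 0
  # Among the uncovered rows, pick the remaining column with the most failures.
  select_cols <- c(select_cols,
                   names(tail(sort(colSums(df[!idx, remaining_cols, drop = FALSE])), 1)))
}
df <- df[, select_cols]
df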

#          t2    t3    t1
#  [1,]  TRUE FALSE  TRUE
#  [2,]  TRUE FALSE  TRUE
#  [3,]  TRUE FALSE  TRUE
#  [4,]  TRUE FALSE  TRUE
#  [5,]  TRUE FALSE  TRUE
#  [6,]  TRUE FALSE  TRUE
#  [7,]  TRUE FALSE FALSE
#  [8,]  TRUE  TRUE FALSE
#  [9,] FALSE  TRUE FALSE
# [10,] FALSE  TRUE FALSE
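The question also tracked how many additional failures each column contributes. As a rough sketch (not part of the original answer), the same greedy idea can be wrapped in a function that returns both the column order and those counts. The function name order_by_coverage and its return shape are hypothetical; it assumes a logical matrix with unique column names, treats NA as "no failure", and breaks ties with which.max (first column wins), which can differ from the tail(sort(...)) tie-breaking above.

```r
# Hypothetical helper: greedy column ordering by additional failures.
order_by_coverage <- function(m) {
  stopifnot(is.matrix(m), is.logical(m), !is.null(colnames(m)))
  m[is.na(m)] <- FALSE                  # assumption: NA counts as "no failure"
  uncovered <- rep(TRUE, nrow(m))       # rows not yet hit by a selected column
  remaining <- colnames(m)
  ord <- character(0)
  added <- numeric(0)
  while (length(remaining) > 0) {
    # Failures each remaining column has in the still-uncovered rows.
    gains <- colSums(m[uncovered, remaining, drop = FALSE])
    best <- remaining[which.max(gains)] # ties: first column wins
    ord <- c(ord, best)
    added <- c(added, gains[[best]])
    uncovered <- uncovered & !m[, best] # mark newly covered rows
    remaining <- setdiff(remaining, best)
  }
  list(order = ord, added = setNames(added, ord))
}

# Usage on the example matrix from the question.
example <- cbind(
  t1 = rep(c(TRUE, FALSE), times = c(6, 4)),
  t2 = rep(c(TRUE, FALSE), times = c(8, 2)),
  t3 = rep(c(FALSE, TRUE), times = c(7, 3))
)
res <- order_by_coverage(example)
res$order                        # "t2" "t3" "t1"
res$added                        # t2 = 8, t3 = 2, t1 = 0
reordered <- example[, res$order]
```

Keeping a single logical vector of uncovered rows avoids recomputing row sums over all selected columns at every step.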

Update: try this slightly modified version. It is a lot faster, and I think it will produce the same results (here m is the logical matrix, i.e. df in the example above):

# Start with the column that has the most failures overall.
select_cols <- names(tail(sort(colSums(m)), 1))
# Rows already covered by the first selected column.
idx <- rowSums(m[, select_cols, drop = FALSE]) > 0
for (i in seq_len(ncol(m) - 1)) {
  remaining_cols <- setdiff(colnames(m), select_cols)
  # Re-check only the rows that were still uncovered.
  idx[!idx] <- rowSums(m[!idx, select_cols, drop = FALSE]) > 0
  select_cols <- c(select_cols,
                   names(tail(sort(colSums(m[!idx, remaining_cols, drop = FALSE])), 1)))
}
m <- m[, select_cols]
m

The main difference between the two is this line:

idx[!idx] <- rowSums(m[!idx, select_cols, drop=FALSE]) > 0

which means we don't need to compute rowSums for rows where any previously selected column is already true.
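One way to check the speed claim on your own data is to wrap both loops in functions and time them on a random matrix. The sketch below does that; the function names reorder_v1 and reorder_v2, the matrix size, and the 5% failure rate are all arbitrary assumptions, and both functions return the column order rather than the reordered matrix. Timings will vary by machine, so none are shown.

```r
# First version: recompute row coverage from scratch in every iteration.
reorder_v1 <- function(m) {
  select_cols <- names(tail(sort(colSums(m)), 1))
  for (i in seq_len(ncol(m) - 1)) {
    remaining_cols <- setdiff(colnames(m), select_cols)
    idx <- rowSums(m[, select_cols, drop = FALSE]) > 0
    select_cols <- c(select_cols,
                     names(tail(sort(colSums(m[!idx, remaining_cols, drop = FALSE])), 1)))
  }
  select_cols
}

# Second version: only re-check rows that were still uncovered.
reorder_v2 <- function(m) {
  select_cols <- names(tail(sort(colSums(m)), 1))
  idx <- rowSums(m[, select_cols, drop = FALSE]) > 0
  for (i in seq_len(ncol(m) - 1)) {
    remaining_cols <- setdiff(colnames(m), select_cols)
    idx[!idx] <- rowSums(m[!idx, select_cols, drop = FALSE]) > 0
    select_cols <- c(select_cols,
                     names(tail(sort(colSums(m[!idx, remaining_cols, drop = FALSE])), 1)))
  }
  select_cols
}

# Random test data (assumed shape: 2000 rows, 200 tests, ~5% failures).
set.seed(1)
sim <- matrix(runif(2000 * 200) < 0.05, nrow = 2000,
              dimnames = list(NULL, paste0("t", 1:200)))

time_v1 <- system.time(order_v1 <- reorder_v1(sim))
time_v2 <- system.time(order_v2 <- reorder_v2(sim))
same_order <- identical(order_v1, order_v2)  # both versions should agree
```

Comparing time_v1 and time_v2 on a matrix shaped like your real data gives a better picture than any fixed number quoted here.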

Tags: r, apply

alaj
