
List all combinations of factors (interactions) with no observations in a dataframe, up to a given dimension, removing redundancies

For a given data frame with only factor columns, I want to list all combinations of factor levels across up to m attributes that do not appear in the data. Below is a simple example:

d <- expand.grid(w=factor(1:2), x=factor(1:2), y=factor(1:2),
                 z=factor(1:2))

# These first rmcomb combinations will be removed by tail() below:
rmcomb <- 5; head(d, rmcomb)

##   w x y z
## 1 1 1 1 1
## 2 2 1 1 1
## 3 1 2 1 1
## 4 2 2 1 1
## 5 1 1 2 1


d <- tail(d, -rmcomb)
ftable(d, row.vars=c("w", "x"))

##     y 1   2  
##     z 1 2 1 2
## w x          
## 1 1   0 1 0 1
##   2   0 1 1 1
## 2 1   0 1 1 1
##   2   0 1 1 1

For m == 3, we consider all 4 + 6 + 4 = 14 combinations of up to three attributes in d:

m <- 3
library(plyr)

# build all column subsets of size 1..m, flattened into a single list
llply(
  1:m,
  function(i) combn(ncol(d), i, simplify=F)
) -> cc
unlist(cc, recursive=F) -> cc
length(cc)

## [1] 14

We can now tabulate selected columns of the data using table(), and use which() to find cells with zero counts:

llply(
  cc,
  function(cols) {
    # tabulate the selected columns and locate cells with zero count
    which(table(d[, cols]) == 0, arr.ind=T) -> z
    colnames(z) <- names(d)[cols]
    if (nrow(z) > 0) list(z) else NULL
  }
) -> zz
unlist(zz, recursive=F) -> zz
zz

## [[1]]
##   y z
## 1 1 1
## 
## [[2]]
##   w x z
## 1 1 1 1
## 
## [[3]]
##   w y z
## 1 1 1 1
## 2 2 1 1
## 
## [[4]]
##   x y z
## 1 1 1 1
## 2 2 1 1

However, items [[3]] and [[4]] in the result above are redundant: they are covered by item [[1]] (= no observations with y == 1 and z == 1), since any tuple that extends an absent tuple is itself absent. The desired result is therefore just (y,z) == (1,1) and (w,x,z) == (1,1,1).

Is there a built-in facility in R that would solve the problem with less coding, perhaps including removal of redundant (=covered) tuples? If not, how would you remove these redundant items for the code above?

asked Aug 16 '13 by krlmlr

3 Answers

Here's how you can continue your algorithm to pick out those sequences. First let's convert your list to a matrix with NAs filled in (note that this reuses the name m from the question, now for the matrix). I find a matrix easier to deal with, but I'm sure with some effort you can make it work with a list as well:

m = as.matrix(rbind.fill(lapply(zz, as.data.frame)))
#      y z  w  x
#[1,]  1 1 NA NA
#[2,] NA 1  1  1
#[3,]  1 1  1 NA
#[4,]  1 1  2 NA
#[5,]  1 1 NA  1
#[6,]  1 1 NA  2

Now let's introduce a function which tells us whether each row of the matrix subseq is a "subsequence" of seq, meaning it is already covered by seq as per the OP's definition:

is.subsequence = function(seq, subseq) {
  # compare seq against each row of subseq; after t(), rows become
  # columns, so seq recycles correctly down them (NAs propagate)
  comp = seq == t(subseq)

  # a row is covered iff at every position either seq is NA (no
  # constraint) or both values are non-NA and equal
  rowSums(t(is.na(comp) == is.na(seq) &
            matrix(!(comp %in% FALSE), nrow = length(seq)))) == length(seq)
}
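
As a quick sanity check against the matrix above: row 1 covers itself and rows 3 through 6, but not row 2, which leaves y unconstrained while row 1 requires y == 1:

is.subsequence(m[1,], m)
# [1]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE

A row counts as a subsequence of itself, which is why the loop below compares row i only against the rows after it.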

All that's left is to iterate over the matrix and throw out the covered sequences. We can do this going from top to bottom because the construction of zz in the question already puts shorter (more general) tuples first.

i = 1
while(i < nrow(m)) {
  # keep rows 1..i; of the remaining rows, keep only those not covered by row i
  m = rbind(m[1:i,], tail(m, -i)[!is.subsequence(m[i,], tail(m, -i)),])

  i = i+1
}

m
#      y z  w  x
#[1,]  1 1 NA NA
#[2,] NA 1  1  1

And you can go back to a list if you like:

apply(m, 1, na.omit)
answered Nov 01 '22 by eddi

If your data is potentially sparse, and m is not very small (say m = 5 or more), then there can be many, many combinations of values that do not appear in the data, and they may be so redundant that the "abbreviated" set of absent value combinations is noticeably smaller than the full set. In this case, from a general programming perspective, you are better off reorganizing the computation so that you recursively build up the set of all combinations of up to m values that do not appear in the data (depth-first recursion), and when you find a new absent value tuple you record it and do not recurse any further. This prunes the supersets of each absent tuple from the search, avoiding most of the redundancy in the output and saving the time that would be spent exploring those redundant tuples. Sets and/or hash tables allow you to check in constant time whether a particular value combination is present in your data.

Of course, recursive function calls are much slower in an interpreted language like R than in, say, C/C++. I haven't seen any R package that does essentially what you want automatically, so my personal advice, if you want a universally efficient solution, is to turn to a language like C/C++; you can always use an R/C++ integration framework (e.g. Rcpp) so that your C/C++ function is callable from R.
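
For concreteness, here is a rough sketch of the recursive idea in plain R (the names find_absent, occurs and is_minimal are made up for illustration; this is a sketch, not a tuned implementation). Since pruning a branch does not catch a tuple covered by an absent tuple found on a different branch, the sketch also checks, before recording a tuple, that every sub-tuple obtained by dropping one entry still occurs in the data; a simple any() scan stands in for the constant-time set/hash lookup described above:

# Sketch only: depth-first search over value tuples of up to m columns,
# pruning the search below each absent tuple.
find_absent <- function(d, m) {
  found <- list()

  # TRUE iff some row of d matches the partial assignment cols == vals
  occurs <- function(cols, vals) {
    any(Reduce(`&`, Map(function(cc, vv) d[[cc]] == vv, cols, vals)))
  }

  # an absent tuple is minimal (non-redundant) iff every tuple obtained
  # by dropping one of its entries still occurs in the data
  is_minimal <- function(cols, vals) {
    length(cols) == 1 ||
      all(vapply(seq_along(cols),
                 function(i) occurs(cols[-i], vals[-i]), logical(1)))
  }

  recurse <- function(cols, vals, start) {
    if (start > ncol(d)) return(invisible(NULL))
    for (j in start:ncol(d)) {
      for (v in levels(d[[j]])) {
        cur_cols <- c(cols, j)
        cur_vals <- c(vals, v)
        if (!occurs(cur_cols, cur_vals)) {
          # absent: record it if minimal, and do not recurse below it
          if (is_minimal(cur_cols, cur_vals))
            found[[length(found) + 1]] <<-
              setNames(cur_vals, names(d)[cur_cols])
        } else if (length(cur_cols) < m) {
          recurse(cur_cols, cur_vals, j + 1)
        }
      }
    }
  }

  recurse(integer(0), character(0), 1)
  found
}

find_absent(d, 3)
# should yield (w,x,z) == (1,1,1) and (y,z) == (1,1) for the example data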

answered Nov 01 '22 by user2566092

Here's one approach based on figuring out the total set of values, and then removing the ones which are already present in the data. It obviously requires that the total set of possibilities isn't too big.

d <- expand.grid(rep(list(factor(1:2)), 4))
names(d) <- c("w", "x", "y", "z")

# Remove 5 combinations randomly
d_miss <- d[-sample(nrow(d), 5), ]

# To find which ones are missing, build up a complete list
# (this will be the same as d in this case, but obviously
# you don't normally have d)
vals <- lapply(d_miss, unique)
all_combs <- expand.grid(vals)

# Now collapse each combination to a single integer id, then
# figure out which ones are missing. There are lots of ways
# of doing this; this is the approach plyr uses:
# (you could also use interaction(), or paste the values together)
all <- plyr::id(all_combs)
some <- plyr::id(d_miss)

# Here are the missing combinations (match ids positionally: an id
# value is not in general the same as a row number, so indexing
# all_combs by setdiff(all, some) would pick out the wrong rows)
all_combs[!(all %in% some), ]
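
The code above finds complete missing rows. As a hypothetical extension (my sketch, not part of the original answer), the same id() trick can be applied per column subset, reusing the subset list cc built in the question, to list the absent tuples of up to m attributes; note that redundant (covered) tuples are not removed here:

missing_by_subset <- lapply(cc, function(cols) {
  # enumerate all observed-level combinations for this column subset
  combs <- expand.grid(lapply(d_miss[cols], unique))
  # keep the combinations whose id never occurs in the data
  combs[!(plyr::id(combs) %in% plyr::id(d_miss[cols])), , drop = FALSE]
})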
answered Nov 01 '22 by hadley