Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Convert factor combinations to wide format table of presence/absence in R

Tags:

r

combinations

Firstly I'm pretty sure this has been answered before but the search terms seem difficult to hit, apologies if there is a duplicate out there.

Say I have a vector of factors:

all <- factor(letters)

And I've gone on to use all combinations of those factor levels as part of a modelling pipeline:

combos <- t(combn(as.character(all), 5))
head(combos)
#     [,1] [,2] [,3] [,4] [,5]
# [1,] "a"  "b"  "c"  "d"  "e" 
# [2,] "a"  "b"  "c"  "d"  "f" 
# [3,] "a"  "b"  "c"  "d"  "g" 
# ...

My question is: How can I convert this second matrix to one showing presence/absence of all levels, like:

      a   b   c   d   e   f   g  ...
[1,]  1   1   1   1   1   0   0  ...
[2,]  1   1   1   1   0   1   0  ...
[3,]  1   1   1   1   0   0   1  ...
...

In terms of what I've tried, my first thought was a row-wise application of ifelse using apply, but I haven't been able to put anything workable together. Any smart way of doing this?

like image 446
blmoore Avatar asked Sep 30 '13 12:09

blmoore


3 Answers

Update: An even better solution

You can use matrix indexing to get even better speeds. Here is a much improved solution that does not use a for loop.

all <- factor(letters)
combos <- t(combn(as.character(all), 5))
A <- match(c(t(combos)), letters)
B <- 0:(length(A)-1) %/% 5 + 1
a <- unique(as.vector(combos))
x <- matrix(0, ncol = length(a), nrow = nrow(combos), 
            dimnames = list(NULL, a))
x[cbind(B, A)] <- 1L

Benchmarks

orig <- function() {
  a <- unique(as.vector(combos))
  x <- matrix(0, ncol = length(a), nrow = nrow(combos), 
              dimnames = list(NULL, a))
  for (i in 1:nrow(combos)) {
    x[i, combos[i, ]] <- 1
  }
  x
}

new <- function() {
  A <- match(c(t(combos)), letters)
  B <- 0:(length(A)-1) %/% 5 + 1
  a <- unique(as.vector(combos))
  x <- matrix(0, ncol = length(a), nrow = nrow(combos), 
              dimnames = list(NULL, a))
  x[cbind(B, A)] <- 1L
  x
}

identical(orig(), new())
# [1] TRUE

library(microbenchmark)
microbenchmark(orig(), new(), times = 20)
# Unit: milliseconds
#    expr       min        lq    median       uq      max neval
#  orig() 476.85206 486.11091 497.48429 512.4333 579.2695    20
#   new()  87.02026  91.17021  96.88463 111.6414 175.6339    20

Original answer

In a problem like this, a for loop would work just fine and can be easily preallocated:

a <- unique(as.vector(combos))
x <- matrix(0, ncol = length(a), nrow = nrow(combos), 
            dimnames = list(NULL, a))

for (i in 1:nrow(combos)) {
  x[i, combos[i, ]] <- 1
}

head(x)
#      a b c d e f g h i j k l m n o p q r s t u v w x y z
# [1,] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [2,] 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [3,] 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [4,] 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [5,] 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [6,] 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
like image 178
A5C1D2H2I1M1N2O1R2T1 Avatar answered Sep 30 '22 09:09

A5C1D2H2I1M1N2O1R2T1


Here's my attempt:

combos.out <- t(apply(combos, 1, function(x) table(factor(x, levels = letters))))
head(combos.out)
#      a b c d e f g h i j k l m n o p q r s t u v w x y z
# [1,] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [2,] 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [3,] 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [4,] 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [5,] 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [6,] 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

WRT @Ananda Mahto's comment, the manipulation through transformation and factorisation definitely slows things down - a quick and dirty benchmark:

#Unit: milliseconds
#             expr        min         lq     median         uq        max neval
#   forfun(combos)   416.6027   534.6973   652.7919   718.4231   784.0544     3
# applyfun(combos) 13892.7020 15755.8570 17619.0121 22559.8271 27500.6421     3

Score one for the for loop!

like image 27
alexwhan Avatar answered Sep 30 '22 10:09

alexwhan


A simple, and pretty efficient solution:

t(apply(combos,1,function(x){all %in% x}))*1

The for loop solution by Ananda Mahto is still about twice as fast:

      min       lq  median       uq      max neval
 561.2153 638.4648 643.439 650.7053 1199.857   100

versus

      min       lq   median       uq      max neval
 295.8798 305.0586 311.9961 370.6028 406.9336   100
like image 21
mrip Avatar answered Sep 30 '22 10:09

mrip