Firstly I'm pretty sure this has been answered before but the search terms seem difficult to hit, apologies if there is a duplicate out there.
Say I have a vector of factors:
all <- factor(letters)
And I've gone on to use all combinations of those factor levels as part of a modelling pipeline:
combos <- t(combn(as.character(all), 5))
head(combos)
# [,1] [,2] [,3] [,4] [,5]
# [1,] "a" "b" "c" "d" "e"
# [2,] "a" "b" "c" "d" "f"
# [3,] "a" "b" "c" "d" "g"
# ...
My question is: How can I convert this second matrix to one showing presence/absence of all levels, like:
a b c d e f g ...
[1,] 1 1 1 1 1 0 0 ...
[2,] 1 1 1 1 0 1 0 ...
[3,] 1 1 1 1 0 0 1 ...
...
In terms of what I've tried, my first thought was a row-wise application of ifelse
using apply
, but I haven't been able to put anything workable together. Any smart way of doing this?
You can use matrix indexing to get even better speeds. Here is a much improved solution that does not use a for loop.
all <- factor(letters)
combos <- t(combn(as.character(all), 5))
A <- match(c(t(combos)), letters)
B <- 0:(length(A)-1) %/% 5 + 1
a <- unique(as.vector(combos))
x <- matrix(0, ncol = length(a), nrow = nrow(combos),
dimnames = list(NULL, a))
x[cbind(B, A)] <- 1L
orig <- function() {
a <- unique(as.vector(combos))
x <- matrix(0, ncol = length(a), nrow = nrow(combos),
dimnames = list(NULL, a))
for (i in 1:nrow(combos)) {
x[i, combos[i, ]] <- 1
}
x
}
new <- function() {
A <- match(c(t(combos)), letters)
B <- 0:(length(A)-1) %/% 5 + 1
a <- unique(as.vector(combos))
x <- matrix(0, ncol = length(a), nrow = nrow(combos),
dimnames = list(NULL, a))
x[cbind(B, A)] <- 1L
x
}
identical(orig(), new())
# [1] TRUE
library(microbenchmark)
microbenchmark(orig(), new(), times = 20)
# Unit: milliseconds
# expr min lq median uq max neval
# orig() 476.85206 486.11091 497.48429 512.4333 579.2695 20
# new() 87.02026 91.17021 96.88463 111.6414 175.6339 20
In a problem like this, a for
loop would work just fine and can be easily preallocated:
a <- unique(as.vector(combos))
x <- matrix(0, ncol = length(a), nrow = nrow(combos),
dimnames = list(NULL, a))
for (i in 1:nrow(combos)) {
x[i, combos[i, ]] <- 1
}
head(x)
# a b c d e f g h i j k l m n o p q r s t u v w x y z
# [1,] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [2,] 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [3,] 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [4,] 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [5,] 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [6,] 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Here's my attempt:
combos.out <- t(apply(combos, 1, function(x) table(factor(x, levels = letters))))
head(combos.out)
# a b c d e f g h i j k l m n o p q r s t u v w x y z
# [1,] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [2,] 1 1 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [3,] 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [4,] 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [5,] 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# [6,] 1 1 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
WRT @Ananda Mahto's comment, the manipulation through transformation and factorisation definitely slows things down - a quick and dirty benchmark:
#Unit: milliseconds
# expr min lq median uq max neval
# forfun(combos) 416.6027 534.6973 652.7919 718.4231 784.0544 3
# applyfun(combos) 13892.7020 15755.8570 17619.0121 22559.8271 27500.6421 3
Score one for the for
loop!
A simple, and pretty efficient solution:
t(apply(combos,1,function(x){all %in% x}))*1
The for loop solution by Ananda Mahto is still about twice as fast:
min lq median uq max neval
561.2153 638.4648 643.439 650.7053 1199.857 100
versus
min lq median uq max neval
295.8798 305.0586 311.9961 370.6028 406.9336 100
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With