Consider I have a data frame like this,
set.seed(1)
q<-100
df <- data.frame(Var1 = round(runif(q,1,50)),
Var2 = round(runif(q,1,50)),
Var3 = round(runif(q,1,50)),
Var4 = round(runif(q,1,50)))
attach(df)
As you realized, q
is standing for setting the length of the each columns in the dataframe.
I want to make a filtering of all possible combinations of the columns. It can be anything. Let's say I am seeking for if the devision of the sums of the first two columns and the sums of the last two columns greater than 1 or not.
One thing to achieve that, using expand.grid()
function.
a <- Sys.time()
expanded <- expand.grid(Var1, Var2, Var3, Var4)
Sys.time() - a
Time difference of 8.31997 secs
expanded <- expanded[rowSums(expanded[,1:2])/ rowSums(expanded[,3:4])>1,]
However it takes a lot time! To make it faster, I tried to follow the answer with rep.int()
function in this question and designed my own function.
myexpand <- function(...) {
sapply(list(...),function(y) rep.int(y, prod(lengths(list(...)))/length(y)))
}
But it is not so promising again. It takes more time comparing to my expectation and the expand.grid
also.And, If I set a greater q
, it becomes a nigthmare!
Is there a proper way to achieve this a lot faster (1-2 seconds) with maybe matrix operations before applying expand.grid
or myexpand
. And, I wonder if it is a weakness of using an interpreted language like R... Software suggestions are also acceptable.
For this particular condition (i.e. ratio of sums > 1), you might want to consider using the data.table
package:
system.time({
#generate permutations of Var1 & Var2 and Var3 & Var4
DT12 <- DT[, CJ(Var1=Var1, Var2=Var2, unique=TRUE)][, s12 := Var1 + Var2]
DT34 <- DT[, CJ(Var3=Var3, Var4=Var4, unique=TRUE)][, s34 := Var3 + Var4]
#perform a non-equi join
DT12[DT34, on=.(s12>s34), allow.cartesian=TRUE,
.(Var1=x.Var1, Var2=x.Var2, Var3=i.Var3, Var4=i.Var4)][, s12:=NULL]
})
timing:
user system elapsed
0.02 0.06 0.08
output:
Var1 Var2 Var3 Var4
1: 2 5 2 4
2: 4 3 2 4
3: 5 2 2 4
4: 2 6 2 4
5: 4 4 2 4
---
1753416: 50 49 49 48
1753417: 50 50 49 48
1753418: 50 49 49 49
1753419: 50 50 49 49
1753420: 50 50 49 50
data:
library(data.table)
set.seed(1)
q <- 100
DT <- data.table(Var1 = round(runif(q,1,50)),
Var2 = round(runif(q,1,50)),
Var3 = round(runif(q,1,50)),
Var4 = round(runif(q,1,50)))
edit: For summing of positive numbers, you can prob use the following (caveat: it will not be faster than using a Rcpp approach).
system.time({
S <- DT[, .(UB=90 - Var1, C1=Var1)]
for (k in 2:4) {
S <- DT[S, on=paste0("Var", k, "<UB"), allow.cartesian=TRUE,
mget(c(names(S), paste0("x.Var", k)))]
setnames(S, paste0("x.Var", k), paste0("C", k))
S[, UB := UB - get(paste0("C",k))]
}
S[, UB := NULL][rowSums(S)>30L]
})
timing:
user system elapsed
3.48 4.06 3.51
output, S
:
> S
C1 C2 C3 C4
1: 14 33 14 6
2: 14 33 14 25
3: 14 33 14 24
4: 14 33 14 19
5: 14 33 14 10
---
34914725: 31 39 3 8
34914726: 31 39 3 8
34914727: 31 39 3 9
34914728: 31 39 3 16
34914729: 31 39 3 8
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With