I need to work out a 2889*2889 correlation matrix. The problem is that building an intermediate data.table (RESULT) takes a long time to bind together, so I would like to be able to do the following while the last line of the code below, RESULT=rbindlist(apply(COMB, 1, append)), is running: see a progress bar, get a time estimate, and pause the call.
Here is the code:
SOURCE=data.table(NAME=rep(paste0("NAME", as.character(1:2889)), each=600), VALUE=sample(c(TRUE,FALSE), 600, TRUE) )
> SOURCE
NAME VALUE
1: NAME1 TRUE
2: NAME1 TRUE
3: NAME1 TRUE
4: NAME1 TRUE
5: NAME1 TRUE
---
1733396: NAME999 TRUE
1733397: NAME999 TRUE
1733398: NAME999 TRUE
1733399: NAME999 TRUE
1733400: NAME999 FALSE
setkey(SOURCE,NAME)
a=SOURCE[,unique(NAME)]
COMB=data.table(expand.grid(a,a, stringsAsFactors=FALSE))
> COMB
Var1 Var2
1: NAME1 NAME1
2: NAME10 NAME1
3: NAME100 NAME1
4: NAME1000 NAME1
5: NAME1001 NAME1
---
8346317: NAME995 NAME999
8346318: NAME996 NAME999
8346319: NAME997 NAME999
8346320: NAME998 NAME999
8346321: NAME999 NAME999
append <- function(X) {
data.table(NAME1=X[1], VALUE1=SOURCE[X[1], VALUE],
NAME2=X[2], VALUE2=SOURCE[X[2], VALUE] )
}
RESULT=rbindlist(apply(COMB, 1, append))
Any ideas?
Also, do you know if there is a faster way to generate the data.table RESULT from SOURCE? RESULT is an intermediate data.table used to work out the correlation values between VALUE1 and VALUE2 for each pair of NAME.
With a subset of SOURCE, RESULT looks like this:
SOURCE=SOURCE[sample(1:nrow(SOURCE), 3)]
setkey(SOURCE,NAME)
a=SOURCE[,unique(NAME)]
COMB=data.table(expand.grid(a,a, stringsAsFactors=FALSE))
RESULT=rbindlist(apply(COMB, 1, append))
> RESULT
NAME1 VALUE1 NAME2 VALUE2
1: NAME1859 TRUE NAME1859 TRUE
2: NAME768 FALSE NAME1859 TRUE
3: NAME795 TRUE NAME1859 TRUE
4: NAME1859 TRUE NAME768 FALSE
5: NAME768 FALSE NAME768 FALSE
6: NAME795 TRUE NAME768 FALSE
7: NAME1859 TRUE NAME795 TRUE
8: NAME768 FALSE NAME795 TRUE
9: NAME795 TRUE NAME795 TRUE
Later on I'm going to do RESULT[,VALUE3:=(VALUE1==VALUE2)] to finally get the correlation values: RESULT[, mean(VALUE3), by=c("NAME1", "NAME2")]
So maybe the whole process can be done more efficiently, who knows.
You can use the pbapply package (git), which adds a time estimate and a progress bar to any function in the '*apply' family.
In the case of your question:
library(pbapply)
library(data.table)
result <- data.table::rbindlist( pbapply(COMB, 1, append) )
P.S. This answer addresses your first two points. Regarding the third point, I'm not sure whether it's possible to pause the function. In any case, your operation is indeed taking too long, so I would recommend posting a separate question asking how to optimize your task.
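To try the progress bar before committing to the full 2889-name run, here is a minimal sketch on toy data. The sizes (5 names, 20 values each) and the helper name pair_rows are assumptions made for illustration; the helper is the question's append function, renamed only so that it does not mask base::append().

```r
library(pbapply)
library(data.table)

set.seed(42)
# Toy stand-in for SOURCE: 5 names x 20 logical values (assumed sizes).
SOURCE <- data.table(NAME  = rep(paste0("NAME", 1:5), each = 20),
                     VALUE = sample(c(TRUE, FALSE), 100, replace = TRUE))
setkey(SOURCE, NAME)

a <- SOURCE[, unique(NAME)]
COMB <- data.table(expand.grid(a, a, stringsAsFactors = FALSE))

# Same body as the question's append(); the name is hypothetical.
pair_rows <- function(X) {
  data.table(NAME1 = X[1], VALUE1 = SOURCE[X[1], VALUE],
             NAME2 = X[2], VALUE2 = SOURCE[X[2], VALUE])
}

# pbapply() is a drop-in replacement for apply() that prints a progress bar.
RESULT <- rbindlist(pbapply(COMB, 1, pair_rows))
```

Timing this on a few dozen names should give a rough idea of how the full run scales, since the number of pairs grows with the square of the number of names.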
You can use txtProgressBar from the utils package:
total <- 50
pb <- txtProgressBar(min = 0, max = total, style = 3)
lapply(1:total, function(i){
Sys.sleep(0.1)
setTxtProgressBar(pb, i)
})
close(pb)
Or use the *ply family from the plyr package:
library(plyr)
laply(1:100, function(i) {Sys.sleep(0.05); i}, .progress = "text")
Check ?create_progress_bar for more details.
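Applied to the question, the same txtProgressBar pattern could wrap the loop over name pairs. The following is only a sketch on toy data using base R; the sizes and the names src, vals, pairs and AGREE are assumptions for illustration. It computes the per-pair agreement directly instead of building the intermediate RESULT table.

```r
set.seed(1)
# Toy stand-in for SOURCE: 5 names with 20 logical values each (assumed sizes).
src <- data.frame(NAME  = rep(paste0("NAME", 1:5), each = 20),
                  VALUE = sample(c(TRUE, FALSE), 100, replace = TRUE))

# One logical vector per NAME, then every ordered pair of names.
vals  <- split(src$VALUE, src$NAME)
pairs <- expand.grid(Var1 = names(vals), Var2 = names(vals),
                     stringsAsFactors = FALSE)

pb <- txtProgressBar(min = 0, max = nrow(pairs), style = 3)
pairs$AGREE <- vapply(seq_len(nrow(pairs)), function(i) {
  setTxtProgressBar(pb, i)  # advance the bar once per pair
  mean(vals[[pairs$Var1[i]]] == vals[[pairs$Var2[i]]])
}, numeric(1))
close(pb)
```

Updating the bar on every iteration adds some overhead of its own; with millions of pairs it may be worth updating only every few thousand iterations.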
Try this instead:
setkey(SOURCE, NAME)
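# Name the CJ() columns explicitly so that by = .(V1, V2) does not depend on CJ()'s default column names
SOURCE[, CJ(V1 = NAME, V2 = NAME, unique = TRUE)][
, mean(SOURCE[V1, VALUE] == SOURCE[V2, VALUE]), by = .(V1, V2)]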
Fwiw, the all-caps names are an awful choice imo - makes writing and reading code significantly harder.
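For the measure described in the question (the mean of VALUE1 == VALUE2 per pair), the pairwise loop can probably be avoided altogether with matrix algebra. A minimal sketch, assuming every NAME has the same number of rows and that rows line up position by position across names, as in the simulated SOURCE; the toy sizes and the names src, m and agree are illustrative only:

```r
library(data.table)

set.seed(1)
n_names <- 4L  # toy size; the question has 2889
n_obs   <- 6L  # toy size; the question has 600
src <- data.table(
  NAME  = rep(paste0("NAME", seq_len(n_names)), each = n_obs),
  VALUE = sample(c(TRUE, FALSE), n_names * n_obs, replace = TRUE)
)

# One column per NAME, one row per observation.
# matrix() fills column by column, which matches the rep(..., each = n_obs) layout.
nm <- unique(src$NAME)
m  <- matrix(as.numeric(src$VALUE), nrow = n_obs, dimnames = list(NULL, nm))

# Share of positions where two names agree:
# (count of both TRUE + count of both FALSE) / number of observations.
agree <- (crossprod(m) + crossprod(1 - m)) / n_obs
```

Here agree["NAME1", "NAME2"] equals the mean of VALUE1 == VALUE2 for that pair. If a Pearson correlation is what is actually wanted, cor(m) on the same matrix gives it in one call (columns with no variation produce NA).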