Working in R, I have data of a similar structure to below (code block 1). And I'm looking to create a new data.frame with the following characteristics:
For each unique ID_1 value, I'd like to have two new columns, one containing a list of (ID_2s that share ID_1 & Direction==1) and the other column containing a list of (ID_2s that share ID_1 & Direction==0), (see next code block 2)
Dataset Block 1 (initial):
ID_1 ID_2 Direction
100001 1 1
100001 11 1
100001 111 1
100001 1111 0
100001 11111 0
100001 111111 0
100002 2 1
100002 22 1
100002 222 0
100002 2222 0
100003 3 1
100003 33 1
100003 333 1
100003 3333 0
100003 33333 0
100003 333333 1
100004 4 1
100004 44 1
Converted into:
Dataset Block 2 (desired output):
ID_1 ID_2_D1 ID_2_D0
100001 1,11,111 1111,11111,111111
100002 2,22 222,222
100003 3,33,333,333333 3333,33333
100004 4,44
I have code that does this, (taking loops of subset of subsets), but I am running this over many millions of unique "ID_1"s, making this very time consuming (hours, I tell ya!!).
Any advice - perhaps using apply() or the plyr() package that might get this to run faster?
Code for reference:
DF <- data.frame(ID_1=c(100001,100001,100001,100001,100001,100001,100002,100002,100002,100002,100003,100003,100003,100003,100003,100003,100004,100004)
,ID_2=c(1,11,111,1111,11111,111111,2,22,222,2222,3,33,333,3333,33333,333333,4,44)
,Direction=c(1,1,1,0,0,0,1,1,0,0,1,1,1,0,0,1,1,1)
)
My current (too slow) code:
DF2 <- data.frame( ID_1=DF[!duplicated(DF$ID_1),][,1])
for (i in 1:length(unique(DF2$ID_1))){
DF2$ID_2_D1[i] <- list(subset(DF,ID_1==unique(DF2$ID_1)[i] & Direction==1)$ID_2)
DF2$ID_2_D0[i] <- list(subset(DF,ID_1==unique(DF2$ID_1)[i] & Direction==0)$ID_2)
}
Like this:
library(reshape2)
dcast(DF, ID_1 ~ Direction, value.var = "ID_2", list)
# ID_1 0 1
# 1 100001 1111, 11111, 111111 1, 11, 111
# 2 100002 222, 2222 2, 22
# 3 100003 3333, 33333 3, 33, 333, 333333
# 4 100004 4, 44
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With