Is it possible to do the equivalent of merge(..., all = TRUE) with the data.table syntax (like X[Y])?
Specifically, I would need a very fast way of getting the result of:
library(data.table)
item_length = data.table(index = 1:10, length = c(2,5,4,6,3), key = "index")
item_weigth = data.table(index = c(2,4,6,7,8,11), weight = c(.3,.5,.2), key = "index")
merge(item_length, item_weigth, all = TRUE)
Which gives:
> merge(item_length, item_weigth, all = TRUE)
      index length weight
 [1,]     1      2     NA
 [2,]     2      5    0.3
 [3,]     3      4     NA
 [4,]     4      6    0.5
 [5,]     5      3     NA
 [6,]     6      2    0.2
 [7,]     7      5    0.3
 [8,]     8      4    0.5
 [9,]     9      6     NA
[10,]    10      3     NA
[11,]    11     NA    0.2
Sorry for answering my own question, but I think this is worth sharing:
A very fast solution seems to be to update to the latest version of data.table (1.8.0) and simply call merge(x, y, all = TRUE) on the keyed tables. (Thank you so much, Matthew!)
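Since the speed-up depends on the installed version, it may be worth checking it first. This is only a small sketch: the 1.8.0 threshold is the version mentioned above, and the variable name has_fast_merge is made up for illustration.

library(data.table)

# Compare the installed data.table version against 1.8.0 (the version cited above).
has_fast_merge <- packageVersion("data.table") >= "1.8.0"

if (!has_fast_merge) {
  message("data.table is older than 1.8.0; consider update.packages('data.table')")
}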
Here is my test data and benchmark results:
With data.table:
library(data.table)

full_index <- 1:5000000
ratio_in_samples <- 0.8   # each table holds 80% of the index values
x <- data.table(index = sample(full_index, length(full_index)*ratio_in_samples),
                var1 = rnorm(length(full_index)*ratio_in_samples),
                key = "index")
y <- data.table(index = sample(full_index, length(full_index)*ratio_in_samples),
                var2 = rnorm(length(full_index)*ratio_in_samples),
                key = "index")
# Full outer join on the shared key column "index"
system.time(
  result <- merge(x, y, all = TRUE)
)
Time with data.table:
   user  system elapsed
   5.05    0.55    5.62
Whereas with data.frame:
full_index <- 1:5000000
ratio_in_samples <- 0.8
x <- data.frame(index = sample(full_index, length(full_index)*ratio_in_samples),
                var1 = rnorm(length(full_index)*ratio_in_samples))
y <- data.frame(index = sample(full_index, length(full_index)*ratio_in_samples),
                var2 = rnorm(length(full_index)*ratio_in_samples))
system.time(
  result <- merge(x, y, all = TRUE)
)
Time with data.frame:
   user  system elapsed
  78.83    1.75   80.67
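For completeness, the X[Y] form asked about in the question can also produce a full outer join if both tables are joined against the union of their key values. The following is only a sketch, not something I benchmarked: it assumes unique key values in each table, integer keys on both sides, and a data.table version recent enough to support the on= argument (newer than the 1.8.0 used above). The names all_index, full and reference are made up for illustration.

library(data.table)

item_length <- data.table(index = 1:10,
                          length = rep(c(2, 5, 4, 6, 3), 2),
                          key = "index")
item_weigth <- data.table(index = c(2L, 4L, 6L, 7L, 8L, 11L),
                          weight = rep(c(.3, .5, .2), 2),
                          key = "index")

# Every key value that appears in either table.
all_index <- data.table(index = sort(union(item_length$index, item_weigth$index)))

# X[Y] keeps every row of Y, so joining each table against the full index
# yields one row per key value, with NA where a table has no match.
full <- item_weigth[item_length[all_index, on = "index"], on = "index"]
setcolorder(full, c("index", "length", "weight"))

# Compare against merge(); check.attributes = FALSE ignores the key attribute.
reference <- merge(item_length, item_weigth, all = TRUE)
print(all.equal(full, reference, check.attributes = FALSE))

Whether this is any faster than merge(x, y, all = TRUE) would have to be measured; on the benchmark above, merge() was already fast enough for me.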