Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Join data.table by sampling

Tags:

r

data.table

I have a few large data-sets that I'm trying to combine. I have created a toy example of what I want to do. I have three tables:

require(data.table)
set.seed(151)
x <- data.table(a=1:100000)
y <- data.table(b=letters[1:20],c=sample(LETTERS[1:4]))
proportion <- data.table(expand.grid(a=1:100000,c=LETTERS[1:4]))
proportion[,prop:=rgamma(4,shape = 1),by=a]
proportion[,prop:=prop/sum(prop),by=a]

The three tables are x, y, and proportion. For each element in x I want to sample from the entire table y using the probabilities from the table proportion and combine them into another table. The method that I came up with is:

temp <- setkey(setkey(x[,c(k=1,.SD)],k)[y[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL],a,c)
temp <- temp[setkey(proportion,a,c)][,prop:=prop/.N,by=.(a,c)] # Uniform distribution within the same 'c' column group
chosen_pairs <- temp[,.SD[sample(.N,5,replace=FALSE,prob = prop)],by=a]

But this method is memory intensive and slow as it cross-joins the two table first and then sample from it. Is there a way to perform this task in an efficient (memory and time) way?

like image 767
A Gore Avatar asked May 19 '17 20:05

A Gore


People also ask

What is a rolling join?

Rolling join, known also as last observation carried forward (LOCF), is an inequality join of two tables.

How do I merge two data tables in R?

To join two data frames (datasets) vertically, use the rbind function. The two data frames must have the same variables, but they do not have to be in the same order. If data frameA has variables that data frameB does not, then either: Delete the extra variables in data frameA or.

When I is a data table or character vector the columns to join by must be specified using?

table (or character vector), the columns to join by must be specified using 'on=' argument (see ? data. table), by keying x (i.e. sorted, and, marked as sorted, see ? setkey), or by sharing column names between x and i (i.e., a natural join).


1 Answers

I faced somewhat similar problem in this question. I wrapped your solution into function for better comparison:

goreF <- function(x,y,proportion){
  temp <- setkey(setkey(x[, c(k = 1, .SD)], k)[y[,c(k = 1, .SD)],
                                    allow.cartesian = TRUE][, k := NULL],
           a, c)
  temp <- temp[setkey(proportion, a, c)][, prop := prop / .N, by = .(a, c)]
  chosen_pairs <- temp[, .SD[sample(.N, 5, replace = FALSE, prob = prop)],
                   by = a]
  chosen_pairs
}

My approach:

myFunction <- function(x, y, proportion){
  temp <- setkey(setkey(x[, c(k = 1, .SD)], k)[y[,c(k = 1, .SD)],
                                           allow.cartesian = TRUE][, k := NULL],
             a, c)
  temp <- temp[setkey(proportion, a, c)][, prop := prop / .N, by = .(a, c)]
  chosen_pairs <- temp[, sample(.I, 5, replace = FALSE, prob = prop), by = a]
  indexes <- chosen_pairs[[2]]
  temp[indexes]
}

require(rbenchmark)
benchmark(myFunction(x, y, proportion), goreF(x, y, proportion),
      replications = 1,
      columns = c("test", "replications", "elapsed", "relative",
                  "user.self", "sys.self"))
                          test replications elapsed relative user.self sys.self
2      goreF(x, y, proportion)            1   19.83   21.323     19.35     0.13
1 myFunction(x, y, proportion)            1    0.93    1.000      0.86     0.08

Perhaps there can be found more improvements, I will update, if found any. First two operations seems too complicated, maybe they can be shortened, but, as I did not see that they impact calculation timings, I did not rewrite them.

Update:

As pointed out in question I mentioned in the beginning, you could get into trouble with myFunction, if your groups would contain only one element. So i modified it, based on comments from that post.

myFunction2 <- function(x, y, proportion){
  temp <- setkey(setkey(x[, c(k = 1, .SD)], k)[y[,c(k = 1, .SD)],
                                               allow.cartesian = TRUE][, k := NULL],
                 a, c)
  temp <- temp[setkey(proportion, a, c)][, prop := prop / .N, by = .(a, c)]
  indexes <- temp[, .I[sample(.N, 5, replace = T, prob = prop)], by = a]
  indexes <- indexes[[2]]
  temp[indexes]
}

benchmark(myFunction(x, y, proportion), myFunction2(x, y, proportion),
          replications = 5,
          columns = c("test", "replications", "elapsed", "relative",
                      "user.self", "sys.self"))

                           test replications elapsed relative user.self sys.self
1  myFunction(x, y, proportion)            5    6.61    1.064      6.23     0.36
2 myFunction2(x, y, proportion)            5    6.21    1.000      5.71     0.26

We can see marginal speed improvement.

like image 105
minem Avatar answered Nov 10 '22 10:11

minem