Join data.table by sampling

Tags:

data.table

I have a few large data-sets that I'm trying to combine. I have created a toy example of what I want to do. I have three tables:

require(data.table)
set.seed(151)
x <- data.table(a=1:100000)
y <- data.table(b=letters[1:20],c=sample(LETTERS[1:4]))
proportion <- data.table(expand.grid(a=1:100000,c=LETTERS[1:4]))
proportion[,prop:=rgamma(4,shape = 1),by=a]
proportion[,prop:=prop/sum(prop),by=a]

The three tables are x, y, and proportion. For each element in x I want to sample from the entire table y using the probabilities from the table proportion and combine them into another table. The method that I came up with is:

temp <- setkey(setkey(x[,c(k=1,.SD)],k)[y[,c(k=1,.SD)],allow.cartesian=TRUE][,k:=NULL],a,c)
temp <- temp[setkey(proportion,a,c)][,prop:=prop/.N,by=.(a,c)] # Uniform distribution within the same 'c' column group
chosen_pairs <- temp[,.SD[sample(.N,5,replace=FALSE,prob = prop)],by=a]

But this method is memory intensive and slow, because it cross-joins the two tables first and only then samples from the result. Is there a way to perform this task efficiently in terms of both memory and time?


asked May 19 '17 20:05

A Gore

1 Answer

I faced a somewhat similar problem in this question. I wrapped your solution in a function for easier comparison:

goreF <- function(x,y,proportion){
  temp <- setkey(setkey(x[, c(k = 1, .SD)], k)[y[,c(k = 1, .SD)],
                                    allow.cartesian = TRUE][, k := NULL],
           a, c)
  temp <- temp[setkey(proportion, a, c)][, prop := prop / .N, by = .(a, c)]
  chosen_pairs <- temp[, .SD[sample(.N, 5, replace = FALSE, prob = prop)],
                   by = a]
  chosen_pairs
}

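My approach: instead of subsetting .SD inside every group, sample the row numbers (.I) per group and subset temp once at the end: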

myFunction <- function(x, y, proportion){
  temp <- setkey(setkey(x[, c(k = 1, .SD)], k)[y[,c(k = 1, .SD)],
                                           allow.cartesian = TRUE][, k := NULL],
             a, c)
  temp <- temp[setkey(proportion, a, c)][, prop := prop / .N, by = .(a, c)]
  chosen_pairs <- temp[, sample(.I, 5, replace = FALSE, prob = prop), by = a]
  indexes <- chosen_pairs[[2]]
  temp[indexes]
}

require(rbenchmark)
benchmark(myFunction(x, y, proportion), goreF(x, y, proportion),
      replications = 1,
      columns = c("test", "replications", "elapsed", "relative",
                  "user.self", "sys.self"))
                          test replications elapsed relative user.self sys.self
2      goreF(x, y, proportion)            1   19.83   21.323     19.35     0.13
1 myFunction(x, y, proportion)            1    0.93    1.000      0.86     0.08
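The difference between the two patterns is easier to see on a small table. The following is only a sketch on a made-up table `dt` (the names `dt`, `g`, `v`, `slow`, `fast` and `idx` are not from the question). It shows that both forms return the same number of rows per group, while the second one touches the full table only once:

```r
library(data.table)
set.seed(1)

# Hypothetical toy table: 3 groups with 4 rows each
dt <- data.table(g = rep(1:3, each = 4), v = 1:12)

# Pattern used in goreF: subset .SD inside every group
slow <- dt[, .SD[sample(.N, 2)], by = g]

# Pattern used in myFunction: collect row numbers of dt per group,
# then subset dt a single time
idx  <- dt[, .I[sample(.N, 2)], by = g]$V1
fast <- dt[idx]
```

Here `idx` holds positions in `dt` itself, which is why the final `dt[idx]` needs no grouping.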

There may be further improvements; I will update the answer if I find any. The first two operations look more complicated than necessary and could probably be shortened, but since they did not appear to affect the timings, I did not rewrite them.
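One possible way to shorten the cross-join step is `CJ()` followed by an update join. This is only a sketch on tiny stand-in tables, and it assumes that `b` is unique in `y` (as it is in the question's toy data). It has not been timed against the original:

```r
library(data.table)

# Tiny stand-ins for the question's x and y
x <- data.table(a = 1:3)
y <- data.table(b = letters[1:4], c = c("A", "B", "A", "B"))

# All (a, b) combinations, without the dummy key column k
temp <- CJ(a = x$a, b = y$b)

# Update join: look up each b in y and add its class c by reference
temp[y, on = "b", c := i.c]

# Same key as in the original code, ready for the join with proportion
setkey(temp, a, c)
```

This still builds the full nrow(x) * nrow(y) table, so it only tidies the code and does not reduce memory use.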

Update:

As pointed out in the question I mentioned at the beginning, myFunction can go wrong if a group contains only one element, because sample(.I, ...) is then called with a single number. So I modified it, based on the comments on that post.

myFunction2 <- function(x, y, proportion){
  temp <- setkey(setkey(x[, c(k = 1, .SD)], k)[y[,c(k = 1, .SD)],
                                               allow.cartesian = TRUE][, k := NULL],
                 a, c)
  temp <- temp[setkey(proportion, a, c)][, prop := prop / .N, by = .(a, c)]
  # Note: replace = TRUE samples with replacement, unlike goreF and myFunction
  indexes <- temp[, .I[sample(.N, 5, replace = TRUE, prob = prop)], by = a]
  indexes <- indexes[[2]]
  temp[indexes]
}
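The single-element problem comes from base R's `sample()`: when its first argument is a single number n >= 1, it samples from 1:n instead of returning that one value. A minimal sketch, where the row index 7 is a made-up example:

```r
set.seed(1)

# Hypothetical group that contains exactly one row, with row number 7
rows <- 7L

# sample(7L, 1) is treated as sample(1:7, 1), so other row numbers appear
draws <- replicate(1000, sample(rows, 1))

# Indexing by position avoids this: sample a position, then look it up
idx <- rows[sample(length(rows), 1)]   # always 7
```

This is why myFunction2 uses `.I[sample(.N, ...)]` rather than `sample(.I, ...)`.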

benchmark(myFunction(x, y, proportion), myFunction2(x, y, proportion),
          replications = 5,
          columns = c("test", "replications", "elapsed", "relative",
                      "user.self", "sys.self"))

                           test replications elapsed relative user.self sys.self
1  myFunction(x, y, proportion)            5    6.61    1.064      6.23     0.36
2 myFunction2(x, y, proportion)            5    6.21    1.000      5.71     0.26

The safer version is marginally faster in this run (6.21 s vs. 6.61 s over 5 replications).
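All versions above still build the full cross join, which is what the question identifies as the memory problem. A possible alternative is to group by `a` on `proportion` and sample row numbers of `y` directly, so that only 5 rows per `a` are ever stored. This is a sketch under assumptions, not a tested replacement. The data is a smaller stand-in (1,000 values of `a`), the helper names `y_class`, `class_size`, `n_y` and `picks` are made up for illustration, and no timings were measured:

```r
library(data.table)
set.seed(151)

# Smaller stand-in for the question's data
n_a <- 1000L
x <- data.table(a = seq_len(n_a))
y <- data.table(b = letters[1:20], c = rep(LETTERS[1:4], times = 5))
proportion <- CJ(a = x$a, c = LETTERS[1:4])
proportion[, prop := rgamma(.N, shape = 1)]
proportion[, prop := prop / sum(prop), by = a]

# For every row of y: its class and the number of y rows in that class
y_class    <- y$c
class_size <- as.numeric(table(y_class)[y_class])
n_y        <- nrow(y)

# Per a: weight of each y row = prop of its class / size of its class,
# i.e. uniform within a class, as in the original code
picks <- proportion[, {
  w <- prop[match(y_class, c)] / class_size
  .(row = sample.int(n_y, 5L, replace = FALSE, prob = w))
}, by = a]

# Attach the sampled rows of y to their a
chosen_pairs <- cbind(picks[, .(a)], y[picks$row])
```

The weights are the same as the `prop / .N` column in the original code, but the call to `sample.int()` runs once per value of `a`, so whether this is faster on the full data would need to be benchmarked.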


answered Nov 10 '22 10:11

minem
