Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R: data.table cross-join not working

Tags:

join

r

data.table

I have two data.tables that I want to join (form a Cartesian product of). One of the data.tables is keyed on a Date vector, and the other on a numeric vector:

# data.table with dates (as numeric)
dtDates2 = data.table(date = 
                       as.numeric(seq(from = as.Date('2014/01/01'), 
                           to = as.Date('2014/07/01'), by = 'weeks')),
                     data1 = rnorm(26))

# data.table with dates
dtDates1 = data.table(date = 
                        seq(from = as.Date('2014/01/01'), 
                            to = as.Date('2014/07/01'), by = 'weeks'),
                      data1 = rnorm(26))


# data.table with customer IDs
dtCustomers = data.table(customerID = seq(1, 100),
                      data2 = rnorm(100))

I setkey and try to cross-join them using CJ:

# cross join the two datatables
setkey(dtCustomers, customerID)
setkey(dtDates1, date)
setkey(dtDates2, date)

CJ(dtCustomers, dtDates1)
CJ(dtCustomers, dtDates2)

but get the following error:

Error in FUN(X[[1L]], ...) : 
  Invalid column: it has dimensions. Can't format it. If it's the result of data.table(table()), use as.data.table(table()) instead.

Not sure what I am doing wrong.

like image 757
tchakravarty Avatar asked Sep 17 '14 10:09

tchakravarty


2 Answers

There is no cross join functionality available in data.table out of the box.
Yet there is CJ.dt function (a CJ like but designed for data.tables) to achieve cartesian product (cross join) available in optiRum package (available in CRAN).
You can create the function:

CJ.dt = function(X,Y) {
  stopifnot(is.data.table(X),is.data.table(Y))
  k = NULL
  X = X[, c(k=1, .SD)]
  setkey(X, k)
  Y = Y[, c(k=1, .SD)]
  setkey(Y, NULL)
  X[Y, allow.cartesian=TRUE][, k := NULL][]
}
CJ.dt(dtCustomers, dtDates1)
CJ.dt(dtCustomers, dtDates2)

Yet there is a FR for convenience way to perform cross join filled in data.table#1717, so you could check there if there is a nicer api for cross join.

like image 178
jangorecki Avatar answered Sep 28 '22 01:09

jangorecki


thank you jangorecki for the very useful function

I had to add support for empty X and/or Y:

CJ.dt = function(X, Y) {
  stopifnot(is.data.table(X), is.data.table(Y))

  if(nrow(X) > 0 & nrow(Y) > 0){
    k = NULL
    X = X[, c(k = 1, .SD)]
    setkey(X, k)
    Y = Y[, c(k = 1, .SD)]
    setkey(Y, NULL)
    return(X[Y, allow.cartesian = T][, k := NULL][])

  } else {
    duplicatedNames <- names(Y)[names(Y) %in% names(X)]
    if(length(duplicatedNames) > 0) {
       setnames(Y, duplicatedNames, paste0("i.", duplicatedNames))
    }
    setkey(Y)
    setkey(X)
    return(cbind(X[!X], Y[!Y]))
  }

}

# X <- data.table(a = c(1, 2))
# Y <- data.table(a = c(2, 3), b = c(4, 5))
#
# CJ.dt(X, Y)

# CJ.dt(X[a > 2], Y)
like image 37
Fabian Gehring Avatar answered Sep 28 '22 02:09

Fabian Gehring