
Correlation between numeric and logical variable gives (intended) error?

Tags: r, data.table

Example data.

require(data.table)
dt <- data.table(rnorm(10), rnorm(10) < 0.5)

Computing the correlation between a numeric and a logical column gives an error.

cor(dt)
#Error in cor(dt) : 'x' must be numeric

But the error goes away when I convert to a data frame first.

cor(data.frame(dt))
#           V1         V2
#V1  1.0000000 -0.1631356
#V2 -0.1631356  1.0000000

Is this intended behaviour for data.table?

Bram Visser asked Mar 16 '15 01:03


2 Answers

cor checks whether its arguments x and y are data frames using is.data.frame (which also returns TRUE for a data.table) and, if so, coerces the argument to a matrix:

if (is.data.frame(x)) x <- as.matrix(x)
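
After that coercion, cor only accepts a numeric or logical matrix, so the storage type of the matrix decides whether the call succeeds. A minimal base R sketch of that check (the names m_num, m_chr and res are made up for illustration and are not part of the question):

```r
set.seed(1)
x <- rnorm(10)
flag <- x < 0.5

# Numeric matrix: the logical column is stored as 0/1.
m_num <- cbind(V1 = x, V2 = flag)
# Character matrix: both columns are stored as strings.
m_chr <- cbind(V1 = as.character(x), V2 = as.character(flag))

typeof(m_num)
typeof(m_chr)

# Works: returns a 2 x 2 correlation matrix.
cor(m_num)

# Fails: capture the error message instead of stopping the script.
res <- tryCatch(cor(m_chr), error = function(e) conditionMessage(e))
res
```

The message captured in res is the same "'x' must be numeric" error shown in the question.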

The issue appears to be that as.matrix.data.table and as.matrix.data.frame handle the example data differently:

as.matrix(dt)

returns a character matrix, which looks like a bug in data.table.
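
For comparison, here is a small base R sketch of what as.matrix does with a plain data frame. The data frames df and df_chr are invented for illustration; the same typeof() check can be run on as.matrix(dt) to see which behaviour an installed data.table shows.

```r
# Numeric plus logical columns: the result is a numeric matrix,
# with TRUE/FALSE converted to 1/0.
df <- data.frame(V1 = c(0.5, -1.2, 0.3), V2 = c(TRUE, FALSE, TRUE))
m <- as.matrix(df)
typeof(m)

# Adding a character column forces the whole matrix to character.
df_chr <- data.frame(V1 = c(0.5, -1.2, 0.3), V3 = c("a", "b", "c"))
m_chr <- as.matrix(df_chr)
typeof(m_chr)
```

Only the second case should produce a character matrix, which is why a character result for a numeric plus logical data.table is unexpected.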

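as.matrix.data.table and as.matrix.data.frame appear to share nearly identical coercion code, yet they end up handling this input differently. The branch below is the same in both, so the difference presumably lies in how the non.numeric flag gets set before it is reached: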

# Excerpt from data.table:::as.matrix.data.table
else if (non.numeric) {
    for (j in seq_len(p)) {
        if (is.character(X[[j]]))
            next
        xj <- X[[j]]
        miss <- is.na(xj)
        xj <- if (length(levels(xj)))
            as.vector(xj)
        else format(xj)
        is.na(xj) <- miss
        X[[j]] <- xj
    }
}

# Excerpt from base::as.matrix.data.frame
else if (non.numeric) {
    for (j in pseq) {
        if (is.character(X[[j]]))
            next
        xj <- X[[j]]
        miss <- is.na(xj)
        xj <- if (length(levels(xj)))
            as.vector(xj)
        else format(xj)
        is.na(xj) <- miss
        X[[j]] <- xj
    }
}

Currently the data.table version coerces the logical column to a character.
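
On a version that still has this behaviour, one possible workaround is to convert every column to numeric yourself before calling cor, so the result no longer depends on how as.matrix treats logical columns. The helper cor_numeric below is a hypothetical name, not part of data.table or base R, and it assumes every column can sensibly be converted with as.numeric:

```r
# Hypothetical helper: coerce each column to numeric, bind the
# columns into a numeric matrix, then compute the correlations.
cor_numeric <- function(d) {
  cols <- lapply(as.list(d), as.numeric)
  cor(do.call(cbind, cols))
}

# Demonstrated on a plain data frame; a data.table can be passed
# the same way, since it is also a list of columns.
set.seed(1)
x <- rnorm(10)
df <- data.frame(V1 = x, V2 = x < 0.5)
cor_numeric(df)
```

Converting to a data frame first, as in the question, achieves the same result. The helper just makes the numeric conversion explicit.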

mnel answered Sep 28 '22 09:09


This bug, #1083, is now fixed in the development version, v1.9.5, with commit #1797.

require(data.table)
set.seed(45L)
dt <- data.table(rnorm(10), rnorm(10) < 0.5)
dt
#             V1    V2
#  1:  0.3407997  TRUE
#  2: -0.7033403  TRUE
#  3: -0.3795377 FALSE
#  4: -0.7460474 FALSE
#  5: -0.8981073  TRUE
#  6: -0.3347941  TRUE
#  7: -0.5013782  TRUE
#  8: -0.1745357  TRUE
#  9:  1.8090374 FALSE
# 10: -0.2301050  TRUE
as.matrix(dt)
#               V1 V2
#  [1,]  0.3407997  1
#  [2,] -0.7033403  1
#  [3,] -0.3795377  0
#  [4,] -0.7460474  0
#  [5,] -0.8981073  1
#  [6,] -0.3347941  1
#  [7,] -0.5013782  1
#  [8,] -0.1745357  1
#  [9,]  1.8090374  0
# [10,] -0.2301050  1
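
To check whether a given installation already includes the fix, something along these lines could be used. This is only a sketch: the variable name fixed is made up, and the code is guarded so it still runs when data.table is not installed.

```r
fixed <- NA  # stays NA when data.table is not installed
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::data.table(rnorm(10), rnorm(10) < 0.5)
  # Compare the installed version with the one named in this answer.
  print(packageVersion("data.table") >= "1.9.5")
  # With the fix, the logical column becomes 0/1 and the matrix is numeric.
  fixed <- is.numeric(as.matrix(dt))
}
fixed
```

If fixed is TRUE, cor(dt) should work directly without converting to a data frame.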
Arun answered Sep 28 '22 09:09