Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R cor returns NaN sometimes

Tags:

r

nan

correlation

I've been working on some data, available here: Dropbox' csv file (please be kind to use it to replicate the error).

When I run the code:

t<-read.csv("120.csv")
x<-NULL
for (i in 1:100){
  x<-c(x,cor(t$nitrate,t$sulfate,use="na.or.complete"))
}
sum(is.nan(x))

I get random values of the last expression, usually around 55 to 60. I expect cor to give repetible results, so I expect x to be a vector of length=100 made of identical values. See, for example, the output of two independent runs:

> x<-NULL; for (i in 1:100){x<-c(x,cor(t$nitrate,t$sulfate,use="na.or.complete"))}
> sum(is.nan(x))
[1] 62
> head(x,10)
 [1]       NaN       NaN 0.2967441       NaN 0.2967441       NaN       NaN       NaN
 [9] 0.2967441       NaN
> x<-NULL; for (i in 1:100){x<-c(x,cor(t$nitrate,t$sulfate,use="na.or.complete"))}
> sum(is.nan(x))
[1] 52
> head(x,10)
 [1] 0.2967441       NaN       NaN       NaN       NaN 0.2967441 0.2967441       NaN
 [9] 0.2967441 0.2967441
> 

I wonder if I'm doing something wrong here, or if it's a[n] [un]known bug. If that's the case, I appreciate if somebody more profficient than I helps me reporting it to CRAN.

I read a very old (2001) post where the same behavior was exhibited by cor.test (see cor.test produces NaN sometimes.

I appreciate your kind explanations, as I'm a nOOb to R. Thanks!

Per Ben's Suggestion:

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Spanish_Colombia.1252  LC_CTYPE=Spanish_Colombia.1252    LC_MONETARY=Spanish_Colombia.1252 LC_NUMERIC=C                     
[5] LC_TIME=Spanish_Colombia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] stringr_0.6.2     digest_0.6.4      RCurl_1.95-4.3    bitops_1.0-6      qpcR_1.4-0        Matrix_1.1-4      robustbase_0.91-1 rgl_0.95.1157    
 [9] minpack.lm_1.1-8  MASS_7.3-35       plyr_1.8.1        swirl_2.2.16      ggplot2_1.0.0     lattice_0.20-29  

loaded via a namespace (and not attached):
 [1] colorspace_1.2-4 DEoptimR_1.0-2   grid_3.1.1       gtable_0.1.2     httr_0.5         labeling_0.3     munsell_0.4.2    proto_0.3-10     Rcpp_0.11.3     
[10] reshape2_1.4     scales_0.2.4     testthat_0.9.1   tools_3.1.1      yaml_2.1.13  

Results of find("cor"):

> find("cor")
[1] "package:stats"

---------- ### Second Edit ###--------

I restarted the session (I didn't quite find how to pass the --vanilla argument. I'm using Rstudio), and this is the new sessionInfo:

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=Spanish_Colombia.1252  LC_CTYPE=Spanish_Colombia.1252    LC_MONETARY=Spanish_Colombia.1252 LC_NUMERIC=C                     
[5] LC_TIME=Spanish_Colombia.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.1.1

I run the commands again in the new session, and still get sum(is.nan(x))=52 :(

Just in case it's useful:

> cor
function (x, y = NULL, use = "everything", method = c("pearson", 
    "kendall", "spearman")) 
{
    na.method <- pmatch(use, c("all.obs", "complete.obs", "pairwise.complete.obs", 
        "everything", "na.or.complete"))
    if (is.na(na.method)) 
        stop("invalid 'use' argument")
    method <- match.arg(method)
    if (is.data.frame(y)) 
        y <- as.matrix(y)
    if (is.data.frame(x)) 
        x <- as.matrix(x)
    if (!is.matrix(x) && is.null(y)) 
        stop("supply both 'x' and 'y' or a matrix-like 'x'")
    if (!(is.numeric(x) || is.logical(x))) 
        stop("'x' must be numeric")
    stopifnot(is.atomic(x))
    if (!is.null(y)) {
        if (!(is.numeric(y) || is.logical(y))) 
            stop("'y' must be numeric")
        stopifnot(is.atomic(y))
    }
    Rank <- function(u) {
        if (length(u) == 0L) 
            u
        else if (is.matrix(u)) {
            if (nrow(u) > 1L) 
                apply(u, 2L, rank, na.last = "keep")
            else row(u)
        }
        else rank(u, na.last = "keep")
    }
    if (method == "pearson") 
        .Call(C_cor, x, y, na.method, FALSE)
    else if (na.method %in% c(2L, 5L)) {
        if (is.null(y)) {
            .Call(C_cor, Rank(na.omit(x)), NULL, na.method, method == 
                "kendall")
        }
        else {
            nas <- attr(na.omit(cbind(x, y)), "na.action")
            dropNA <- function(x, nas) {
                if (length(nas)) {
                  if (is.matrix(x)) 
                    x[-nas, , drop = FALSE]
                  else x[-nas]
                }
                else x
            }
            .Call(C_cor, Rank(dropNA(x, nas)), Rank(dropNA(y, 
                nas)), na.method, method == "kendall")
        }
    }
    else if (na.method != 3L) {
        x <- Rank(x)
        if (!is.null(y)) 
            y <- Rank(y)
        .Call(C_cor, x, y, na.method, method == "kendall")
    }
    else {
        if (is.null(y)) {
            ncy <- ncx <- ncol(x)
            if (ncx == 0) 
                stop("'x' is empty")
            r <- matrix(0, nrow = ncx, ncol = ncy)
            for (i in seq_len(ncx)) {
                for (j in seq_len(i)) {
                  x2 <- x[, i]
                  y2 <- x[, j]
                  ok <- complete.cases(x2, y2)
                  x2 <- rank(x2[ok])
                  y2 <- rank(y2[ok])
                  r[i, j] <- if (any(ok)) 
                    .Call(C_cor, x2, y2, 1L, method == "kendall")
                  else NA
                }
            }
            r <- r + t(r) - diag(diag(r))
            rownames(r) <- colnames(x)
            colnames(r) <- colnames(x)
            r
        }
        else {
            if (length(x) == 0L || length(y) == 0L) 
                stop("both 'x' and 'y' must be non-empty")
            matrix_result <- is.matrix(x) || is.matrix(y)
            if (!is.matrix(x)) 
                x <- matrix(x, ncol = 1L)
            if (!is.matrix(y)) 
                y <- matrix(y, ncol = 1L)
            ncx <- ncol(x)
            ncy <- ncol(y)
            r <- matrix(0, nrow = ncx, ncol = ncy)
            for (i in seq_len(ncx)) {
                for (j in seq_len(ncy)) {
                  x2 <- x[, i]
                  y2 <- y[, j]
                  ok <- complete.cases(x2, y2)
                  x2 <- rank(x2[ok])
                  y2 <- rank(y2[ok])
                  r[i, j] <- if (any(ok)) 
                    .Call(C_cor, x2, y2, 1L, method == "kendall")
                  else NA
                }
            }
            rownames(r) <- colnames(x)
            colnames(r) <- colnames(y)
            if (matrix_result) 
                r
            else drop(r)
        }
    }
}
<bytecode: 0x0000000008ce0158>
<environment: namespace:stats>

Thanks again.

like image 287
PavoDive Avatar asked Nov 04 '14 22:11

PavoDive


2 Answers

Several comments and notes:

  • Nobody is able to reproduce your problem
  • it cannot be a problem with the 120.csv file which is all fine.
  • Really, using another use=".." option is just a workaround
  • The underlying C code in R's sources uses ISNAN(.) everywhere for detecting if a value is NA or NaN and this in term goes to your (system internal) C library's isnan(.) function.
  • you (and only you) sometimes get NaN because ISNAN(.) does not return "true" in some cases it should, and the floating point arithmetic computes with the NAs and correctly returns NaN's.

As an "old" R core member, I can assure you that ISNAN(.) is used in many many fundamental places inside R's core computations, and the observations that for you it sometimes seems to not detect NA/NaN so that they propagate into the result is very problematic. As Duncan Murdoch has said, answering your R bug report https://bugs.r-project.org/bugzilla/show_bug.cgi?id=16058 this must be a problem with your specific "system" one way or the other... As I assume you simply downloaded R from CRAN, also for R 3.1.2, and you still see the problem, I'd tend to say that your system software (Windows) or - less likely - your hardware must be slightly broken / corrupt.

like image 181
Martin Mächler Avatar answered Oct 17 '22 08:10

Martin Mächler


Giovanni, It its working fine for me. Maybe you should try to change the parameters of cor to use = "complete.obs" and see if that helps. Also you should check your CSV file weather they are corrupted or not.

I hope it helps.

like image 34
DANISH SINHA Avatar answered Oct 17 '22 09:10

DANISH SINHA