
How to resolve integer overflow errors in R estimation

I'm trying to estimate a model using speedglm in R. The dataset is large (~69.88 million rows and 38 columns). Multiplying the number of rows by the number of columns gives ~2.7 billion, which exceeds R's 32-bit integer limit (2,147,483,647). I can't provide the data, but the following examples recreate the issue.

library(speedglm)

# large example that works 
require(biglm)
n <- 500000
k <- 500
y <- rgamma(n, 1.5, 1)
x <- round(matrix(rnorm(n*k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))   
working.example <- speedglm(fo, data = da, family = Gamma(log))

# repeat with large enough size to break 
k <- 5000       # 10 times larger than above
x <- round(matrix(rnorm(n*k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))   
failed.example <- speedglm(fo, data = da, family = Gamma(log))

# attempting to resolve error with chunksize
attempted.fixed.example <- speedglm(fo, data = da, family = Gamma(log), chunksize = 10^6)

The last two calls each fail with an error and an integer overflow warning:

Error in if (!replace && is.null(prob) && n > 1e+07 && size <= n/2) .Internal(sample2(n,  :  
  missing value where TRUE/FALSE needed
In addition: Warning message:
In nrow(X) * ncol(X) : NAs produced by integer overflow 

I understand the warning, but I do not understand the error. They seem to be related in this case as they appear together after each attempt.

Removing columns allows the estimation to complete. It does not seem to matter which columns are removed; removing interacted or non-interacted variables will both result in a completed estimation. The chunksize option was added after receiving the error initially, but has not helped.

My questions are: (1) what causes the first error? (2) is there a way to estimate models using data such that the number of rows by the number of columns is larger than the integer limit? (3) is there a better na.action to use in this case?

Thanks,

JP.

Running: R version 3.3.3 (2017-03-06)

Actual code below:

dft_var <- c("cltvV0", "cltvV60", "cltvV120", "VCFLBRQ", "ageV0", 
             "ageV1", "ageV8", "ageV80", "FICOV300", "FICOV650", 
             "FICOV900", "SingleHouse", "Apt", "Mobile", "Duplex", 
             "Row", "Modular", "Rural", "FirstTimeBuyer", 
             "FirstTimeBuyerMissing", "brwtotinMissing", "IncomeRatio", 
             "VintageBefore2001", "NFLD", "yoy.fcpwti:province_n") 
logit1 <- speedglm(formula = paste("DefaultFlag ~ ", 
                                   paste(dft_var, collapse = "+"), 
                                   sep = ""), 
                   family = binomial(logit), 
                   na.action = na.exclude, 
                   data = default.data,
                   chunksize = 1*10^7)
James asked Jun 06 '17 19:06


2 Answers

Update:

Based on my investigation below, @James figured out that the problem can be avoided by providing a non-NULL value for the sparse parameter in the call to speedglm, as this prevents the internal call to the is.sparse function.

Using the example above, the following should now work:

speedglm(fo, data = da, family = Gamma(log), sparse = FALSE)

My original answer:

Both the warning and the error come from the same line in the function is.sparse in the package speedglm.

The line is:

sample(X,round((nrow(X)*ncol(X)*camp),digits=0),replace=FALSE)

The warning happens because of the use of nrow(X)*ncol(X) for a large matrix. The nrow and ncol functions return integer values, so their product can overflow. Here is an illustration:

nr = 1000000L
nc = 1000000L
nr*nc
# [1] NA
# Warning message:
# In nr * nc : NAs produced by integer overflow
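
As a minimal sketch of how the overflow itself can be avoided, converting one operand to double before multiplying keeps the product out of integer arithmetic. The dimensions below are the approximate ones from the question, and overflows_int is a hypothetical helper written for this illustration; it is not part of speedglm.

```r
# Largest value an R integer can hold (2147483647).
int_max <- .Machine$integer.max

# Approximate dimensions of the dataset described in the question.
n_rows <- 69880000L
n_cols <- 38L

# Converting one operand to double makes the whole product a double,
# so the result is exact here and no overflow warning is raised.
n_cells <- as.numeric(n_rows) * n_cols

# Hypothetical helper (not from speedglm): TRUE when nrow * ncol
# would not fit in an integer.
overflows_int <- function(nr, nc) {
  as.numeric(nr) * as.numeric(nc) > .Machine$integer.max
}

overflows_int(n_rows, n_cols)   # the question's data: TRUE
overflows_int(500000L, 500L)    # the working example: FALSE
```

A check like this can be run before fitting to see whether a dataset is large enough to hit the problem.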

The error happens because sample is then called on a large matrix with size = NA. Internally the comparison size <= n/2 (visible in the error message) evaluates to NA, and if cannot branch on NA. Here is an illustration:

sample(matrix(1,3000,1000000), NA, replace=FALSE)
# Error in if (useHash) .Internal(sample2(n, size)) else .Internal(sample(n,  : 
# missing value where TRUE/FALSE needed
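
The same failure can be reproduced without allocating a huge matrix. The sketch below only mimics the shape of the condition shown in the error message; the variable names are assumptions chosen for the illustration, not the internals of sample.

```r
# Stand-in for the NA produced by the overflowing nrow(X) * ncol(X).
size <- NA_integer_
# Any population size above 1e7, so the first part of the condition is TRUE.
n <- 2e7

# TRUE && NA is NA, so the whole condition is NA rather than TRUE or FALSE.
cond <- n > 1e7 && size <= n / 2
is.na(cond)

# if() cannot branch on NA; capture the resulting error message.
msg <- tryCatch(
  {
    if (cond) "hash" else "plain"
  },
  error = function(e) conditionMessage(e)
)
msg
```

With a population of 1e7 or fewer elements the first comparison is FALSE, the condition short-circuits to FALSE, and this particular error is not reached, which is consistent with the problem only showing up for very large inputs.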
Andrey Shabalin answered Oct 03 '22 07:10


Thanks to @Andrey's guidance I was able to solve the problem. The issue was the call to sample in the is.sparse check. To bypass it I set sparse=FALSE in the call to speedglm (this should work with sparse=TRUE as well, though I haven't tried it). This works because speedglm calls is.sparse via speedglm.wfit in the following way:

if (is.null(sparse))
    sparse <- is.sparse(x = x, sparsellim, camp)

So setting sparse avoids the is.sparse function.

Using the example above, the following should now work:

speedglm(fo, data = da, family = Gamma(log), sparse = FALSE)
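
Why supplying sparse is enough can be sketched with a toy function that uses the same NULL-default pattern. Both toy_fit and auto_detect are hypothetical names made up for this illustration; they are not speedglm functions.

```r
# Hypothetical stand-in for an automatic check that fails on large input.
auto_detect <- function(x) {
  stop("auto-detection should not run when 'sparse' is supplied")
}

# Toy function with the same pattern: the check runs only when the
# caller leaves the argument at its NULL default.
toy_fit <- function(x, sparse = NULL) {
  if (is.null(sparse))
    sparse <- auto_detect(x)
  sparse
}

# Supplying a value skips auto_detect() entirely.
toy_fit(1:10, sparse = FALSE)

# Leaving it NULL triggers the failing check.
res <- try(toy_fit(1:10), silent = TRUE)
inherits(res, "try-error")
```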
James answered Oct 03 '22 08:10