I'm trying to estimate a model using speedglm in R. The dataset is large (~69.88 million rows and 38 columns). The product of the row and column counts is ~2.7 billion, which exceeds R's 32-bit integer limit (2,147,483,647). I can't provide the data, but the following examples recreate the issue.
library(speedglm)
# large example that works
require(biglm)
n <- 500000
k <- 500
y <- rgamma(n, 1.5, 1)
x <- round(matrix(rnorm(n*k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))
working.example <- speedglm(fo, data = da, family = Gamma(log))
# repeat with large enough size to break
k <- 5000 # 10 times larger than above
x <- round(matrix(rnorm(n*k), n, k), digits = 3)
colnames(x) <- paste("s", 1:k, sep = "")
da <- data.frame(y, x)
fo <- as.formula(paste("y~", paste(paste("s", 1:k, sep = ""), collapse = "+")))
failed.example <- speedglm(fo, data = da, family = Gamma(log))
# attempting to resolve error with chunksize
attempted.fixed.example <- speedglm(fo, data = da, family = Gamma(log), chunksize = 10^6)
Both of the larger calls fail with the following error and integer overflow warning:
Error in if (!replace && is.null(prob) && n > 1e+07 && size <= n/2) .Internal(sample2(n, :
missing value where TRUE/FALSE needed
In addition: Warning message:
In nrow(X) * ncol(X) : NAs produced by integer overflow
I understand the warning, but I do not understand the error. They seem to be related in this case as they appear together after each attempt.
Removing columns allows the estimation to complete. It does not seem to matter which columns are removed; removing interacted or non-interacted variables will both result in a completed estimation. The chunksize option was added after receiving the error initially, but has not helped.
My questions are: (1) what causes the first error? (2) is there a way to estimate models using data such that the number of rows by the number of columns is larger than the integer limit? (3) is there a better na.action to use in this case?
Thanks,
JP.
Running: R version 3.3.3 (2017-03-06)
Actual code below:
dft_var <- c("cltvV0", "cltvV60", "cltvV120", "VCFLBRQ", "ageV0",
"ageV1", "ageV8", "ageV80", "FICOV300", "FICOV650",
"FICOV900", "SingleHouse", "Apt", "Mobile", "Duplex",
"Row", "Modular", "Rural", "FirstTimeBuyer",
"FirstTimeBuyerMissing", "brwtotinMissing", "IncomeRatio",
"VintageBefore2001", "NFLD", "yoy.fcpwti:province_n")
logit1 <- speedglm(formula = paste("DefaultFlag ~ ",
paste(dft_var, collapse = "+"),
sep = ""),
family = binomial(logit),
na.action = na.exclude,
data = default.data,
chunksize = 1*10^7)
Update:
Based on my investigation below, @James figured out that the problem can be avoided by providing a non-NULL value for the parameter sparse in the call to the speedglm function, as this prevents the internal call to the is.sparse function.
Using the example above, the following should now work:
speedglm(fo, data = da, family = Gamma(log), sparse = FALSE)
My original answer:
Both the warning and the error come from the same line in the function is.sparse in the package speedglm.
The line is:
sample(X,round((nrow(X)*ncol(X)*camp),digits=0),replace=FALSE)
The warning happens because of the use of nrow(X)*ncol(X) for a large matrix. The nrow and ncol functions return integer values, so their product can overflow. Here is an illustration:
nr = 1000000L
nc = 1000000L
nr*nc
# [1] NA
# Warning message:
# In nr * nc : NAs produced by integer overflow
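To make the arithmetic concrete, here is a small sketch of my own (not code from speedglm) using the sizes from the failing example in the question. It shows that converting one dimension to double before multiplying avoids the overflow.

```r
# Sizes from the failing example in the question.
n <- 500000L
k <- 5000L

# integer * integer overflows past .Machine$integer.max and gives NA.
int_product <- suppressWarnings(n * k)

# Converting one operand to double first keeps the exact value.
dbl_product <- as.numeric(n) * k

is.na(int_product)                  # TRUE
dbl_product > .Machine$integer.max  # TRUE
```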
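The error happens because the overflowed product, NA, is passed on as the size argument of sample. When X is a large matrix, sample tests a condition involving size inside an if statement, and if cannot branch on NA. Here is an illustration: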
sample(matrix(1,3000,1000000), NA, replace=FALSE)
# Error in if (useHash) .Internal(sample2(n, size)) else .Internal(sample(n, :
# missing value where TRUE/FALSE needed
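The illustration above has to allocate a matrix of 3 billion elements, so it may not be practical to run. A cheaper way to see the mechanism is to evaluate the condition quoted in the question's error message by hand. This is only a sketch: the variable names mirror that condition, and the values are stand-ins I chose for illustration.

```r
# Stand-in values; names mirror the condition quoted in the error message.
replace <- FALSE
prob <- NULL
n <- 2.5e9           # number of elements in the large matrix
size <- NA_integer_  # what the overflowed nrow(X) * ncol(X) turns into

# TRUE && TRUE && TRUE && NA evaluates to NA.
cond <- !replace && is.null(prob) && n > 1e+07 && size <= n/2

# if() cannot branch on NA, so it raises an error; capture its message.
msg <- tryCatch(if (cond) "branch A" else "branch B",
                error = function(e) conditionMessage(e))

# With a small n the chain short-circuits to FALSE before size is checked.
small_cond <- !replace && is.null(prob) && 100 > 1e+07 && size <= 100/2
```

This also suggests why the error only shows up for large inputs: the NA reaches the if statement only once n > 1e+07 is TRUE.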
Thanks to @Andrey's guidance I was able to solve the problem. The issue was the sample function in the is.sparse check. To bypass this I set sparse=FALSE in the options for speedglm (this should work for sparse=TRUE as well, though I haven't tried). This is because speedglm calls is.sparse via speedglm.wfit in the following way:
if (is.null(sparse))
sparse <- is.sparse(x = x, sparsellim, camp)
So setting sparse to any non-NULL value skips the is.sparse call.
Using the example above, the following should now work:
speedglm(fo, data = da, family = Gamma(log), sparse = FALSE)
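If you want to know in advance whether a dataset is large enough to hit this, a rough check along these lines might help. Note that product_overflows is a hypothetical helper I made up for illustration and is not part of speedglm. The matrix that gets checked is the model matrix, whose column count can differ from the data frame's (intercept, expanded factors, interactions), so treat the result as a guide only.

```r
# Hypothetical helper, not part of speedglm: TRUE when the product of the
# two dimensions would not fit in a 32-bit integer.
product_overflows <- function(n_rows, n_cols) {
  as.numeric(n_rows) * as.numeric(n_cols) > .Machine$integer.max
}

product_overflows(500000, 500)   # working example from the question
product_overflows(500000, 5000)  # failing example from the question
product_overflows(69.88e6, 38)   # approximate size of the real data
```

When the check returns TRUE, passing sparse = FALSE explicitly, as in the call above, sidesteps the is.sparse sampling step.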