How to plot interaction effects from extremely large data sets (esp. from rxGlm output)

Tags:

I am currenlty computing glm models off a huge data data set. Both glm and even speedglm take days to compute.

I currently have around 3M observations and altogether 400 variables, only some of which are used for the regression. In my regression I use 4 integer independent variables (iv1, iv2, iv3, iv4), 1 binary independent variable as factor (iv5), the interaction term (x * y, where x is an integer and y is a binary dummy variable as factor). Finally, I have fixed effects along years ff1 and company ids ff2. I have 15 years and 3000 conmpanies. I have introduced the fixed effects by adding them as factors. I observe that especially the 3000 company fixed effects make the computation very slow in stats glm and also speedglm.

I therefore decided to try Microsoft R's rxGlm (RevoScaleR), as this can address more threads and processor cores. Indeed, the speed of analysis is a lot faster. Also, I compared the results for a sub-sample to the one of standard glm and they matched.

I used the following function:

mod1 <- rxGlm(formula = dv ~ 
                      iv1 + iv2 + iv3+ 
                      iv4 + iv5 +
                      x * y +
                      ff1  + ff2,
                    family = binomial(link = "probit"), data = dat,
                    dropFirst = TRUE, dropMain = FALSE, covCoef = TRUE, cube = FALSE)

However, I am facing a problem when trying to plot the interaction term using the effects package. Upon calling the following function, I am receiving the following error:

> plot(effect("x*y", mod1))
Error in terms.default(model) : no terms component nor attribute

I assume the problem is that rxGlm does not store the data needed to plot the interaction. I believe so because the rxGlm object is a lot smaller than the glm oject, hence likely containing less data (80 MB vs several GB).

I now tried to convert the rxGlm object to glm via as.glm(). Still, the effects() call does not yield a result and results in the following error messages:

Error in dnorm(eta) : 
  Non-numerical argument for mathematical function
In addition: Warning messages:
1: In model.matrix.default(mod, data = list(dv = c(1L, 2L,  :
  variable 'x for y' is absent, its contrast will be ignored

If I now compare an original glm to the "converted glm", I find that the converted glm contains a lot less items. E.g., it does not contain effects and for contrasts it states only contr.treatment for each variable.

I am now looking primarily for a way to transpose the rxGlm output object in a format so I can use if with the effect() function. If there is no way to do so, how can I get an interaction plot using functions within the RevoScaleR package, e.g., rxLinePlot()? rxLinePlot() also plots reasonably quick, however, I have not yet found a way how to get typical interaction effect plots out of it. I want to avoid first calculating the full glm model and then plot because this takes very long.

595

asked Nov 02 '17 16:11

deca

1 Answers

If you can get the coefficients can't you just roll your own? This would not be a dataset size issue

# ex. data
n = 2000
dat <- data.frame( dv = sample(0:1, size = n, rep = TRUE), 
                   iv1 = sample(1:10, size = n, rep = TRUE),
                   iv2 = sample(1:10, size = n, rep = TRUE),
                   iv3 = sample(1:10, size = n, rep = TRUE),
                   iv4 = sample(0:10, size = n, rep = TRUE),
                   iv5 = as.factor(sample(0:1, size = n, rep = TRUE)),
                   x = sample(1:100, size = n, rep = TRUE),
                   y = as.factor(sample(0:1, size = n, rep = TRUE)),
                   ff1  = as.factor(sample(1:15, size = n, rep = TRUE)),
                   ff2  = as.factor(sample(1:100, size = n, rep = TRUE))
                   )

mod1 <- glm(formula = dv ~ 
                      iv1 + iv2 + iv3+ 
                      iv4 + iv5 +
                      x * y +
                      ff1  + ff2,
                    family = binomial(link = "probit"), data = dat)

# coefficients for x, y and their interaction
x1 <- coef(mod1)['x']
y1 <- coef(mod1)['y1']
xy <- coef(mod1)['x:y1']

x <- 1:100
a <- x1*x
b <- x1*x + y1 + xy*x

plot(a~x, type= 'line', col = 'red', xlim = c(0,max(x)), ylim = range(c(a, b)))
lines(b~x, col = 'blue')
legend('topright', c('y = 0', 'y = 1'), col = c('red', 'blue'))

here is how to make a reproduceable

199

answered Sep 24 '22 02:09

user3357177

Related questions
                            
                                Loops in R - Need to use index, anyway to avoid 'for'?
                            
                                filtering data frame based on NA on multiple columns
                            
                                Hide sidebar in default in shinydashboard
                            
                                how to calculate mean/median per group in a dataframe in r [duplicate]
                            
                                Preserving large numbers
                            
                                Replace -inf, NaN and NA values with zero in a dataset in R
                            
                                Estimate Cohen's d for effect size
                            
                                How to Replace Raster Values Less than 0 to NA in R code
                            
                                CARET. Relationship between data splitting and trainControl
                            
                                Line search fails in training ksvm prob.model
                            
                                Behaviour of facet_grid and scales="free" with missing data
                            
                                force boxplots from geom_boxplot to constant width
                            
                                Pie Charts in ggsubplot (ggplot2)
                            
                                Access/use R console when running a shiny app
                            
                                How to solve this error message in rmarkdown?
                            
                                Using dplyr and broom to compute kmeans on a training and test set
                            
                                Memory Leak When Opening Data Frame With RDCOMClient
                            
                                Basis provided by Ns() in R Epi package

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to plot interaction effects from extremely large data sets (esp. from rxGlm output)

Tags:

r

glm

microsoft-r

deca

People also ask

1 Answers

user3357177

Recent Activity

Donate For Us