Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to plot interaction effects from extremely large data sets (esp. from rxGlm output)

Tags:

r

glm

microsoft-r

I am currenlty computing glm models off a huge data data set. Both glm and even speedglm take days to compute.

I currently have around 3M observations and altogether 400 variables, only some of which are used for the regression. In my regression I use 4 integer independent variables (iv1, iv2, iv3, iv4), 1 binary independent variable as factor (iv5), the interaction term (x * y, where x is an integer and y is a binary dummy variable as factor). Finally, I have fixed effects along years ff1 and company ids ff2. I have 15 years and 3000 conmpanies. I have introduced the fixed effects by adding them as factors. I observe that especially the 3000 company fixed effects make the computation very slow in stats glm and also speedglm.

I therefore decided to try Microsoft R's rxGlm (RevoScaleR), as this can address more threads and processor cores. Indeed, the speed of analysis is a lot faster. Also, I compared the results for a sub-sample to the one of standard glm and they matched.

I used the following function:

mod1 <- rxGlm(formula = dv ~ 
                      iv1 + iv2 + iv3+ 
                      iv4 + iv5 +
                      x * y +
                      ff1  + ff2,
                    family = binomial(link = "probit"), data = dat,
                    dropFirst = TRUE, dropMain = FALSE, covCoef = TRUE, cube = FALSE)

However, I am facing a problem when trying to plot the interaction term using the effects package. Upon calling the following function, I am receiving the following error:

> plot(effect("x*y", mod1))
Error in terms.default(model) : no terms component nor attribute

I assume the problem is that rxGlm does not store the data needed to plot the interaction. I believe so because the rxGlm object is a lot smaller than the glm oject, hence likely containing less data (80 MB vs several GB).

I now tried to convert the rxGlm object to glm via as.glm(). Still, the effects() call does not yield a result and results in the following error messages:

Error in dnorm(eta) : 
  Non-numerical argument for mathematical function
In addition: Warning messages:
1: In model.matrix.default(mod, data = list(dv = c(1L, 2L,  :
  variable 'x for y' is absent, its contrast will be ignored

If I now compare an original glm to the "converted glm", I find that the converted glm contains a lot less items. E.g., it does not contain effects and for contrasts it states only contr.treatment for each variable.

I am now looking primarily for a way to transpose the rxGlm output object in a format so I can use if with the effect() function. If there is no way to do so, how can I get an interaction plot using functions within the RevoScaleR package, e.g., rxLinePlot()? rxLinePlot() also plots reasonably quick, however, I have not yet found a way how to get typical interaction effect plots out of it. I want to avoid first calculating the full glm model and then plot because this takes very long.

like image 595
deca Avatar asked Nov 02 '17 16:11

deca


People also ask

How do you find the interaction of data?

In order to find an interaction, you must have a factorial design, in which the two (or more) independent variables are "crossed" with one another so that there are observations at every combination of levels of the two independent variables.

How do you find interaction effects?

To understand potential interaction effects, compare the lines from the interaction plot: If the lines are parallel, there is no interaction. If the lines are not parallel, there is an interaction.

What is interaction plot in Anova?

Interaction plots are used to understand the behavior of one variable depends on the value of another variable. Interaction effects are analyzed in regression analysis, DOE (Design of Experiments) and ANOVA (Analysis of variance).


1 Answers

If you can get the coefficients can't you just roll your own? This would not be a dataset size issue

# ex. data
n = 2000
dat <- data.frame( dv = sample(0:1, size = n, rep = TRUE), 
                   iv1 = sample(1:10, size = n, rep = TRUE),
                   iv2 = sample(1:10, size = n, rep = TRUE),
                   iv3 = sample(1:10, size = n, rep = TRUE),
                   iv4 = sample(0:10, size = n, rep = TRUE),
                   iv5 = as.factor(sample(0:1, size = n, rep = TRUE)),
                   x = sample(1:100, size = n, rep = TRUE),
                   y = as.factor(sample(0:1, size = n, rep = TRUE)),
                   ff1  = as.factor(sample(1:15, size = n, rep = TRUE)),
                   ff2  = as.factor(sample(1:100, size = n, rep = TRUE))
                   )

mod1 <- glm(formula = dv ~ 
                      iv1 + iv2 + iv3+ 
                      iv4 + iv5 +
                      x * y +
                      ff1  + ff2,
                    family = binomial(link = "probit"), data = dat)

# coefficients for x, y and their interaction
x1 <- coef(mod1)['x']
y1 <- coef(mod1)['y1']
xy <- coef(mod1)['x:y1']

x <- 1:100
a <- x1*x
b <- x1*x + y1 + xy*x

plot(a~x, type= 'line', col = 'red', xlim = c(0,max(x)), ylim = range(c(a, b)))
lines(b~x, col = 'blue')
legend('topright', c('y = 0', 'y = 1'), col = c('red', 'blue'))

here is how to make a reproduceable

like image 199
user3357177 Avatar answered Sep 24 '22 02:09

user3357177