Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

multinomial logistic multilevel models in R

Problem: I need to estimate a set of multinomial logistic multilevel models and can’t find an appropriate R package. What is the best R package to estimate such models? STATA 13 recently added this feature to their multilevel mixed-effects models – so the technology to estimate such models seems to be available.

Details: A number of research questions require the estimation of multinomial logistic regression models in which the outcome variable is categorical. For example, biologists might be interested to investigate which type of trees (e.g., pine trees, maple trees, oak trees) are most impacted by acid rain. Market researchers might be interested whether there is a relationship between the age of customers and the frequency of shopping at Target, Safeway, or Walmart. These cases have in common that the outcome variable is categorical (unordered) and multinomial logistic regressions are the preferred method of estimation. In my case, I am investigating differences in types of human migration, with the outcome variable (mig) coded 0=not migrated, 1=internal migration, 2=international migration. Here is a simplified version of my data set:

migDat=data.frame(hhID=1:21,mig=rep(0:2,times=7),age=ceiling(runif(21,15,90)),stateID=rep(letters[1:3],each=7),pollution=rep(c("high","low","moderate"),each=7),stringsAsFactors=F)     hhID mig age stateID pollution 1     1   0  47       a      high 2     2   1  53       a      high 3     3   2  17       a      high 4     4   0  73       a      high 5     5   1  24       a      high 6     6   2  80       a      high 7     7   0  18       a      high 8     8   1  33       b       low 9     9   2  90       b       low 10   10   0  49       b       low 11   11   1  42       b       low 12   12   2  44       b       low 13   13   0  82       b       low 14   14   1  70       b       low 15   15   2  71       c  moderate 16   16   0  18       c  moderate 17   17   1  18       c  moderate 18   18   2  39       c  moderate 19   19   0  35       c  moderate 20   20   1  74       c  moderate 21   21   2  86       c  moderate 

My goal is to estimate the impact of age (independent variable) on the odds of (1) migrating internally vs. not migrating, (2) migrating internationally vs. not migrating, (3) migrating internally vs. migrating internationally. An additional complication is that my data operate at different aggregation levels (e.g., pollution operates at the state-level) and I am also interested in predicting the impact of air pollution (pollution) on the odds of embarking on a particular type of movement.

Clunky solutions: One could estimate a set of separate logistic regression models by reducing the data set for each model to only two migration types (e.g., Model 1: only cases coded mig=0 and mig=1; Model 2: only cases coded mig=0 and mig=2; Model 3: only cases coded mig=1 and mig=2). Such a simple multilevel logistic regression model could be estimated with lme4 but this approach is less ideal because it does not appropriately account for the impact of the omitted cases. A second solution would be to run multinomial logistic multilevel models in MLWiN through R using the R2MLwiN package. But since MLWiN is not open source and the generated object difficult to use, I would prefer to avoid this option. Based on a comprehensive internet search there seem to be some demand for such models but I am not aware of a good R package. So it would be great if some experts who have run such models could provide a recommendation and if there are more than one package maybe indicate some advantages/disadvantages. I am sure that such information would be a very helpful resource for multiple R users. Thanks!!

Best, Raphael

like image 353
Raphael Avatar asked Jan 13 '14 00:01

Raphael


People also ask

What is multilevel multinomial logistic regression?

A multilevel multinomial logistic regression model was considered to predict the probability of being at or below a hemoglobin level using the available predictors. Since the outcome variable is ordinal, we consider cumulative logit link function.

What is multinomial logistic regression in R?

Multinomial logistic regression is used to model nominal outcome variables, in which the log odds of the outcomes are modeled as a linear combination of the predictor variables.

Is multinomial logistic regression the same as multiple regression?

What is Multinomial Logistic Regression? Multinomial logistic regression is used when you have a categorical dependent variable with two or more unordered levels (i.e. two or more discrete outcomes). It is practically identical to logistic regression, except that you have multiple possible outcomes instead of just one.

What is the difference between logit and multinomial logit models?

A logit model is a limited dependent variable model that handles only binary outcomes (e.g. 0/1). A multinomial model, in contrast, handles multiple categories of an outcome (e.g. 0/1/2/3). You will see that both logit and multinomial models could be done in two stages or, in fact, be nested.


Video Answer


2 Answers

There are generally two ways of fitting a multinomial models of a categorical variable with J groups: (1) Simultaneously estimating J-1 contrasts; (2) Estimating a separate logit model for each contrast.

Produce these two methods the same results? No, but the results are often similar

Which method is better? Simultaneously fitting is more precise (see below for an explanation why)

Why would someone use separate logit models then? (1) the lme4 package has no routine for simultaneously fitting multinomial models and there is no other multilevel R package that could do this. So separate logit models are presently the only practical solution if someone wants to estimate multilevel multinomial models in R. (2) As some powerful statisticians have argued (Begg and Gray, 1984; Allison, 1984, p. 46-47), separate logit models are much more flexible as they permit for the independent specification of the model equation for each contrast.

Is it legitimate to use separate logit models? Yes, with some disclaimers. This method is called the “Begg and Gray Approximation”. Begg and Gray (1984, p. 16) showed that this “individualized method is highly efficient”. However, there is some efficiency loss and the Begg and Gray Approximation produces larger standard errors (Agresti 2002, p. 274). As such, it is more difficult to obtain significant results with this method and the results can be considered conservative. This efficiency loss is smallest when the reference category is large (Begg and Gray, 1984; Agresti 2002). R packages that employ the Begg and Gray Approximation (not multilevel) include mlogitBMA (Sevcikova and Raftery, 2012).


Why is a series of individual logit models imprecise? In my initial example we have a variable (migration) that can have three values A (no migration), B (internal migration), C (international migration). With only one predictor variable x (age), multinomial models are parameterized as a series of binomial contrasts as follows (Long and Cheng, 2004 p. 277):

Eq. 1:  Ln(Pr(B|x)/Pr(A|x)) = b0,B|A + b1,B|A (x)  Eq. 2:  Ln(Pr(C|x)/Pr(A|x)) = b0,C|A + b1,C|A (x) Eq. 3:  Ln(Pr(B|x)/Pr(C|x)) = b0,B|C + b1,B|C (x) 

For these contrasts the following equations must hold:

Eq. 4: Ln(Pr(B|x)/Pr(A|x)) + Ln(Pr(C|x)/Pr(A|x)) = Ln(Pr(B|x)/Pr(C|x)) Eq. 5: b0,B|A + b0,C|A = b0,B|C Eq. 6: b1,B|A + b1,C|A = b1,B|C 

The problem is that these equations (Eq. 4-6) will in praxis not hold exactly because the coefficients are estimated based on slightly different samples since only cases from the two contrasting groups are used und cases from the third group are omitted. Programs that simultaneously estimate the multinomial contrasts make sure that Eq. 4-6 hold (Long and Cheng, 2004 p. 277). I don’t know exactly how this “simultaneous” model solving works – maybe someone can provide an explanation? Software that do simultaneous fitting of multilevel multinomial models include MLwiN (Steele 2013, p. 4) and STATA (xlmlogit command, Pope, 2014).


References:

Agresti, A. (2002). Categorical data analysis (2nd ed.). Hoboken, NJ: John Wiley & Sons.

Allison, P. D. (1984). Event history analysis. Thousand Oaks, CA: Sage Publications.

Begg, C. B., & Gray, R. (1984). Calculation of polychotomous logistic regression parameters using individualized regressions. Biometrika, 71(1), 11-18.

Long, S. J., & Cheng, S. (2004). Regression models for categorical outcomes. In M. Hardy & A. Bryman (Eds.), Handbook of data analysis (pp. 258-285). London: SAGE Publications, Ltd.

Pope, R. (2014). In the spotlight: Meet Stata's new xlmlogit command. Stata News, 29(2), 2-3.

Sevcikova, H., & Raftery, A. (2012). Estimation of multinomial logit model using the Begg & Gray approximation.

Steele, F. (2013). Module 10: Single-level and multilevel models for nominal responses concepts. Bristol, U.K,: Centre for Multilevel Modelling.

like image 179
Raphael Avatar answered Oct 26 '22 07:10

Raphael


An older question, but I think a viable option has recently emerged is brms, which uses the Bayesian Stan program to actually run the model For example, if you want to run a multinomial logistic regression on the iris data:

b1 <- brm (Species ~ Petal.Length + Petal.Width + Sepal.Length + Sepal.Width,            data=iris, family="categorical",            prior=c(set_prior ("normal (0, 8)"))) 

And to get an ordinal regression -- not appropriate for iris, of course -- you'd switch the family="categorical" to family="acat" (or cratio or sratio, depending on the type of ordinal regression you want) and make sure that the dependent variable is ordered.

Clarification per Raphael's comment: This brm call compiles your formula and arguments into Stan code. Stan compiles it into C++ and uses your system's C++ compiler -- which is required. On a Mac, for example, you may need to install the free Developer Tools to get C++. Not sure about Windows. Linux should have C++ installed by default.)

Clarification per Qaswed's comment: brms easily handles multilevel models as well using the R formula (1 | groupvar) to add a group (random) intercept for a group, (1 + foo | groupvar) to add a random intercept and slope, etc.

like image 23
Wayne Avatar answered Oct 26 '22 08:10

Wayne