Is it possible to run a mixed-effects regression model in Spark, as we can with lme4 in R, MixedModels in Julia, or statsmodels MixedLM in Python?
Any example would be great.
I've read there is a GLMix function but I don't know if the user can use it directly to fit a model and get the coefficients and p-values or if it can only be used internally by machine learning libraries.
I would like to move to Spark because my datasets are much bigger than memory.
Is there any other common database or framework able to do something like that streaming data from disk?
I've only seen some able to do simple linear regression.
Regards
A mixed-effects model is a statistical model used to predict a single variable from two or more other variables. It is also used to quantify the numerical relationship between that variable and the others. In the linear case, the variable you want to predict should be continuous, and your data should meet the model's other assumptions.
We focus here on mixed-model (or mixed-effects) regression analysis, which means that the model posited to describe the data contains both fixed effects and random effects. Fixed effects are those aspects of the model that (are assumed to) describe systematic features in the data.
Mixed effect modeling allows both fixed (aka marginal) and random effects, while GEE modeling allows for fixed effects alone. A fixed effect is akin to a population effect: some measured variable is believed to have a single effect across the population.
Machine learning algorithms can use mixed models to conceptualize data in a way that allows for understanding the effects of phenomena both between groups, and within them.
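To make the fixed/random distinction concrete, here is a minimal Python sketch using only the standard library. It simulates data from a random-intercept model, y = beta0 + beta1*x + u_group + noise, where beta1 plays the role of the fixed effect and u_group is a per-group random intercept. It then recovers the slope from within-group variation by subtracting each group's mean. The function names, parameter values and seed are all made up for illustration, and the within-group estimator is a simple stand-in, not the likelihood-based fit that lme4 or MixedLM performs.

```python
import random
from collections import defaultdict


def simulate(n_groups=200, n_per_group=20, beta0=1.0, beta1=2.0,
             sd_group=3.0, sd_noise=1.0, seed=0):
    """Draw rows (group, x, y) from y = beta0 + beta1*x + u_group + noise."""
    rng = random.Random(seed)
    rows = []
    for g in range(n_groups):
        u = rng.gauss(0.0, sd_group)  # random intercept shared by this group
        for _ in range(n_per_group):
            x = rng.gauss(0.0, 1.0)
            y = beta0 + beta1 * x + u + rng.gauss(0.0, sd_noise)
            rows.append((g, x, y))
    return rows


def within_group_slope(rows):
    """Estimate the slope after subtracting each group's mean from x and y."""
    # First pass: per-group count, sum of x, sum of y.
    sums = defaultdict(lambda: [0, 0.0, 0.0])
    for g, x, y in rows:
        s = sums[g]
        s[0] += 1
        s[1] += x
        s[2] += y
    # Second pass: regress demeaned y on demeaned x.
    sxy = sxx = 0.0
    for g, x, y in rows:
        n, sx, sy = sums[g]
        dx, dy = x - sx / n, y - sy / n
        sxy += dx * dy
        sxx += dx * dx
    return sxy / sxx


rows = simulate()
slope = within_group_slope(rows)
print(f"within-group slope estimate: {slope:.3f} (true value 2.0)")
```

Both passes only accumulate sums, so the same idea could in principle be applied to rows read in chunks from disk. It does not, however, give you variance components or p-values the way a full mixed-model fit does.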
Yes, this is definitely possible with Spark.
The first thing I would look into is Spark's own machine learning library, MLlib. I am not sure whether it supports exactly the kind of model you need, but it definitely offers more than 'simple linear regression'.
Another library, 'linkedin/photon-ml', which I am not familiar with, does explicitly mention mixed-effect models.
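Here is an example of using its Generalized Additive Mixed Effects (GAME) training driver. As far as I can tell, this particular command defines only a single coordinate named global, so it looks like a fixed-effect logistic regression; random effects would presumably need additional coordinates: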
spark-submit \
--class com.linkedin.photon.ml.cli.game.GameTrainingDriver \
--master local[*] \
--num-executors 4 \
--driver-memory 1G \
--executor-memory 1G \
"./build/photon-all_2.10/libs/photon-all_2.10-1.0.0.jar" \
--input-data-directories "./a1a/train/" \
--validation-data-directories "./a1a/test/" \
--root-output-directory "out" \
--feature-shard-configurations "name=globalShard,feature.bags=features" \
--coordinate-configurations "name=global,feature.shard=globalShard,min.partitions=4,optimizer=LBFGS,tolerance=1.0E-6,max.iter=50,regularization=L2,reg.weights=0.1|1|10|100" \
--coordinate-update-sequence "global" \
--coordinate-descent-iterations 1 \
--training-task "LOGISTIC_REGRESSION"