Mixed Effects Models in Spark or other technology

Is it possible to fit a mixed-effects regression model in Spark, as we can with lme4 in R, MixedModels in Julia, or statsmodels MixedLM in Python?
Any example would be great.

I've read that there is a GLMix function, but I don't know whether a user can call it directly to fit a model and get the coefficients and p-values, or whether it is only used internally by machine learning libraries.

I would like to move to Spark because my datasets are much bigger than memory.

Is there any other common database or framework that can fit this kind of model while streaming the data from disk?
The ones I've seen so far can only do simple linear regression.

Regards

353

asked Sep 30 '16 11:09

skan

1 Answer

Yes, this is definitely possible with Spark.

The first thing I would look into is MLlib, Spark's own rather popular machine learning library. I am not sure it supports exactly the kind of model you need, but it certainly offers more than 'simple linear regression'.

Another library, 'linkedin/photon-ml', which I am not familiar with, does explicitly mention mixed-effect models.

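Here is an example of using its Generalized Additive Mixed Effects (GAME) training driver. As far as I can tell, this particular command defines only a single coordinate named "global", i.e. a fixed-effect logistic regression, so random-effect coordinates would still have to be added to the configuration: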

spark-submit \
  --class com.linkedin.photon.ml.cli.game.GameTrainingDriver \
  --master local[*] \
  --num-executors 4 \
  --driver-memory 1G \
  --executor-memory 1G \
  "./build/photon-all_2.10/libs/photon-all_2.10-1.0.0.jar" \
  --input-data-directories "./a1a/train/" \
  --validation-data-directories "./a1a/test/" \
  --root-output-directory "out" \
  --feature-shard-configurations "name=globalShard,feature.bags=features" \
  --coordinate-configurations "name=global,feature.shard=globalShard,min.partitions=4,optimizer=LBFGS,tolerance=1.0E-6,max.iter=50,regularization=L2,reg.weights=0.1|1|10|100" \
  --coordinate-update-sequence "global" \
  --coordinate-descent-iterations 1 \
  --training-task "LOGISTIC_REGRESSION"

177

answered Sep 30 '22 21:09

Dennis Jaheruddin

Tags:

apache-spark

regression

mixed-models
