I would like to use a pretrained xgboost classifier in pyspark but the nodes on the cluster don't have the xgboost module installed. I can pickle the classifier I have trained and broadcast it but this isn't enough as I still need the module to be loaded at each cluster node.
I can't install it on the cluster nodes as I don't have root and there is no shared file system.
How can I distribute the xgboost classifier for use in Spark?
I have an egg for xgboost. Could something like http://apache-spark-user-list.1001560.n3.nabble.com/Loading-Python-libraries-into-Spark-td7059.html or https://stackoverflow.com/a/24686708/2179021 work?
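Roughly what I have so far, as a sketch (it assumes an existing SparkContext sc and a model pickled locally; the file name and RDD are placeholders). It fails on the workers because xgboost can't be imported there:

```python
import pickle

# Load the model trained locally, where xgboost is installed (placeholder path)
with open("xgb_model.pkl", "rb") as f:
    clf = pickle.load(f)

# Ship the pickled model to the workers
bc_clf = sc.broadcast(clf)

def predict_partition(rows):
    # Accessing .value unpickles the model on the worker,
    # which fails with "No module named xgboost"
    model = bc_clf.value
    for features in rows:
        yield model.predict([features])[0]

# predictions = features_rdd.mapPartitions(predict_partition)
```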
XGBoost will automatically repartition the input data to match the number of XGBoost workers, so it is worth repartitioning the data in Spark up front to avoid that extra shuffle. For example, when reading from a single CSV file, it is common to repartition the resulting DataFrame.
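A minimal sketch of that repartitioning step (the file path and worker count are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A single CSV file often loads into very few partitions
df = spark.read.csv("hdfs:///data/train.csv", header=True, inferSchema=True)

# Match the partition count to the number of XGBoost workers up front,
# so XGBoost does not have to reshuffle the data itself
num_workers = 16  # placeholder
df = df.repartition(num_workers)
```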
There is a really good blog post from Cloudera explaining this matter. All credit goes to them.
But to answer your question in short - no, it's not possible. Any complex third-party dependency needs to be installed and configured properly on each node of your cluster. For simple, pure-Python modules/dependencies you can create *.egg, *.zip or *.py files and ship them to the cluster with the --py-files flag of spark-submit. However, xgboost is a numerical package that depends heavily not only on other Python packages, but also on a specific C++ library and compiler, i.e. low-level, compiled code. If you were to ship compiled code to the cluster, you could run into errors caused by differences in hardware architecture, and since clusters are often heterogeneous in terms of hardware, doing so is a very bad idea.