
How to distribute xgboost module for use in spark?

I would like to use a pretrained xgboost classifier in pyspark, but the nodes on the cluster don't have the xgboost module installed. I can pickle the classifier I have trained and broadcast it, but this isn't enough, as the module still needs to be importable on each cluster node.
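To make the failure mode concrete, here is a minimal sketch of the pickle-and-broadcast approach described above. A plain dict stands in for the real xgboost model, and the `sc.broadcast` step is shown only in comments since it needs a live SparkContext; with a genuine xgboost object, the `pickle.loads` call on a worker would trigger `import xgboost` and fail there.

```python
import pickle

# Serialize the trained classifier once, on the driver.
# (A plain dict stands in for the xgboost model here.)
clf = {"kind": "xgboost-stub", "n_estimators": 100}
clf_bytes = pickle.dumps(clf)

# On a real cluster you would broadcast the bytes, e.g.:
#   bc = sc.broadcast(clf_bytes)
# and then, inside each task:
#   model = pickle.loads(bc.value)
# For a real xgboost model, that loads() call runs `import xgboost`
# on the worker and raises ImportError if the module isn't there.
model = pickle.loads(clf_bytes)
```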

I can't install it on the cluster nodes as I don't have root and there is no shared file system.

How can I distribute the xgboost classifier for use in spark?


I have an egg for xgboost. Could something like http://apache-spark-user-list.1001560.n3.nabble.com/Loading-Python-libraries-into-Spark-td7059.html or https://stackoverflow.com/a/24686708/2179021 work?

Asked by graffe, Sep 24 '16.


1 Answer

There is a really good blog post from Cloudera explaining this matter. All credit goes to them.

But to answer your question in short: no, it's not possible. Any complex third-party dependency needs to be installed and properly configured on each node of your cluster. For simple modules/dependencies, one can create *.egg, *.zip or *.py files and ship them to the cluster with the --py-files flag of spark-submit.
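The --py-files route mentioned above looks like the following (the file names and script are hypothetical). This works for pure-Python dependencies, but an xgboost egg built this way would still be missing the compiled native library on the workers:

```shell
# Hypothetical paths; ships pure-Python code to every executor.
# xgboost's compiled libxgboost.so would NOT be usable this way.
spark-submit \
  --master yarn \
  --py-files deps/mymodule.egg,deps/helpers.zip \
  score_with_model.py
```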

However, xgboost is a numerical package that depends heavily not only on other Python packages, but also on a specific low-level C++ library and compiler. If you were to ship compiled code to the cluster, you could hit errors arising from a different hardware architecture. Given that clusters are usually heterogeneous in hardware, doing this would be a very bad idea.

Answered by bear911, Sep 27 '22.