I would like to use a pretrained xgboost classifier in pyspark but the nodes on the cluster don't have the xgboost module installed. I can pickle the classifier I have trained and broadcast it but this isn't enough as I still need the module to be loaded at each cluster node.
I can't install it on the cluster nodes as I don't have root and there is no shared file system.
How can I distribute the xgboost classifier for use in Spark?
I have an egg for xgboost. Could something like http://apache-spark-user-list.1001560.n3.nabble.com/Loading-Python-libraries-into-Spark-td7059.html or https://stackoverflow.com/a/24686708/2179021 work?
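Roughly what I have so far, as a sketch (it assumes an existing SparkContext sc and a model pickled locally; the file name and RDD are placeholders). It fails on the workers because xgboost can't be imported there:

```python
import pickle

# Load the model trained locally, where xgboost is installed (placeholder path)
with open("xgb_model.pkl", "rb") as f:
    clf = pickle.load(f)

# Ship the pickled model to the workers
bc_clf = sc.broadcast(clf)

def predict_partition(rows):
    # Accessing .value unpickles the model on the worker,
    # which fails with "No module named xgboost"
    model = bc_clf.value
    for features in rows:
        yield model.predict([features])[0]

# predictions = features_rdd.mapPartitions(predict_partition)
```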
XGBoost will automatically repartition the input data to match the number of XGBoost workers, so it is worth repartitioning the data in Spark up front to avoid that extra shuffle. For example, when reading from a single CSV file, it is common to repartition the resulting DataFrame.
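A minimal sketch of that repartitioning step (the file path and worker count are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A single CSV file often loads into very few partitions
df = spark.read.csv("hdfs:///data/train.csv", header=True, inferSchema=True)

# Match the partition count to the number of XGBoost workers up front,
# so XGBoost does not have to reshuffle the data itself
num_workers = 16  # placeholder
df = df.repartition(num_workers)
```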
There is a really good blog post from Cloudera explaining this matter. All credit goes to them.
But to answer your question in short - no, it's not possible. Any complex third-party dependency needs to be installed and configured properly on each node of your cluster. For simple, pure-Python modules/dependencies you can create *.egg, *.zip or *.py files and ship them to the cluster with the --py-files flag of spark-submit. However, xgboost is a numerical package that depends heavily not only on other Python packages, but also on a specific C++ library and compiler, i.e. low-level, compiled code. If you were to ship compiled code to the cluster, you could run into errors caused by differences in hardware architecture, and since clusters are often heterogeneous in terms of hardware, doing so is a very bad idea.