I want to update my PySpark code. In PySpark, the base model has to be placed in a Pipeline, and the official Pipeline demo uses LogisticRegression as the base model. However, it does not seem possible to use an XGBoost model in the Pipeline API. How can I use PySpark like this?
from xgboost import XGBClassifier
...
model = XGBClassifier()
model.fit(X_train, y_train)
pipeline = Pipeline(stages=[..., model, ...])
...
The Pipeline API is convenient to use, so can anybody give some advice? Thanks.
As usual, you start by importing xgboost and the other libraries you will use to build the model. You can install Python libraries such as xgboost by running pip install xgboost from the command line. Separate the target variable from the rest of the variables by subsetting the data with .iloc.
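The .iloc split described above can be sketched as follows. This is a minimal illustration, assuming the target is the last column of a pandas DataFrame; the column names and values are made up for the example.

```python
import pandas as pd

# Hypothetical toy data: two feature columns and a target in the last column.
df = pd.DataFrame({
    "age": [22, 38, 26, 35],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "survived": [0, 1, 1, 1],
})

# All rows, every column except the last one: the features.
X = df.iloc[:, :-1]

# All rows, only the last column: the target.
y = df.iloc[:, -1]
```

X and y can then be passed to XGBClassifier().fit(X, y), as in the question.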
The built-in distributed XGBoost algorithm works on numerical tabular data. Each row of a dataset represents one instance, and each column of a dataset represents a feature value. The target column represents the value you want to predict.
XGBoost has bindings for various languages, including Python, and it integrates nicely with scikit-learn, the machine learning framework commonly used by Python data scientists. It can be used to solve classification and regression problems, so it is suitable for the vast majority of common data science challenges.
There is no XGBoost classifier in Apache Spark ML (as of version 2.3). The available models are listed here: https://spark.apache.org/docs/2.3.0/ml-classification-regression.html
If you want to use XGBoost, you should do it without pyspark (convert your Spark dataframe to a pandas dataframe with .toPandas()) or use another algorithm (https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#module-pyspark.ml.classification).
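The .toPandas() route could look roughly like this. It is a sketch, not a definitive implementation: the function name and its parameters are hypothetical, and it assumes the selected data fits in the driver's memory, because .toPandas() collects every row onto a single machine.

```python
def fit_xgb_on_driver(spark_df, feature_cols, label_col):
    """Train a single-machine XGBClassifier on data pulled out of Spark.

    spark_df     -- a pyspark.sql.DataFrame
    feature_cols -- list of feature column names (hypothetical parameter)
    label_col    -- name of the target column (hypothetical parameter)
    """
    # Imported lazily so the function can be defined without xgboost installed.
    from xgboost import XGBClassifier

    # Keep only the needed columns, then collect them to the driver as pandas.
    pdf = spark_df.select(*feature_cols, label_col).toPandas()

    X = pdf[feature_cols]
    y = pdf[label_col]

    model = XGBClassifier()
    model.fit(X, y)
    return model
```

The returned model is a plain scikit-learn style estimator, so it cannot be added to a pyspark.ml Pipeline as a stage; training and prediction happen outside Spark.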
But if you really want to use XGBoost with pyspark, you'll have to dive into pyspark and implement a distributed XGBoost yourself. Here is an article that does so: http://dmlc.ml/2016/10/26/a-full-integration-of-xgboost-and-spark.html
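As an alternative to implementing this by hand on newer setups: the xgboost Python package ships an xgboost.spark module from release 1.7 onward, which postdates the Spark 2.3 situation described here. Its SparkXGBClassifier is a pyspark.ml estimator, so it can be used as a Pipeline stage. The sketch below assumes xgboost 1.7 or later, an already running SparkSession, and hypothetical column names; the helper name is also made up.

```python
def build_xgb_pipeline(feature_cols, label_col="label", num_workers=1):
    """Return an unfitted Pipeline: VectorAssembler followed by SparkXGBClassifier.

    Must be called after a SparkSession has been created, because
    VectorAssembler is backed by a JVM object.
    """
    # Imported lazily so the function can be defined without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from xgboost.spark import SparkXGBClassifier  # assumes xgboost >= 1.7

    # Pack the raw feature columns into one vector column.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

    # Distributed XGBoost estimator reading the assembled vector column.
    classifier = SparkXGBClassifier(
        features_col="features",
        label_col=label_col,
        num_workers=num_workers,
    )

    return Pipeline(stages=[assembler, classifier])
```

Usage would be along the lines of build_xgb_pipeline(["f1", "f2"]).fit(train_df), where train_df is a Spark dataframe containing those columns and a "label" column.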
There is a maintained distributed XGBoost library, used in production by several companies, as mentioned above (https://github.com/dmlc/xgboost). However, using it from PySpark is a bit tricky. Someone made a working PySpark wrapper for version 0.72 of the library, with 0.8 support in progress.
See https://medium.com/@bogdan.cojocar/pyspark-and-xgboost-integration-tested-on-the-kaggle-titanic-dataset-4e75a568bdb and https://github.com/dmlc/xgboost/issues/1698 for the full discussion.
Make sure the xgboost jars are in your pyspark jar path.
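One common way to get extra jars onto the PySpark classpath is the PYSPARK_SUBMIT_ARGS environment variable, set before the SparkContext is created. The sketch below assumes that approach; the jar file names are placeholders for wherever your xgboost4j jars actually live, and the helper function is hypothetical.

```python
import os


def build_submit_args(jar_paths):
    """Build a PYSPARK_SUBMIT_ARGS value that adds the given jars."""
    # --jars takes a comma-separated list; "pyspark-shell" must come last.
    return "--jars " + ",".join(jar_paths) + " pyspark-shell"


# Placeholder jar names: replace with the real paths to your xgboost jars.
jars = ["xgboost4j-0.72.jar", "xgboost4j-spark-0.72.jar"]

# Must be set before the first SparkSession / SparkContext is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(jars)
```

If a Spark session is already running, the variable has no effect until the session is restarted.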