How to use XGBoost in a PySpark Pipeline

Tags: apache-spark, pyspark, xgboost, apache-spark-ml, apache-spark-mllib

I want to update my PySpark code. In PySpark, the base model has to be placed in a pipeline, and the official pipeline demo uses LogisticRegression as the base model. However, it does not seem possible to use an XGBoost model in the Pipeline API. How can I use PySpark like this?

from xgboost import XGBClassifier
...
model = XGBClassifier()
model.fit(X_train, y_train)
pipeline = Pipeline(stages=[..., model, ...])
...

The Pipeline API is convenient to use, so can anybody give some advice? Thanks.

305

asked May 30 '18 10:05

Daniel Du

2 Answers

There is no XGBoost classifier in Apache Spark ML (as of version 2.3). The available models are listed here: https://spark.apache.org/docs/2.3.0/ml-classification-regression.html

If you want to use XGBoost, you should do it without PySpark (convert your Spark DataFrame to a pandas DataFrame with .toPandas()) or use another algorithm (https://spark.apache.org/docs/2.3.0/api/python/pyspark.ml.html#module-pyspark.ml.classification).
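A minimal sketch of the .toPandas() route, assuming the selected data fits in the driver's memory. The helper name fit_xgboost_on_driver and its parameters are hypothetical and only for illustration; toPandas() and XGBClassifier are the only pieces taken from the thread.

```python
def fit_xgboost_on_driver(spark_df, feature_cols, label_col, **xgb_params):
    """Collect a Spark DataFrame to the driver and fit a single-node XGBClassifier.

    Hypothetical helper for illustration; not part of pyspark or xgboost.
    """
    # Imported lazily so this module can be loaded on machines without xgboost.
    from xgboost import XGBClassifier

    # Keep only the needed columns, then pull every row into driver memory.
    # This is only reasonable when the selected data is small.
    pandas_df = spark_df.select(*feature_cols, label_col).toPandas()

    model = XGBClassifier(**xgb_params)
    model.fit(pandas_df[feature_cols], pandas_df[label_col])
    return model
```

It could be called as fit_xgboost_on_driver(df, ["age", "fare"], "label", n_estimators=50), where the column names are made up. The result is a plain xgboost model, not a Spark ML stage, so it still cannot be placed inside a Pipeline.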

But if you really want to use XGBoost with PySpark, you'll have to dig into PySpark and implement a distributed XGBoost yourself. Here is an article where they do so: http://dmlc.ml/2016/10/26/a-full-integration-of-xgboost-and-spark.html
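For readers on newer tooling, and going beyond what this answer covers: XGBoost releases from 1.7 onward ship an xgboost.spark module whose SparkXGBClassifier is a regular Spark ML estimator. Assuming such a version is installed, the pipeline from the question might be sketched as below. The helper name build_xgboost_pipeline and the "features" column name are hypothetical.

```python
def build_xgboost_pipeline(feature_cols, label_col="label", num_workers=1):
    """Return a Pipeline that assembles features and trains distributed XGBoost.

    Hypothetical helper; assumes pyspark and xgboost >= 1.7 are installed.
    """
    # Lazy imports, so the dependencies are only needed when this is called.
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from xgboost.spark import SparkXGBClassifier

    # Spark ML estimators expect a single vector column of features.
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")

    # An unfitted estimator: the Pipeline calls fit() on it, unlike the
    # pre-fitted XGBClassifier in the question.
    classifier = SparkXGBClassifier(
        features_col="features",
        label_col=label_col,
        num_workers=num_workers,
    )
    return Pipeline(stages=[assembler, classifier])
```

Usage would look like model = build_xgboost_pipeline(["age", "fare"]).fit(train_df), followed by model.transform(test_df), with train_df and test_df being Spark DataFrames of your own.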

103

answered Oct 06 '22 03:10

Pierre Gourseaud

There is a maintained distributed XGBoost library, used in production by several companies, as mentioned above (https://github.com/dmlc/xgboost). However, using it from PySpark is a bit tricky. Someone made a working PySpark wrapper for version 0.72 of the library, with 0.8 support in progress.

See https://medium.com/@bogdan.cojocar/pyspark-and-xgboost-integration-tested-on-the-kaggle-titanic-dataset-4e75a568bdb and, for the full discussion, https://github.com/dmlc/xgboost/issues/1698.

Make sure the XGBoost jars are on your PySpark jar path.
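One way to do that from Python is through the PYSPARK_SUBMIT_ARGS environment variable, which PySpark reads when it launches the JVM. The sketch below assumes that route; the helper name pyspark_submit_args, the jar file names, and the paths are all hypothetical placeholders.

```python
import os


def pyspark_submit_args(jar_paths):
    """Return a PYSPARK_SUBMIT_ARGS value that adds the given jars to the session."""
    return "--jars {} pyspark-shell".format(",".join(jar_paths))


# Hypothetical locations: point these at wherever the jars actually live.
XGBOOST_JARS = [
    "/path/to/xgboost4j-0.72.jar",
    "/path/to/xgboost4j-spark-0.72.jar",
]

# Must be set before the first SparkSession/SparkContext is created,
# because the value is only read when the JVM is started.
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args(XGBOOST_JARS)
```

The same jars can instead be passed on the command line with spark-submit --jars, which avoids touching the environment from inside the script.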

answered Oct 06 '22 01:10

Rafael
