 

How can I integrate XGBoost with Spark? (Python)

I am trying to train a model using XGBoost on data I have in Hive. The data is too large and I can't convert it to a pandas DataFrame, so I have to use XGBoost with a Spark DataFrame. When creating an XGBoostEstimator, an error occurs:

TypeError: 'JavaPackage' object is not callable Exception AttributeError: "'NoneType' object has no attribute '_detach'" in ignored

I have no experience with XGBoost for Spark; I have tried a few tutorials online but none worked. I tried to convert to a pandas DataFrame, but the data is too large and I always get an OutOfMemoryException from the Java wrapper (I also looked this up, and the suggested solution of raising the executor memory did not work for me).
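For reference, the kind of memory bump I tried looks roughly like this (a sketch with illustrative values, not my exact config; note that toPandas() collects everything onto the driver, so the driver-side limits matter as much as the executor memory):

from pyspark.sql import SparkSession

# Illustrative values only -- toPandas() pulls the whole dataset onto the driver,
# so spark.driver.memory and spark.driver.maxResultSize are usually the limits
# that the OutOfMemoryException is really hitting, not just the executor memory.
spark = SparkSession \
    .builder \
    .master("yarn") \
    .appName("dna_pipeline") \
    .config("spark.executor.memory", "8g") \
    .config("spark.driver.memory", "8g") \
    .config("spark.driver.maxResultSize", "8g") \
    .getOrCreate()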

The latest tutorial I was following is:

https://towardsdatascience.com/pyspark-and-xgboost-integration-tested-on-the-kaggle-titanic-dataset-4e75a568bdb

After giving up on the XGBoost module, I started using sparkxgb.

import os
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

spark = create_spark_session('shai', 'dna_pipeline')
# ship the sparkxgb python wrapper to the executors
spark.sparkContext.addPyFile('resources/sparkxgb.zip')

def create_spark_session(username=None, app_name="pipeline"):
    if username is not None:
        os.environ['HADOOP_USER_NAME'] = username

    return SparkSession \
        .builder \
        .master("yarn") \
        .appName(app_name) \
        .config(...) \
        .config(...) \
        .getOrCreate()

def train():
    train_df = spark.table('dna.offline_features_train_full')
    test_df = spark.table('dna.offline_features_test_full')

    from sparkxgb import XGBoostEstimator

    vectorAssembler = VectorAssembler() \
        .setInputCols(train_df.columns) \
        .setOutputCol("features")

    # This is where the program fails
    xgboost = XGBoostEstimator(
        featuresCol="features",
        labelCol="label",
        predictionCol="prediction"
    )

    pipeline = Pipeline().setStages([xgboost])
    pipeline.fit(train_df)

The full exception is:

Traceback (most recent call last):
  File "/home/elad/DNA/dna/dna/run.py", line 283, in <module>
    main()
  File "/home/elad/DNA/dna/dna/run.py", line 247, in main
    offline_model = train_model(True, home_dir=config['home_dir'], hdfs_client=client)
  File "/home/elad/DNA/dna/dna/run.py", line 222, in train_model
    model = train(offline_mode=offline, spark=spark)
  File "/home/elad/DNA/dna/dna/model/xgboost_train.py", line 285, in train
    predictionCol="prediction"
  File "/home/elad/.conda/envs/DNAenv/lib/python2.7/site-packages/pyspark/__init__.py", line 105, in wrapper
    return func(self, **kwargs)
  File "/tmp/spark-7781039b-6821-42be-96e0-ca4005107318/userFiles-70b3d1de-a78c-4fac-b252-2f99a6761b32/sparkxgb.zip/sparkxgb/xgboost.py", line 115, in __init__
  File "/home/elad/.conda/envs/DNAenv/lib/python2.7/site-packages/pyspark/ml/wrapper.py", line 63, in _new_java_obj
    return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable
Exception AttributeError: "'NoneType' object has no attribute '_detach'" in <bound method XGBoostEstimator.__del__ of XGBoostEstimator_4f54b37156fb0a113233> ignored

I have no idea why this exception happens nor do I know how to properly integrate sparkxgb into my code.

Help would be appreciated.

Thanks.

Elad Cohen asked Sep 15 '19



1 Answer

After a day of debugging the hell out of this module, it turned out the problem was simply that the jars were not being submitted correctly. I downloaded the jars locally and passed them to PySpark using:

PYSPARK_SUBMIT_ARGS=--jars resources/xgboost4j-0.72.jar,resources/xgboost4j-spark-0.72.jar

This fixed the problem.
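If PySpark is launched from a plain Python process rather than through spark-submit, a minimal sketch of the same fix looks like this (the jar paths and version are the ones from above and are assumptions about your local layout; the trailing pyspark-shell token is required when the args are set via this environment variable):

import os

# Must be set before the first SparkSession/SparkContext is created,
# otherwise the JVM has already started without the XGBoost jars on the classpath.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--jars resources/xgboost4j-0.72.jar,resources/xgboost4j-spark-0.72.jar '
    'pyspark-shell'
)

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("yarn") \
    .appName("dna_pipeline") \
    .getOrCreate()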

Elad Cohen answered Sep 18 '22