 

How can I integrate XGBoost with Spark? (Python)

I am trying to train a model using XGBoost on data I have in Hive. The data is too large and I can't convert it to a pandas DataFrame, so I have to use XGBoost with a Spark DataFrame. When creating an XGBoostEstimator, an error occurs:

TypeError: 'JavaPackage' object is not callable Exception AttributeError: "'NoneType' object has no attribute '_detach'" in ignored

I have no experience with XGBoost for Spark; I have tried a few tutorials online but none worked. I tried to convert to a pandas DataFrame, but the data is too large and I always get an OutOfMemoryException from the Java wrapper (I also looked this up, and the suggested solution of raising the executor memory did not work for me).
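For reference, the kind of memory bump I tried looks roughly like this (a sketch with illustrative values, not my exact config; note that toPandas() collects everything onto the driver, so the driver-side limits matter as much as the executor memory):

from pyspark.sql import SparkSession

# Illustrative values only -- toPandas() pulls the whole dataset onto the driver,
# so spark.driver.memory and spark.driver.maxResultSize are usually the limits
# that the OutOfMemoryException is really hitting, not just the executor memory.
spark = SparkSession \
    .builder \
    .master("yarn") \
    .appName("dna_pipeline") \
    .config("spark.executor.memory", "8g") \
    .config("spark.driver.memory", "8g") \
    .config("spark.driver.maxResultSize", "8g") \
    .getOrCreate()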

The latest tutorial I was following is:

https://towardsdatascience.com/pyspark-and-xgboost-integration-tested-on-the-kaggle-titanic-dataset-4e75a568bdb

After giving up on the XGBoost module, I started using sparkxgb.

import os
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler

spark = create_spark_session('shai', 'dna_pipeline')
# ship the sparkxgb python wrapper to the executors
spark.sparkContext.addPyFile('resources/sparkxgb.zip')

def create_spark_session(username=None, app_name="pipeline"):
    if username is not None:
        os.environ['HADOOP_USER_NAME'] = username

    return SparkSession \
        .builder \
        .master("yarn") \
        .appName(app_name) \
        .config(...) \
        .config(...) \
        .getOrCreate()

def train():
    train_df = spark.table('dna.offline_features_train_full')
    test_df = spark.table('dna.offline_features_test_full')

    from sparkxgb import XGBoostEstimator

    vectorAssembler = VectorAssembler() \
        .setInputCols(train_df.columns) \
        .setOutputCol("features")

    # This is where the program fails
    xgboost = XGBoostEstimator(
        featuresCol="features",
        labelCol="label",
        predictionCol="prediction"
    )

    pipeline = Pipeline().setStages([xgboost])
    pipeline.fit(train_df)

The full exception is:

Traceback (most recent call last):
  File "/home/elad/DNA/dna/dna/run.py", line 283, in <module>
    main()
  File "/home/elad/DNA/dna/dna/run.py", line 247, in main
    offline_model = train_model(True, home_dir=config['home_dir'], hdfs_client=client)
  File "/home/elad/DNA/dna/dna/run.py", line 222, in train_model
    model = train(offline_mode=offline, spark=spark)
  File "/home/elad/DNA/dna/dna/model/xgboost_train.py", line 285, in train
    predictionCol="prediction"
  File "/home/elad/.conda/envs/DNAenv/lib/python2.7/site-packages/pyspark/__init__.py", line 105, in wrapper
    return func(self, **kwargs)
  File "/tmp/spark-7781039b-6821-42be-96e0-ca4005107318/userFiles-70b3d1de-a78c-4fac-b252-2f99a6761b32/sparkxgb.zip/sparkxgb/xgboost.py", line 115, in __init__
  File "/home/elad/.conda/envs/DNAenv/lib/python2.7/site-packages/pyspark/ml/wrapper.py", line 63, in _new_java_obj
    return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable
Exception AttributeError: "'NoneType' object has no attribute '_detach'" in <bound method XGBoostEstimator.__del__ of XGBoostEstimator_4f54b37156fb0a113233> ignored

I have no idea why this exception happens nor do I know how to properly integrate sparkxgb into my code.

Help would be appreciated.

Thanks.

Elad Cohen asked Sep 15 '19



1 Answer

After a day of debugging the hell out of this module, it turned out the problem was simply that the jars were not being submitted correctly. I downloaded the jars locally and passed them to PySpark using:

PYSPARK_SUBMIT_ARGS=--jars resources/xgboost4j-0.72.jar,resources/xgboost4j-spark-0.72.jar

This fixed the problem.
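If PySpark is launched from a plain Python process rather than through spark-submit, a minimal sketch of the same fix looks like this (the jar paths and version are the ones from above and are assumptions about your local layout; the trailing pyspark-shell token is required when the args are set via this environment variable):

import os

# Must be set before the first SparkSession/SparkContext is created,
# otherwise the JVM has already started without the XGBoost jars on the classpath.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--jars resources/xgboost4j-0.72.jar,resources/xgboost4j-spark-0.72.jar '
    'pyspark-shell'
)

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master("yarn") \
    .appName("dna_pipeline") \
    .getOrCreate()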

Elad Cohen answered Sep 18 '22