Does CrossValidator in PySpark distribute the execution?

Tags: parameters, machine-learning, apache-spark, pyspark

I am experimenting with machine learning in PySpark using a RandomForestClassifier; until now I have used scikit-learn. I am using CrossValidator to tune the parameters and select the best model. Sample code taken from Spark's website is below.

From what I have read, I cannot tell whether Spark also distributes the parameter tuning, or whether it works the same way as scikit-learn's GridSearchCV.

Any help would be really appreciated.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Prepare training documents, which are labeled.
training = spark.createDataFrame([
    (0, "a b c d e spark", 1.0),
    (1, "b d", 0.0),
    (2, "spark f g h", 1.0),
    (3, "hadoop mapreduce", 0.0),
    (4, "b spark who", 1.0),
    (5, "g d a y", 0.0),
    (6, "spark fly", 1.0),
    (7, "was mapreduce", 0.0),
    (8, "e spark program", 1.0),
    (9, "a e c l", 0.0),
    (10, "spark compile", 1.0),
    (11, "hadoop software", 0.0)
], ["id", "text", "label"])

# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

# We now treat the Pipeline as an Estimator, wrapping it in a CrossValidator instance.
# This will allow us to jointly choose parameters for all Pipeline stages.
# A CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# We use a ParamGridBuilder to construct a grid of parameters to search over.
# With 3 values for hashingTF.numFeatures and 2 values for lr.regParam,
# this grid will have 3 x 2 = 6 parameter settings for CrossValidator to choose from.
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()

crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice

# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)

asked Aug 21 '17 22:08

nEO

1 Answer

Spark 2.3+

SPARK-21911 added parallel model fitting to CrossValidator in PySpark. The level of parallelism is controlled with the parallelism Param, which defaults to 1 (models are fitted one at a time).

Spark < 2.3

It does not. Cross validation is implemented as a plain nested for loop:

for i in range(nFolds):
    ...
    for j in range(numModels):
        ...

Only the training of each individual model is distributed; the models themselves are fitted one after another.
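To make the difference concrete, here is a plain-Python sketch (no Spark needed) that contrasts the serial nested loop with a thread-pool version, which is roughly how I understand the 2.3+ PySpark implementation to submit fits from the driver. fit_and_score, param_maps, and n_folds are hypothetical stand-ins, not Spark APIs; the sleep only simulates the driver waiting on a distributed training job.

```python
from multiprocessing.pool import ThreadPool
import time

def fit_and_score(params):
    """Hypothetical stand-in for fitting one model on one fold and evaluating it."""
    time.sleep(0.05)  # pretend the driver is waiting on a distributed Spark job
    return params["regParam"] * 2

param_maps = [{"regParam": r} for r in (0.1, 0.01, 0.001)]
n_folds = 2

def serial_cv():
    # Spark < 2.3 style: folds in the outer loop, parameter settings in the inner loop,
    # one model fitted at a time.
    totals = [0.0] * len(param_maps)
    for i in range(n_folds):
        for j, params in enumerate(param_maps):
            totals[j] += fit_and_score(params)
    return [t / n_folds for t in totals]

def parallel_cv(parallelism):
    # Sketch of the 2.3+ idea: for each fold, several fits are submitted
    # concurrently from a pool of driver threads.
    totals = [0.0] * len(param_maps)
    pool = ThreadPool(processes=min(parallelism, len(param_maps)))
    try:
        for i in range(n_folds):
            for j, score in enumerate(pool.map(fit_and_score, param_maps)):
                totals[j] += score
    finally:
        pool.close()
        pool.join()
    return [t / n_folds for t in totals]
```

Both functions return the same averaged metrics; the second simply spends less wall-clock time waiting, because several fits are in flight at once.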

answered Sep 19 '22 14:09

Alper t. Turker
