All,
I have an ML pipeline set up as below:
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.types.{StructType,StructField,DoubleType}
import org.apache.spark.ml.Pipeline
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import scala.util.Random
val nRows = 10000
val nCols = 1000
val data = sc.parallelize(0 to nRows-1).map { _ => Row.fromSeq(Seq.fill(nCols)(Random.nextDouble)) }
val schema = StructType((0 to nCols-1).map { i => StructField("C" + i, DoubleType, true) } )
val df = spark.createDataFrame(data, schema)
df.cache()
//Get continuous feature names and discretize them
val continuous = df.dtypes.filter(_._2 == "DoubleType").map (_._1)
val discretizers = continuous.map(c => new QuantileDiscretizer().setInputCol(c).setOutputCol(s"${c}_disc").setNumBuckets(3).fit(df))
val pipeline = new Pipeline().setStages(discretizers)
val model = pipeline.fit(df)
When I run this, Spark seems to set up each discretizer as a separate job. Is there a way to run all the discretizers as a single job, with or without a pipeline? Thanks for the help, appreciate it.
QuantileDiscretizer takes a column with continuous features and outputs a column with binned categorical features. The number of bins can be set using the numBuckets parameter.
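For reference, a minimal single-column sketch looks like this (the small illustrative dataset and the "hour" column name are assumptions for the example, not the OP's data):

import org.apache.spark.ml.feature.QuantileDiscretizer
// Small illustrative dataset with one continuous column, "hour"
val sample = Array((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))
val sampleDf = spark.createDataFrame(sample).toDF("id", "hour")
// Bin "hour" into 3 quantile-based buckets
val disc = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("hour_disc")
  .setNumBuckets(3)
// fit() computes the quantile splits; transform() applies the binning
disc.fit(sampleDf).transform(sampleDf).show()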
Support for this feature was added in Spark 2.3.0 (see the release docs). You can now use setInputCols and setOutputCols to specify multiple columns, although this does not yet seem to be reflected in the official docs. Performance is greatly improved with this patch compared to handling each column as a separate job.
Your example may be adapted as follows:
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.types.{StructType,StructField,DoubleType}
import org.apache.spark.ml.Pipeline
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import scala.util.Random
val nRows = 10000
val nCols = 1000
val data = sc.parallelize(0 to nRows-1).map { _ => Row.fromSeq(Seq.fill(nCols)(Random.nextDouble)) }
val schema = StructType((0 to nCols-1).map { i => StructField("C" + i, DoubleType, true) } )
val df = spark.createDataFrame(data, schema)
df.cache()
//Get continuous feature names and discretize them
val continuous = df.dtypes.filter(_._2 == "DoubleType").map (_._1)
val discretizer = new QuantileDiscretizer()
.setInputCols(continuous)
.setOutputCols(continuous.map(c => s"${c}_disc"))
.setNumBuckets(3)
val pipeline = new Pipeline().setStages(Array(discretizer))
val model = pipeline.fit(df)
model.transform(df)
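To sanity-check the result, you could, for example, look at one of the original columns next to its bucketized counterpart (the _disc column names here follow the naming used above):

// Compare an original continuous column with its discretized output
model.transform(df).select("C0", "C0_disc").show(5)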