How to vectorize DataFrame columns for ML algorithms?

Tags:

have a DataFrame with some categorical string values (e.g uuid|url|browser).

I would to convert it in a double to execute an ML algorithm that accept double matrix.

As convertion method I used StringIndexer (spark 1.4) that map my string values to double values, so I defined a function like this:

def str(arg: String, df:DataFrame) : DataFrame =
   (
    val indexer = new StringIndexer().setInputCol(arg).setOutputCol(arg+"_index")
    val newDF = indexer.fit(df).transform(df)
    return newDF
   )

Now the issue is that i would iterate foreach column of a df, call this function and add (or convert) the original string column in the parsed double column, so the result would be:

Initial df:

[String: uuid|String: url| String: browser]

Final df:

[String: uuid|Double: uuid_index|String: url|Double: url_index|String: browser|Double: Browser_index]

Thanks in advance

326

asked Sep 02 '15 15:09

fase_jhn

1 Answers

You can simply foldLeft over the Array of columns:

val transformed: DataFrame = df.columns.foldLeft(df)((df, arg) => str(arg, df))

Still, I will argue that it is not a good approach. Since src discards StringIndexerModel it cannot be used when you get new data. Because of that I would recommend using Pipeline:

import org.apache.spark.ml.Pipeline

val transformers: Array[org.apache.spark.ml.PipelineStage] = df.columns.map(
   cname => new StringIndexer()
     .setInputCol(cname)
     .setOutputCol(s"${cname}_index")
)

// Add the rest of your pipeline like VectorAssembler and algorithm
val stages: Array[org.apache.spark.ml.PipelineStage] = transformers ++ ???

val pipeline = new Pipeline().setStages(stages)
val model = pipeline.fit(df)
model.transform(df)

VectorAssembler can be included like this:

val assembler  = new VectorAssembler()
    .setInputCols(df.columns.map(cname => s"${cname}_index"))
    .setOutputCol("features")

val stages = transformers :+ assembler

You could also use RFormula, which is less customizable, but much more concise:

import org.apache.spark.ml.feature.RFormula

val rf = new RFormula().setFormula(" ~ uuid + url + browser - 1")
val rfModel = rf.fit(dataset)
rfModel.transform(dataset)

122

answered Oct 05 '22 16:10

zero323

Related questions
                            
                                why AnyVal can be converted into AnyRef at run time in Scala?
                            
                                Scala equivalent of Python's "in" operator for sets?
                            
                                Scala trait function: return instance of derived type
                            
                                Scala for ( ) vs for { }
                            
                                How to add Options to a List
                            
                                Difference between Array and List initialization in Scala
                            
                                Serialization Exception on spark
                            
                                Scala - String to Url
                            
                                Scala partial sum with current and all past elements in the list
                            
                                WeakTypeTag v. TypeTag
                            
                                Option.map (null) returns Some (null)
                            
                                Importance of Akka Routers
                            
                                What is monad analog in Java?
                            
                                Using `firstOption` with slick 3
                            
                                How to make scalatest matcher to ignore white-spaces when compare two strings?
                            
                                Using shapeless scala to merge the fields of two different case classes
                            
                                read application.conf from build.sbt
                            
                                "Plugin Scala is incompatible with this installation" error with IntelliJ 14
                            
                                SBT 0.13.0 - can't expand macros compiled by previous versions of Scala
                            
                                Is it possible to make an Akka HTTP core client request inside an Actor?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to vectorize DataFrame columns for ML algorithms?

Tags:

scala

apache-spark

apache-spark-ml

apache-spark-mllib

fase_jhn

People also ask

1 Answers

zero323

Recent Activity

Donate For Us