I am trying to run a random forest classification using the Spark ML API, but I am having issues with creating the right DataFrame input for the pipeline.
Here is the sample data:
age,hours_per_week,education,sex,salaryRange
38,40,"hs-grad","male","A"
28,40,"bachelors","female","A"
52,45,"hs-grad","male","B"
31,50,"masters","female","B"
42,40,"bachelors","male","B"
age and hours_per_week are integers, while the other features, including the label salaryRange, are categorical (String).
Loading this CSV file (let's call it sample.csv) can be done with the Spark CSV library like this:
val data = sqlContext.csvFile("/home/dusan/sample.csv")
By default all columns are imported as strings, so we need to change "age" and "hours_per_week" to Int:
val toInt = udf[Int, String]( _.toInt)
val dataFixed = data.withColumn("age", toInt(data("age"))).withColumn("hours_per_week", toInt(data("hours_per_week")))
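(As an aside, a minimal alternative sketch, assuming the same `data` DataFrame loaded above: the conversion can also be done with Column.cast instead of a UDF. The name dataFixedAlt is just for illustration.)

// Sketch of an alternative to the UDF: cast the string columns directly.
// Assumes `data` is the DataFrame read by the Spark CSV library above.
val dataFixedAlt = data
  .withColumn("age", data("age").cast("int"))
  .withColumn("hours_per_week", data("hours_per_week").cast("int"))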
Just to check how the schema looks now:
scala> dataFixed.printSchema
root
 |-- age: integer (nullable = true)
 |-- hours_per_week: integer (nullable = true)
 |-- education: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- salaryRange: string (nullable = true)
Then let's set up the cross-validator and pipeline:
val rf = new RandomForestClassifier()
val pipeline = new Pipeline().setStages(Array(rf))
val cv = new CrossValidator().setNumFolds(10).setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator)
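(Side note, independent of the error below: CrossValidator in spark.ml also expects a parameter grid via setEstimatorParamMaps. A minimal sketch, with a hypothetical grid over numTrees, could look like this:)

import org.apache.spark.ml.tuning.ParamGridBuilder

// Hypothetical grid over the number of trees; without setEstimatorParamMaps
// the cross-validator has nothing to search over.
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(10, 50))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(10)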
The error shows up when running this line:
val cmModel = cv.fit(dataFixed)
java.lang.IllegalArgumentException: Field "features" does not exist.
It is possible to set the label column and features column in RandomForestClassifier; however, I have 4 columns as predictors (features), not only one.
How should I organize my data frame so that the label and features columns are organized correctly?
For your convenience, here is the full code:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.CrossValidator
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.{Vector, Vectors}

object SampleClassification {

  def main(args: Array[String]): Unit = {

    //set spark context
    val conf = new SparkConf().setAppName("Simple Application").setMaster("local");
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._
    import com.databricks.spark.csv._

    //load data by using databricks "Spark CSV Library"
    val data = sqlContext.csvFile("/home/dusan/sample.csv")

    //by default all columns are imported as string so we need to change "age" and "hours_per_week" to Int
    val toInt = udf[Int, String]( _.toInt)
    val dataFixed = data.withColumn("age", toInt(data("age"))).withColumn("hours_per_week", toInt(data("hours_per_week")))

    val rf = new RandomForestClassifier()
    val pipeline = new Pipeline().setStages(Array(rf))
    val cv = new CrossValidator().setNumFolds(10).setEstimator(pipeline).setEvaluator(new BinaryClassificationEvaluator)

    // this fails with error
    // java.lang.IllegalArgumentException: Field "features" does not exist.
    val cmModel = cv.fit(dataFixed)
  }
}
Thanks for the help!
For classification tasks in Spark, you have logistic regression, naïve Bayes, support vector machines (SVM), decision trees, and random forests at your disposal.
What Are DataFrames? In Spark, a DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
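For a concrete picture, here is a tiny illustrative sketch (hypothetical values, column names borrowed from the question) of building a DataFrame with named columns from a local collection:

// Tiny illustration: a DataFrame with named, typed columns built from a local Seq
val df = sqlContext.createDataFrame(Seq(
  (38, 40, "hs-grad"),
  (28, 40, "bachelors")
)).toDF("age", "hours_per_week", "education")

df.printSchema()  // shows the named columns and their inferred types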
As of Spark 1.4, you can use the Transformer org.apache.spark.ml.feature.VectorAssembler. Just provide the column names you want to be features.
val assembler = new VectorAssembler()
  .setInputCols(Array("col1", "col2", "col3"))
  .setOutputCol("features")
and add it to your pipeline.
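For instance, a sketch of how this could look with the columns from the question. The StringIndexer stages and the specific column names are my assumptions here, not something prescribed by VectorAssembler itself: StringIndexer turns salaryRange into a numeric, metadata-carrying label column (and the categorical predictors into numeric indices) so that RandomForestClassifier can consume them.

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Index the categorical label and the categorical predictors into numeric columns
val labelIndexer = new StringIndexer().setInputCol("salaryRange").setOutputCol("label")
val educationIndexer = new StringIndexer().setInputCol("education").setOutputCol("educationIndex")
val sexIndexer = new StringIndexer().setInputCol("sex").setOutputCol("sexIndex")

// Assemble all numeric predictors into a single "features" vector column
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "hours_per_week", "educationIndex", "sexIndex"))
  .setOutputCol("features")

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")

val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, educationIndexer, sexIndexer, assembler, rf))

val model = pipeline.fit(dataFixed)

With the assembler (and any needed indexers) placed before the classifier, the pipeline produces the "features" column the RandomForestClassifier looks for, and the IllegalArgumentException about the missing "features" field goes away.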