From DataFrame to RDD[LabeledPoint]

I am trying to implement a document classifier using Apache Spark MLlib and I am having some problems representing the data. My code is the following:

import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.mllib.regression.LabeledPoint

val sql = new SQLContext(sc)

// Load raw data from a TSV file
val raw = sc.textFile("data.tsv").map(_.split("\t").toSeq)

// Convert the RDD to a dataframe
val schema = StructType(List(StructField("class", StringType), StructField("content", StringType)))
val dataframe = sql.createDataFrame(raw.map(row => Row(row(0), row(1))), schema)

// Tokenize
val tokenizer = new Tokenizer().setInputCol("content").setOutputCol("tokens")
val tokenized = tokenizer.transform(dataframe)

// TF-IDF
val htf = new HashingTF().setInputCol("tokens").setOutputCol("rawFeatures").setNumFeatures(500)
val tf = htf.transform(tokenized)
tf.cache()
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(tf)
val tfidf = idfModel.transform(tf)

// Create labeled points
val labeled = tfidf.map(row => LabeledPoint(row.getDouble(0), row.get(4)))

I need to use dataframes to generate the tokens and create the TF-IDF features. The problem appears when I try to convert this dataframe to an RDD[LabeledPoint]. I map over the dataframe rows, but the get method of Row returns Any rather than the type declared in the dataframe schema (Vector). Therefore, I cannot construct the RDD I need to train an ML model.

What is the best option to get an RDD[LabeledPoint] after calculating TF-IDF?

asked Jun 18 '15 21:06

Miguel

1 Answer

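Casting the object worked for me. Both row.get(i) and row(i) are typed as Any, so the value has to be cast to the MLlib vector type. Make sure Vector refers to org.apache.spark.mllib.linalg.Vector and not Scala's built-in collection. Note also that the "class" column is declared as StringType in your schema, so getDouble(0) would fail with a ClassCastException; the line below parses the string instead, assuming the labels are numeric strings such as "0" or "1".

Try:

// Create labeled points ("class" is a StringType column, so parse it to Double)
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
val labeled = tfidf.map(row => LabeledPoint(row.getString(0).toDouble, row(4).asInstanceOf[Vector]))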
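An alternative that avoids positional indexes and explicit casts, as a minimal sketch: select the two columns by name and pattern match on Row. This assumes Spark 1.x as in the question (where the ml.feature transformers produce org.apache.spark.mllib.linalg.Vector) and that the "class" column holds numeric strings. toLabeledPoints is a hypothetical helper name, not part of Spark.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row}

// Hypothetical helper: converts a DataFrame with a string "class" column and a
// vector "features" column into an RDD[LabeledPoint].
// Assumption: the vectors are org.apache.spark.mllib.linalg.Vector (Spark 1.x).
def toLabeledPoints(tfidf: DataFrame): RDD[LabeledPoint] =
  tfidf.select("class", "features").rdd.map {
    // The type patterns do the checking that asInstanceOf would otherwise do.
    // "class" is StringType in the question's schema, so parse it to Double.
    case Row(label: String, features: Vector) =>
      LabeledPoint(label.toDouble, features)
  }
```

With the question's variables this would be called as val labeled = toLabeledPoints(tfidf). A row whose columns do not have the expected types raises a MatchError, which tends to be easier to diagnose than a ClassCastException from a positional cast.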

answered Oct 05 '22 13:10

zzztimbo

Tags: scala, apache-spark, apache-spark-mllib