How to create a Spark Dataset from an RDD

Tags:

scala

dataset

apache-spark

apache-spark-dataset

I have an RDD[LabeledPoint] intended to be used within a machine learning pipeline. How do we convert that RDD to a Dataset? Note that the newer spark.ml APIs require inputs in the Dataset format.

558

asked May 29 '16 18:05

WestCoastProjects

1 Answer

Here is an answer that goes through an extra step, the DataFrame. We use the SQLContext to create a DataFrame and then convert it to a Dataset of the desired object type, in this case LabeledPoint:

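import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._  // supplies the implicit Encoder that .as[LabeledPoint] needs
val pointsTrainDf = sqlContext.createDataFrame(training)
val pointsTrainDs = pointsTrainDf.as[LabeledPoint]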

Update: Ever heard of a SparkSession? (Neither had I until now.)

Apparently SparkSession is the Preferred Way (TM) from Spark 2.0.0 onward. Here is the updated code for the new (Spark) world order:

Spark 2.0.0+ approaches

Notice that both approaches below (credit to @zero323 for the simpler one) give an important saving over the SQLContext approach: it is no longer necessary to create a DataFrame first.

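val sparkSession = SparkSession.builder().getOrCreate()
import sparkSession.implicits._  // supplies the implicit Encoder that createDataset needs
val pointsTrainDs = sparkSession.createDataset(training)
// fit is the public spark.ml entry point (train is protected). In Spark 2.0+ it expects
// features as org.apache.spark.ml.linalg vectors (ml.feature.LabeledPoint), not the older mllib types.
val model = new LogisticRegression()
   .fit(pointsTrainDs)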

Second way for Spark 2.0.0+ (credit to @zero323)

val spark: org.apache.spark.sql.SparkSession = ???  // placeholder: supply your own session, e.g. SparkSession.builder().getOrCreate()
import spark.implicits._

val trainDs = training.toDS()
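
To tie the two Spark 2.0.0+ variants together, here is a minimal end-to-end sketch in spark-shell style. It is not part of the original answer. The local master, the app name, the toy data values and the names viaCreate and viaToDS are all assumptions made for illustration, and it assumes the spark.ml LabeledPoint (org.apache.spark.ml.feature) rather than the older mllib one:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, SparkSession}

// Local session for illustration only; app name and master are assumptions.
val spark = SparkSession.builder()
  .appName("rdd-to-dataset-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._ // brings the implicit Encoder for LabeledPoint into scope

// Toy stand-in for the question's RDD[LabeledPoint] (made-up values).
val training: RDD[LabeledPoint] = spark.sparkContext.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.0, 1.1)),
  LabeledPoint(0.0, Vectors.dense(2.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(0.1, 1.3)),
  LabeledPoint(0.0, Vectors.dense(2.2, 0.9))
))

// Both calls produce a Dataset[LabeledPoint] with columns "label" and "features".
val viaCreate: Dataset[LabeledPoint] = spark.createDataset(training)
val viaToDS: Dataset[LabeledPoint] = training.toDS()

// fit() is the public entry point of spark.ml estimators.
val model = new LogisticRegression().setMaxIter(10).fit(viaToDS)
```

Either Dataset can be passed to fit; toDS() is just the shorter spelling once spark.implicits._ is imported.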

Traditional Spark 1.X and earlier approach

val sqlContext = new SQLContext(sc)  // Note this is *deprecated* in 2.0.0
import sqlContext.implicits._
val training = splits(0).cache()
val test = splits(1)
val trainDs = training.toDS()

See also: "How to store custom objects in Dataset?" by the esteemed @zero323.

137

answered Oct 03 '22 13:10

WestCoastProjects
