I understand that one can convert an RDD to a Dataset using rdd.toDS. However, rdd.toDF also exists. Is there really any benefit of one over the other?
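For reference, here is what the two conversions look like side by side (a minimal sketch; the Person case class and its fields are made up for illustration):

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder()
  .appName("toDS-vs-toDF")
  .master("local[*]") // local master just so the sketch runs standalone
  .getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))

val ds = rdd.toDS() // Dataset[Person]: typed, columns checked at compile time
val df = rdd.toDF() // DataFrame (i.e. Dataset[Row]): untyped, columns addressed by name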
After playing with the Dataset API for a day, I find that almost any operation takes me back out to a DataFrame (for instance, withColumn). After converting an RDD with toDS, I often need yet another conversion back to a Dataset, because some operation has dropped me into a DataFrame again.
Am I using the API wrongly? Should I stick with .toDF and only convert to a Dataset at the end of a chain of operations, or is there a benefit to using toDS earlier?
Here is a small, concrete example:
spark
  .read
  .schema(...)
  .json(...)
  .rdd
  .zipWithUniqueId
  .map[(Integer, String, Double)] { case (row, id) => ... }
  .toDS // now with the Dataset API (should I use toDF here?)
  .withColumnRenamed("_1", "id")   // now back to a DataFrame, not type safe :(
  .withColumnRenamed("_2", "text")
  .withColumnRenamed("_3", "overall")
  .as[ParsedReview] // back to a Dataset
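For comparison, here is a sketch of how the renames could be avoided entirely by mapping straight to the case class; the ParsedReview fields shown are assumptions based on the renamed columns above:

case class ParsedReview(id: Long, text: String, overall: Double) // assumed fields, matching the renames above

spark
  .read
  .schema(...)
  .json(...)
  .rdd
  .zipWithUniqueId // yields (row, id) pairs, with id: Long
  .map { case (row, id) => ParsedReview(id, ..., ...) } // build the case class directly from the row
  .toDS // Dataset[ParsedReview] from the start; no renames and no .as needed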
Michael Armbrust nicely explained the shift to Datasets and DataFrames, and the difference between the two. Basically, in Spark 2.x the Dataset and DataFrame APIs were converged into one, with a slight difference: "a DataFrame is just a Dataset of generic Row objects. When you don't know all the fields, a DataFrame is the answer."
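In Spark 2.x this is literally true in the source: DataFrame is nothing more than a type alias defined in the org.apache.spark.sql package object:

package object sql {
  type DataFrame = Dataset[Row]
}

So every DataFrame operation is really a Dataset[Row] operation; what the generic Row costs you is compile-time knowledge of the column names and types.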