 

Spark's toDS vs toDF

I understand that one can convert an RDD to a Dataset using rdd.toDS. However, there also exists rdd.toDF. Is there really any benefit of one over the other?

After playing with the Dataset API for a day, I found that almost any operation takes me back to a DataFrame (for instance withColumn). After converting an RDD with toDS, I often find that another conversion to a Dataset is needed, because something has brought me back to a DataFrame again.

Am I using the API wrongly? Should I stick with .toDF and only convert to a Dataset at the end of a chain of operations? Or is there a benefit to using toDS earlier?

Here is a small concrete example:

spark
  .read
  .schema(...)
  .json(...)
  .rdd
  .zipWithUniqueId
  .map[(Integer, String, Double)] { case (row, id) => ... }
  .toDS // now with the Dataset API (should I use toDF here?)
  .withColumnRenamed("_1", "id") // now back to a DataFrame, not type safe :(
  .withColumnRenamed("_2", "text")
  .withColumnRenamed("_3", "overall")
  .as[ParsedReview] // back to a Dataset
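
For comparison, here is a sketch of a version that stays in the typed API the whole way, by mapping straight to the case class instead of a tuple, so the withColumnRenamed round trip through a DataFrame disappears. The fields of ParsedReview, the JSON column names read via getAs, and the input path are all assumptions, since the schema in the original snippet is elided:

import org.apache.spark.sql.{Dataset, SparkSession}

// Assumed shape of ParsedReview; the real fields may differ.
case class ParsedReview(id: Long, text: String, overall: Double)

object TypedLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("typed-load").master("local[*]").getOrCreate()
    import spark.implicits._

    val reviews: Dataset[ParsedReview] =
      spark.read
        .json("reviews.json") // hypothetical input path
        .rdd
        .zipWithUniqueId
        .map { case (row, id) =>
          // Build the case class directly instead of a tuple, so toDS
          // yields Dataset[ParsedReview] with named, typed fields.
          ParsedReview(id, row.getAs[String]("text"), row.getAs[Double]("overall"))
        }
        .toDS // no withColumnRenamed / .as round trip needed

    reviews.show()
    spark.stop()
  }
}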
Andrzej Wąsowski asked Apr 12 '17



1 Answer

Michael Armbrust nicely explained the shift to Datasets and DataFrames and the difference between the two. Basically, in Spark 2.x the Dataset and DataFrame APIs were converged into one, with a slight difference: "A DataFrame is just a Dataset of generic Row objects. When you don't know all the fields, a DataFrame is the answer." https://www.youtube.com/watch?v=1a4pgYzeFwE
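
In Spark 2.x the convergence is literal: DataFrame is defined as the type alias type DataFrame = Dataset[Row] in the org.apache.spark.sql package object. Below is a minimal sketch of the consequence the question ran into, namely that untyped column operations return a DataFrame which you re-type with .as; the Person case class and names here are illustrative only:

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Illustrative case class, not from the question.
case class Person(name: String, age: Long)

object DfVsDs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("df-vs-ds").master("local[*]").getOrCreate()
    import spark.implicits._

    val ds: Dataset[Person] = Seq(Person("Ann", 30)).toDS // typed: Dataset[Person]

    // Any untyped column operation returns a DataFrame, i.e. Dataset[Row].
    val df: DataFrame = ds.withColumnRenamed("name", "n")

    // .as re-attaches the type once the columns match the case class again.
    val back: Dataset[Person] = df.withColumnRenamed("n", "name").as[Person]

    back.show()
    spark.stop()
  }
}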

Sergio Alyoshkin answered Sep 20 '22