 

Spark's toDS vs toDF

I understand that one can convert an RDD to a Dataset using rdd.toDS. However, there also exists rdd.toDF. Is there really any benefit of one over the other?

After playing with the Dataset API for a day, I found that almost any operation takes me back to a DataFrame (for instance withColumn). After converting an RDD with toDS, I often find that another conversion to a Dataset is needed, because something has brought me back to a DataFrame again.

Am I using the API wrongly? Should I stick with .toDF and only convert to a Dataset at the end of a chain of operations? Or is there a benefit to using toDS earlier?

Here is a small concrete example:

spark
  .read
  .schema(...)
  .json(...)
  .rdd
  .zipWithUniqueId
  .map[(Integer, String, Double)] { case (row, id) => ... }
  .toDS // now with the Dataset API (should I use toDF here?)
  .withColumnRenamed("_1", "id") // now back to a DataFrame, not type safe :(
  .withColumnRenamed("_2", "text")
  .withColumnRenamed("_3", "overall")
  .as[ParsedReview] // back to a Dataset
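
For comparison, here is a sketch of a version that stays in the typed API the whole way, by mapping straight to the case class instead of a tuple, so the withColumnRenamed round trip through a DataFrame disappears. The fields of ParsedReview, the JSON column names read via getAs, and the input path are all assumptions, since the schema in the original snippet is elided:

import org.apache.spark.sql.{Dataset, SparkSession}

// Assumed shape of ParsedReview; the real fields may differ.
case class ParsedReview(id: Long, text: String, overall: Double)

object TypedLoad {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("typed-load").master("local[*]").getOrCreate()
    import spark.implicits._

    val reviews: Dataset[ParsedReview] =
      spark.read
        .json("reviews.json") // hypothetical input path
        .rdd
        .zipWithUniqueId
        .map { case (row, id) =>
          // Build the case class directly instead of a tuple, so toDS
          // yields Dataset[ParsedReview] with named, typed fields.
          ParsedReview(id, row.getAs[String]("text"), row.getAs[Double]("overall"))
        }
        .toDS // no withColumnRenamed / .as round trip needed

    reviews.show()
    spark.stop()
  }
}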
Andrzej Wąsowski asked Apr 12 '17



1 Answer

Michael Armbrust nicely explained the shift to Datasets and DataFrames and the difference between the two. Basically, in Spark 2.x the Dataset and DataFrame APIs were converged into one, with a slight difference: "A DataFrame is just a Dataset of generic Row objects. When you don't know all the fields, a DataFrame is the answer." https://www.youtube.com/watch?v=1a4pgYzeFwE
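
In Spark 2.x the convergence is literal: DataFrame is defined as the type alias type DataFrame = Dataset[Row] in the org.apache.spark.sql package object. Below is a minimal sketch of the consequence the question ran into, namely that untyped column operations return a DataFrame which you re-type with .as; the Person case class and names here are illustrative only:

import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Illustrative case class, not from the question.
case class Person(name: String, age: Long)

object DfVsDs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("df-vs-ds").master("local[*]").getOrCreate()
    import spark.implicits._

    val ds: Dataset[Person] = Seq(Person("Ann", 30)).toDS // typed: Dataset[Person]

    // Any untyped column operation returns a DataFrame, i.e. Dataset[Row].
    val df: DataFrame = ds.withColumnRenamed("name", "n")

    // .as re-attaches the type once the columns match the case class again.
    val back: Dataset[Person] = df.withColumnRenamed("n", "name").as[Person]

    back.show()
    spark.stop()
  }
}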

Sergio Alyoshkin answered Sep 20 '22