 

What is meant by type safe in spark Dataset ?

I am trying to understand the difference between Dataset and DataFrame and found the following helpful link, but I am not able to understand what is meant by type safe.

Difference between DataFrame (in Spark 2.0 i.e DataSet[Row] ) and RDD in Spark

hari haran asked Jan 29 '23


2 Answers

RDDs and Datasets being type safe means that the compiler knows the columns and each column's data type, whether it is Long, String, etc.

But with a DataFrame, every time you call an action, collect() for instance, the result comes back as an Array of Rows, not as Long or String values. Columns in a DataFrame do have their own types such as Integer or String, but those types are not exposed to you; from your side, every field is Any. To convert a Row's fields into their proper types you have to use the .asInstanceOf method.

E.g., in Scala:

scala> :type df.collect()
Array[org.apache.spark.sql.Row]


df.collect().map { row =>
  val str = row(0).asInstanceOf[String] // each field comes back as Any; cast to its real type
  val num = row(1).asInstanceOf[Long]
  (str, num)
}
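To make "not exposed to you" concrete outside a Spark shell, here is a minimal plain-Scala sketch (no Spark required): `Seq[Any]` is a hypothetical stand-in for a `Row`, and the case class plays the role of a `Dataset` element type.

```scala
// Untyped access, DataFrame-style: every field comes back as Any,
// so the correctness of each cast is only checked at runtime.
val row: Seq[Any] = Seq("Alice", 42L)
val str = row(0).asInstanceOf[String]
val num = row(1).asInstanceOf[Long]

// Typed access, Dataset-style: the compiler knows each field's type.
case class Person(name: String, count: Long)
val p = Person("Alice", 42L)
val typedName: String = p.name // checked at compile time
// val bad: Int = p.name       // would not compile: type mismatch

println(str == typedName) // prints "true"
```

The same data is in both; the difference is whether a wrong type is caught by the compiler or by a `ClassCastException` at runtime.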
Saman answered Mar 11 '23


For people who love examples, here is one:

  1. Create sample employee data:

    case class Employ(name: String, age: Int, id: Int, department: String)

    val empData = Seq(Employ("A", 24, 132, "HR"), Employ("B", 26, 131, "Engineering"), Employ("C", 25, 135, "Data Science"))

  2. Create a DataFrame and a Dataset from it:

    import spark.implicits._ // needed for toDF() / toDS()
    val empRDD = spark.sparkContext.makeRDD(empData)
    val empDataFrame = empRDD.toDF()
    val empDataset = empRDD.toDS()


Let's perform an operation:

Dataset

val empDatasetResult = empDataset.filter(employ => employ.age > 24)

Dataframe

    val empDataFrameResult = empDataFrame.filter(employ => employ.age > 24)

// throws error "value age is not a member of org.apache.spark.sql.Row"

In the case of a DataFrame, the lambda receives a Row object, not an Employ, so you can't directly write employ.age > 24; instead you have to look the column up by name:

val empDataFrameResult = empDataFrame.filter(employ => employ.getAs[Int]("age") > 24)
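The real difference is when errors surface: `getAs[Int]("age")` compiles even if the column name is misspelled and only fails at runtime, while a typo in a Dataset field name fails at compile time. A rough plain-Scala analogy, with `Map[String, Any]` as a hypothetical stand-in for a `Row`:

```scala
case class Employ(name: String, age: Int)

// DataFrame-style: the column is looked up by a string at runtime, so a typo
// like rowLike("agee") still compiles and only throws NoSuchElementException later.
val rowLike: Map[String, Any] = Map("name" -> "A", "age" -> 24)

// Dataset-style: the field is resolved by the compiler, so a typo like e.agee
// is rejected at compile time with "value agee is not a member of Employ".
val e = Employ("A", 24)

val sameAnswer = (rowLike("age").asInstanceOf[Int] > 23) == (e.age > 23)
println(sameAnswer) // prints "true"
```

Both styles compute the same answer here; the Dataset style simply moves the failure mode from runtime to compile time.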

Why is the Dataset so special, then?

  • Less development labour: you don't need to cast fields or know each column's runtime type when performing an operation.
  • Mistakes such as a wrong field name are caught at compile time instead of failing mid-job.

Who likes boilerplate casts? Let's avoid them by using Datasets.

Thanks to: https://blog.knoldus.com/spark-type-safety-in-dataset-vs-dataframe/

Neethu Lalitha answered Mar 11 '23