 

What is meant by type safe in spark Dataset ?

I am trying to understand the difference between Dataset and DataFrame and found the following helpful link, but I am not able to understand what is meant by type safe.

Difference between DataFrame (in Spark 2.0 i.e DataSet[Row] ) and RDD in Spark

hari haran asked Jan 29 '23


2 Answers

RDDs and Datasets being type safe means that the compiler knows the columns and each column's data type, whether it is Long, String, etc.

But with a DataFrame, every time you call an action, collect() for instance, the result comes back as an Array of Rows, not as Long or String values. Columns in a DataFrame do have their own types such as Integer or String, but those types are not exposed to you; from your side, every field is Any. To convert a Row's fields into their proper types you have to use the .asInstanceOf method.

E.g., in Scala:

scala> :type df.collect()
Array[org.apache.spark.sql.Row]


df.collect().map { row =>
  val str = row(0).asInstanceOf[String] // each field comes back as Any; cast to its real type
  val num = row(1).asInstanceOf[Long]
  (str, num)
}
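To make "not exposed to you" concrete outside a Spark shell, here is a minimal plain-Scala sketch (no Spark required): `Seq[Any]` is a hypothetical stand-in for a `Row`, and the case class plays the role of a `Dataset` element type.

```scala
// Untyped access, DataFrame-style: every field comes back as Any,
// so the correctness of each cast is only checked at runtime.
val row: Seq[Any] = Seq("Alice", 42L)
val str = row(0).asInstanceOf[String]
val num = row(1).asInstanceOf[Long]

// Typed access, Dataset-style: the compiler knows each field's type.
case class Person(name: String, count: Long)
val p = Person("Alice", 42L)
val typedName: String = p.name // checked at compile time
// val bad: Int = p.name       // would not compile: type mismatch

println(str == typedName) // prints "true"
```

The same data is in both; the difference is whether a wrong type is caught by the compiler or by a `ClassCastException` at runtime.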
Saman answered Mar 11 '23


For people who love examples, here is one:

  1. Create sample employee data:

    case class Employ(name: String, age: Int, id: Int, department: String)

    val empData = Seq(Employ("A", 24, 132, "HR"), Employ("B", 26, 131, "Engineering"), Employ("C", 25, 135, "Data Science"))

  2. Create a DataFrame and a Dataset from it:

    import spark.implicits._ // needed for toDF() / toDS()
    val empRDD = spark.sparkContext.makeRDD(empData)
    val empDataFrame = empRDD.toDF()
    val empDataset = empRDD.toDS()


Let's perform an operation:

Dataset

val empDatasetResult = empDataset.filter(employ => employ.age > 24)

Dataframe

    val empDataFrameResult = empDataFrame.filter(employ => employ.age > 24)

// throws error "value age is not a member of org.apache.spark.sql.Row"

In the case of a DataFrame, the lambda receives a Row object, not an Employ, so you can't directly write employ.age > 24; instead you have to look the column up by name:

val empDataFrameResult = empDataFrame.filter(employ => employ.getAs[Int]("age") > 24)
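The real difference is when errors surface: `getAs[Int]("age")` compiles even if the column name is misspelled and only fails at runtime, while a typo in a Dataset field name fails at compile time. A rough plain-Scala analogy, with `Map[String, Any]` as a hypothetical stand-in for a `Row`:

```scala
case class Employ(name: String, age: Int)

// DataFrame-style: the column is looked up by a string at runtime, so a typo
// like rowLike("agee") still compiles and only throws NoSuchElementException later.
val rowLike: Map[String, Any] = Map("name" -> "A", "age" -> 24)

// Dataset-style: the field is resolved by the compiler, so a typo like e.agee
// is rejected at compile time with "value agee is not a member of Employ".
val e = Employ("A", 24)

val sameAnswer = (rowLike("age").asInstanceOf[Int] > 23) == (e.age > 23)
println(sameAnswer) // prints "true"
```

Both styles compute the same answer here; the Dataset style simply moves the failure mode from runtime to compile time.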

Why is the Dataset so special, then?

  • Less development labour: you don't need to cast fields or know each column's runtime type when performing an operation.
  • Mistakes such as a wrong field name are caught at compile time instead of failing mid-job.

Who likes boilerplate casts? Let's avoid them by using Datasets.

Thanks to: https://blog.knoldus.com/spark-type-safety-in-dataset-vs-dataframe/

Neethu Lalitha answered Mar 11 '23