 

Disadvantages of Spark Dataset over DataFrame

Tags:

apache-spark

I know the advantages of Dataset (type safety etc.), but I can't find any documentation on the limitations of Spark Datasets.

Are there any specific scenarios where a Spark Dataset is not recommended and a DataFrame is the better choice?

Currently all our data engineering flows use Spark (Scala) DataFrames. We would like to use Datasets for all our new flows, so knowing the limitations/disadvantages of Datasets would help us.

EDIT: This is not similar to Spark 2.0 Dataset vs DataFrame, which explains some operations on DataFrames/Datasets, or to other questions, most of which explain the differences between RDD, DataFrame and Dataset and how they evolved. This question is targeted at knowing when NOT to use Datasets.

asked Mar 20 '19 by Ranga Vure



2 Answers

There are a few scenarios where I find that a Dataframe (or Dataset[Row]) is more useful than a typed dataset.

For example, when I'm consuming data without a fixed schema, like JSON files containing records of different types with different fields. Using a Dataframe I can easily "select" out the fields I need without needing to know the whole schema, or even use a runtime configuration to specify the fields I'll access.
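As a minimal sketch of that pattern (the path and field names here are hypothetical), reading mixed-schema JSON and selecting only the fields you need might look like:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder.appName("example").getOrCreate()

// Spark infers a merged schema across all records; fields absent from a
// given record simply come back as null.
val df = spark.read.json("events/*.json")

// Select only the fields we care about, without modelling the full schema
// as a case class. This list could equally come from runtime configuration.
val fields = Seq("id", "timestamp", "payload")
df.select(fields.map(col): _*).show()
```

Doing the same with a typed Dataset would require a case class covering every field you might encounter, defined at compile time.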

Another consideration is that Spark can better optimize the built-in Spark SQL operations and aggregations than UDAFs and custom lambdas. So if you want to get the square root of a value in a column, that's a built-in function (df.withColumn("rootX", sqrt("X"))) in Spark SQL but doing it in a lambda (ds.map(X => Math.sqrt(X))) would be less efficient since Spark can't optimize your lambda function as effectively.
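A short sketch contrasting the two approaches (assuming a Dataframe `df` with a numeric column `X`, and a typed Dataset `ds` with a field `x`):

```scala
import org.apache.spark.sql.functions.sqrt

// Built-in Spark SQL function: Catalyst sees `sqrt` as a known expression,
// so it can optimize and code-generate without deserializing rows.
val withRoot = df.withColumn("rootX", sqrt($"X"))

// Typed lambda: opaque to the optimizer, and each row is deserialized
// into a JVM object before the function runs.
val roots = ds.map(r => Math.sqrt(r.x))
```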

There are also many untyped Dataframe functions (like the statistical functions) that are implemented for Dataframes but not for typed Datasets. You'll often find that even if you start out with a Dataset, by the time you've finished your aggregations you're left with a Dataframe, because those functions work by creating new columns, modifying the schema of your dataset.
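For instance (illustrative only, column names assumed), the `stat` functions and column-creating aggregations hand back Dataframes, not typed Datasets:

```scala
import org.apache.spark.sql.functions.avg

// df.stat exposes Dataframe-only statistics such as correlation and crosstab.
val correlation: Double = df.stat.corr("X", "Y")

// Aggregations create new columns, so the result is a Dataframe
// (Dataset[Row]), even if `ds` started life as a typed Dataset.
val summary = ds.groupBy($"category").agg(avg($"x").as("avgX"))
```

To get back to a typed Dataset you would have to define a new case class matching the aggregated schema and call `.as[...]` on the result.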

In general I don't think you should migrate from working Dataframe code to typed Datasets unless you have a good reason to. Many of the Dataset features are still flagged as "experimental" as of Spark 2.4.0, and as mentioned above not all Dataframe features have Dataset equivalents.

answered Oct 12 '22 by Matt


Limitations of Spark Datasets:

  1. Datasets used to be less performant (not sure if that's been fixed yet)
  2. You need to define a new case class whenever you change the Dataset schema, which is cumbersome
  3. Datasets don't offer as much type safety as you might expect. For example, we can pass the reverse function a date column and it'll return a garbage result rather than erroring out at compile time:
import java.sql.Date
import org.apache.spark.sql.functions.reverse
import spark.implicits._

case class Birth(hospitalName: String, birthDate: Date)

val birthsDS = Seq(
  Birth("westchester", Date.valueOf("2014-01-15"))
).toDS()

birthsDS.withColumn("meaningless", reverse($"birthDate")).show()
+------------+----------+-----------+
|hospitalName| birthDate|meaningless|
+------------+----------+-----------+
| westchester|2014-01-15| 51-10-4102|
+------------+----------+-----------+
answered Oct 12 '22 by Powers