Disadvantages of Spark Dataset over DataFrame

Tags:

apache-spark

I know the advantages of Dataset (type safety, etc.), but I can't find any documentation on the limitations of Spark Datasets.

Are there specific scenarios where a Spark Dataset is not recommended and it is better to use a DataFrame?

Currently all our data engineering flows use Spark (Scala) DataFrames. We would like to use Datasets for all our new flows, so knowing the limitations and disadvantages of Dataset would help us.

EDIT: This is not the same as "Spark 2.0 Dataset vs DataFrame", which explains some operations on DataFrames/Datasets, or the other questions, most of which explain the differences between RDD, DataFrame and Dataset and how they evolved. This question is specifically about when NOT to use Datasets.

330

asked Mar 20 '19 18:03

Ranga Vure

2 Answers

There are a few scenarios where I find that a DataFrame (or Dataset[Row]) is more useful than a typed Dataset.

One example is consuming data without a fixed schema, like JSON files containing records of different types with different fields. With a DataFrame I can easily "select" the fields I need without knowing the whole schema, or even use a runtime configuration to specify which fields I'll access.
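A minimal sketch of that pattern, assuming a local SparkSession. The sample records and the `wantedFields` list are made up for illustration; in practice the list might come from a config file.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("untyped-select")
  .getOrCreate()
import spark.implicits._

// Two record types with different fields, as JSON lines (made-up sample data).
val rawJson = Seq(
  """{"type":"click","userId":1,"url":"/home"}""",
  """{"type":"purchase","userId":2,"amount":9.99}"""
).toDS()

// Spark infers one merged schema; fields absent from a record come back as null.
val events: DataFrame = spark.read.json(rawJson)

// Hypothetical runtime configuration: the field names are only known at run time.
val wantedFields = Seq("type", "userId")
val selected: DataFrame = events.select(wantedFields.map(col): _*)

selected.show()
```

Doing the same with a typed Dataset would need a case class covering every field up front, which is exactly what you don't have in this situation.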

Another consideration is that Spark can optimize built-in Spark SQL operations and aggregations better than UDAFs and custom lambdas. For example, taking the square root of a column is a built-in function in Spark SQL (df.withColumn("rootX", sqrt("X"))), whereas doing it in a lambda (ds.map(X => Math.sqrt(X))) is less efficient because Spark cannot look inside the lambda to optimize it.
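To see the difference yourself, you can compare the physical plans. This is a rough sketch assuming a local SparkSession; the `Reading` case class and the sample values are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sqrt

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("builtin-vs-lambda")
  .getOrCreate()
import spark.implicits._

// Hypothetical record type with a single numeric column named X.
case class Reading(X: Double)
val ds = Seq(Reading(4.0), Reading(9.0)).toDS()

// Built-in function: the optimizer sees a square-root expression on column X.
val viaBuiltin = ds.withColumn("rootX", sqrt($"X"))

// Lambda: opaque to the optimizer; each row is first deserialized into a Reading.
val viaLambda = ds.map(r => math.sqrt(r.X))

// Print both physical plans for comparison.
viaBuiltin.explain()
viaLambda.explain()
```

The plan for the lambda version should show extra deserialize/serialize steps around the map, which is where the overhead comes from.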

There are also many untyped DataFrame functions (such as the statistical functions) that are implemented for DataFrames but have no typed Dataset equivalent. You'll often find that even if you start out with a Dataset, you're left with a DataFrame by the time your aggregations are done, because those functions work by creating new columns and so change the schema of your Dataset.
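Here is a small sketch of how that plays out, assuming a local SparkSession. The `Sale` and `RegionAvg` case classes are hypothetical names used only for this example.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("agg-returns-dataframe")
  .getOrCreate()
import spark.implicits._

// Hypothetical input and output record types.
case class Sale(region: String, amount: Double)
case class RegionAvg(region: String, avgAmount: Double)

val sales: Dataset[Sale] = Seq(
  Sale("east", 10.0), Sale("east", 30.0), Sale("west", 5.0)
).toDS()

// groupBy/agg is untyped: we start with Dataset[Sale] and end with a DataFrame.
val untyped: DataFrame =
  sales.groupBy($"region").agg(avg($"amount").as("avgAmount"))

// Getting back to a typed Dataset needs a second case class matching the new schema.
val typed: Dataset[RegionAvg] = untyped.as[RegionAvg]

// The statistical helpers hang off .stat and work on the untyped view of the data.
val quantiles: Array[Double] =
  sales.stat.approxQuantile("amount", Array(0.5), 0.0)
```

You can stay typed with groupByKey and a custom Aggregator, but that brings you back to the lambda performance point above.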

In general, I don't think you should migrate working DataFrame code to typed Datasets unless you have a good reason to. Many of the Dataset features are still flagged as "experimental" as of Spark 2.4.0, and as mentioned above, not all DataFrame features have Dataset equivalents.

125

answered Oct 12 '22 12:10

Matt

Limitations of Spark Datasets:

1. Datasets used to be less performant (not sure if that's been fixed yet).
2. You need to define a new case class whenever you change the Dataset schema, which is cumbersome.
3. Datasets don't offer as much type safety as you might expect. We can pass the reverse function a date object and it'll return a garbage response rather than erroring out.

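import java.sql.Date
import org.apache.spark.sql.functions.reverse
import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

case class Birth(hospitalName: String, birthDate: Date)

val birthsDS = Seq(
  Birth("westchester", Date.valueOf("2014-01-15"))
).toDS()
// reverse expects a string; Spark casts the date to a string instead of failing
birthsDS.withColumn("meaningless", reverse($"birthDate")).show()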

+------------+----------+-----------+
|hospitalName| birthDate|meaningless|
+------------+----------+-----------+
| westchester|2014-01-15| 51-10-4102|
+------------+----------+-----------+
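The second point (a new case class per schema change) can be sketched with the same data. This assumes a local SparkSession; `BirthWithYear` is a hypothetical name I made up for the example.

```scala
import java.sql.Date
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.year

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("case-class-per-schema")
  .getOrCreate()
import spark.implicits._

case class Birth(hospitalName: String, birthDate: Date)
// Needed only because one column was added (hypothetical name).
case class BirthWithYear(hospitalName: String, birthDate: Date, birthYear: Int)

val birthsDS: Dataset[Birth] = Seq(
  Birth("westchester", Date.valueOf("2014-01-15"))
).toDS()

// withColumn returns a DataFrame; .as[...] re-attaches a type, but the
// column names and types are only checked when Spark analyzes the query.
val withYear: Dataset[BirthWithYear] =
  birthsDS.withColumn("birthYear", year($"birthDate")).as[BirthWithYear]

// A typed map is checked by the compiler instead: something like
// b.birthDate.reverse would not compile, since java.sql.Date has no reverse.
val years: Dataset[Int] = birthsDS.map(b => b.birthDate.toLocalDate.getYear)
```

So the compile-time safety only applies while you stay in typed operations like map; as soon as you use column expressions you are back to runtime checks.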

answered Oct 12 '22 11:10

Powers
