When to use Spark DataFrame/Dataset API and when to use plain RDD?

The Spark SQL DataFrame/Dataset execution engine has several extremely efficient time and space optimizations (e.g. InternalRow and expression codegen). According to much of the documentation, it seems to be a better option than RDDs for most distributed algorithms.

However, I did some source-code research and am still not convinced. I have no doubt that InternalRow is much more compact and can save a large amount of memory. But execution of algorithms may not be any faster, except for predefined expressions. Namely, the source code of org.apache.spark.sql.catalyst.expressions.ScalaUDF indicates that every user-defined function does three things:

  1. convert the Catalyst type (used in InternalRow) to the Scala type (used in GenericRow),
  2. apply the function,
  3. convert the result back from the Scala type to the Catalyst type.

Apparently this is even slower than just applying the function directly on an RDD without any conversion. Can anyone confirm or deny my speculation with some real-case profiling and code analysis?
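To make the comparison concrete, here is a minimal sketch (data, column names, and the sample function are hypothetical) of the two call paths being discussed: the same +1 function applied through a DataFrame UDF, which goes through the ScalaUDF conversion layer, versus applied directly on an RDD of JVM objects:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object UdfVsRdd {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("udf-vs-rdd").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("b", 2), ("c", 3)).toDF("key", "value")

    // DataFrame path: the function is wrapped in ScalaUDF, so for each row Spark
    // converts Catalyst types (InternalRow) to Scala types, applies the function,
    // then converts the result back to a Catalyst type.
    val plusOne = udf((v: Int) => v + 1)
    val viaUdf = df.withColumn("value1", plusOne($"value"))

    // RDD path: the same function runs directly on JVM objects, with no
    // Catalyst <-> Scala conversion, but also without Catalyst/Tungsten optimizations.
    val viaRdd = df.rdd.map(r => (r.getString(0), r.getInt(1) + 1))

    viaUdf.show()
    viaRdd.collect().foreach(println)
    spark.stop()
  }
}
```

This sketch only illustrates the two code paths; it is not a benchmark and makes no claim about which is faster in practice.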

Thank you so much for any suggestion or insight.

asked May 30 '16 by tribbloid

People also ask

When should I use RDD or data frame?

Usage. RDD – use RDDs when you want low-level transformations and actions, or when your data is unstructured, such as media streams or streams of text. DataFrame – use DataFrames when you need a high level of abstraction over structured or semi-structured data.

Which is better: RDD, DataFrame, or Dataset?

While RDDs offer low-level control over data, the Dataset and DataFrame APIs bring structure and high-level abstractions. Keep in mind that converting an RDD to a Dataset or DataFrame is easy.
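As a quick illustration of how easy those conversions are, here is a minimal sketch (the Person case class and sample data are made up for the example):

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type for the example.
case class Person(name: String, age: Int)

object Conversions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("conversions").master("local[*]").getOrCreate()
    import spark.implicits._

    val rdd = spark.sparkContext.parallelize(Seq(Person("Ann", 30), Person("Bob", 25)))

    val df = rdd.toDF()       // RDD -> DataFrame (schema inferred from the case class)
    val ds = rdd.toDS()       // RDD -> Dataset[Person]
    val backToRdd = ds.rdd    // Dataset -> RDD, if low-level control is needed

    df.printSchema()
    ds.filter(_.age > 26).show()
    println(backToRdd.count())
    spark.stop()
  }
}
```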

Which one is better, and when should you use RDDs, DataFrames, and Datasets?

RDD – the RDD API is slower for simple grouping and aggregation operations. DataFrame – the DataFrame API is very easy to use and is faster for exploratory analysis and for creating aggregated statistics on large data sets. Dataset – the Dataset API is also fast for performing aggregations over large data sets.
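To show what that difference looks like in code, here is a minimal sketch (column names and data are made up) of the same group-and-average written with the RDD API and with the DataFrame API; the declarative DataFrame version is the one Catalyst can optimize:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object GroupingComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("grouping").master("local[*]").getOrCreate()
    import spark.implicits._

    val df = Seq(("dept1", 100.0), ("dept1", 200.0), ("dept2", 50.0)).toDF("dept", "salary")

    // RDD API: hand-written average via reduceByKey on (sum, count) pairs.
    val rddAvg = df.rdd
      .map(r => (r.getString(0), (r.getDouble(1), 1L)))
      .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
      .mapValues { case (sum, count) => sum / count }

    // DataFrame API: declarative aggregation that the optimizer can plan and codegen.
    val dfAvg = df.groupBy("dept").agg(avg("salary"))

    rddAvg.collect().foreach(println)
    dfAvg.show()
    spark.stop()
  }
}
```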

What is an RDD, and how do RDDs, DataFrames, and Datasets compare?

RDD is slower than both DataFrames and Datasets for simple operations such as grouping data. DataFrame provides an easy API for aggregation operations and performs aggregation faster than both RDDs and Datasets. Dataset is faster than RDDs but a bit slower than DataFrames.


1 Answer

From the Databricks blog article A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets:

When to use RDDs?

Consider these scenarios or common use cases for using RDDs when:

  • you want low-level transformation and actions and control on your dataset;
  • your data is unstructured, such as media streams or streams of text;
  • you want to manipulate your data with functional programming constructs rather than domain-specific expressions;
  • you don’t care about imposing a schema, such as columnar format, while processing or accessing data attributes by name or column;
  • and you can forgo some optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.
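A minimal sketch of the RDD-style scenario described in that list: unstructured lines of text, no schema imposed, processed with plain functional constructs (the input path is hypothetical):

```scala
import org.apache.spark.sql.SparkSession

object RddTextProcessing {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rdd-text").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.textFile("/path/to/logs.txt")   // hypothetical path

    // Low-level transformations with functional constructs, no schema imposed.
    val wordCounts = lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word.toLowerCase, 1))
      .reduceByKey(_ + _)

    wordCounts.take(10).foreach(println)
    spark.stop()
  }
}
```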

In High Performance Spark's Chapter 3, DataFrames, Datasets, and Spark SQL, you can see the kind of performance you can get with the DataFrame/Dataset API compared to RDDs:

[figure: performance comparison of the DataFrame/Dataset API versus RDDs]

In the Databricks article mentioned above, you can also see that DataFrames optimize space usage compared to RDDs:

[figure: memory usage of DataFrames compared to RDDs]

answered Sep 17 '22 by Paul Leclercq