Writing SQL vs using Dataframe APIs in Spark SQL

I am new to Spark SQL. I am currently migrating my application's ingestion code, which ingests data into the stage, raw, and application layers in HDFS and performs CDC (change data capture). It is currently written as Hive queries and executed via Oozie. It needs to be migrated to a Spark application (current version 1.6). The other sections of the code will be migrated later.

In Spark SQL, I can create DataFrames directly from Hive tables and simply execute the queries as they are (like sqlContext.sql("my hive hql")). The other way would be to use the DataFrame API and rewrite the HQL that way.

What is the difference in these two approaches?

Is there any performance gain with using Dataframe APIs?

Some people suggested that there is an extra SQL layer that the Spark core engine has to go through when "SQL" queries are used directly, which may impact performance to some extent, but I didn't find any material substantiating that statement. I know the code would be much more compact with the DataFrame API, but when I have all my HQL queries handy, would it really be worth rewriting the complete code with the DataFrame API?

Thank You.

asked Aug 01 '17 06:08

PPPP

4 Answers

Question : What is the difference in these two approaches? Is there any performance gain with using Dataframe APIs?

Answer :

There is a comparative study done by Hortonworks. source...

The gist is that each approach is right depending on the situation or scenario; there is no hard and fast rule for deciding. Please go through the details below.

RDDs, DataFrames, and SparkSQL (in fact three approaches, not just two):

At its core, Spark operates on the concept of Resilient Distributed Datasets, or RDDs:

Resilient - if data in memory is lost, it can be recreated
Distributed - immutable distributed collection of objects in memory, partitioned across many data nodes in a cluster
Dataset - initial data can come from files, be created programmatically, from data in memory, or from another RDD

The DataFrames API is a data abstraction framework that organizes your data into named columns:

Creates a schema for the data
Conceptually equivalent to a table in a relational database
Can be constructed from many sources, including structured data files, tables in Hive, external databases, or existing RDDs
Provides a relational view of the data for easy SQL-like data manipulations and aggregations
Under the hood, it is an RDD of Rows
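As a rough illustration of the points above, here is a minimal PySpark-style sketch of two ways to obtain a DataFrame and of getting back to the underlying RDD of Rows. The helper names and the table name passed in are assumptions made for this example, not part of the original answer; `sql_context` is expected to be a `SQLContext` or `HiveContext`.

```python
def hive_table_df(sql_context, table_name):
    """Return a DataFrame backed by an existing Hive table.

    `table_name` is whatever table already exists in the metastore
    (hypothetical here).
    """
    return sql_context.table(table_name)


def rows_to_df(sql_context, rows, column_names):
    """Build a DataFrame from local rows (or an existing RDD) plus column names.

    Passing a list of column names as the schema lets Spark infer the types.
    """
    return sql_context.createDataFrame(rows, column_names)


def underlying_rows(df):
    """Return the RDD of Row objects that sits behind a DataFrame."""
    return df.rdd
```

In Spark 1.6, reading Hive tables generally means passing a `HiveContext` created from the application's `SparkContext`, for example `hive_table_df(sqlContext, "orders")` where `orders` is a placeholder table name.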

SparkSQL is a Spark module for structured data processing. You can interact with SparkSQL through:

SQL
DataFrames API
Datasets API
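To make the SQL and DataFrames API options concrete, the sketch below expresses one lookup both ways. The `orders` table and `order_id` column are hypothetical names chosen for illustration. Both forms are planned by the same Catalyst optimizer, so for equivalent queries the physical plans are usually the same; calling `explain(True)` on each result is one way to check this for a specific query.

```python
def lookup_with_sql(sql_context, order_id):
    """Run the lookup as a SQL string (table and column names are assumed)."""
    query = "SELECT * FROM orders WHERE order_id = {0}".format(int(order_id))
    return sql_context.sql(query)


def lookup_with_dataframe_api(sql_context, order_id):
    """Run the same lookup through the DataFrame API."""
    orders = sql_context.table("orders")
    # orders["order_id"] is a Column; comparing it builds a filter expression.
    return orders.filter(orders["order_id"] == int(order_id))
```

With a real context, `lookup_with_sql(sqlContext, 12345).explain(True)` and `lookup_with_dataframe_api(sqlContext, 12345).explain(True)` print the parsed, optimized, and physical plans so they can be compared side by side.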

Test results:

RDDs outperformed DataFrames and SparkSQL for certain types of data processing
DataFrames and SparkSQL performed about the same, although SparkSQL had a slight advantage for analysis involving aggregation and sorting
Syntactically speaking, DataFrames and SparkSQL are much more intuitive than RDDs
Took the best out of 3 for each test
Times were consistent and not much variation between tests
Jobs were run individually with no other jobs running
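The "best out of 3" methodology can be approximated with a small timing helper like the one below. This is only a sketch of the idea, not the harness used in the study; `best_of` is a name invented here. Because Spark evaluates lazily, the callable passed in has to trigger an action (such as `count()` or `collect()`), otherwise only plan construction is timed.

```python
from timeit import default_timer


def best_of(action, runs=3):
    """Call `action` `runs` times; return (best_seconds, all_timings)."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    timings = []
    for _ in range(runs):
        start = default_timer()
        action()  # must force execution, e.g. lambda: df.count()
        timings.append(default_timer() - start)
    return min(timings), timings
```

For example, `best_of(lambda: sqlContext.sql("my hive hql").count())` would time the SQL version, and the same call wrapped around the DataFrame version gives a comparable number.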

Test queries:

Random lookup against 1 order ID from 9 million unique order IDs
Group all the different products with their total counts and sort descending by product name

[Figure: test results chart; image not available]
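For reference, the grouping and sorting test could look roughly like this in each of the three approaches. The `orders` table, the `product` column, and the `(order_id, product)` tuple layout of the RDD are assumptions for illustration; the actual schema used in the study is not given here.

```python
PRODUCT_COUNTS_SQL = """
SELECT product, COUNT(*) AS total
FROM orders
GROUP BY product
ORDER BY product DESC
"""


def product_counts_sql(sql_context):
    """SparkSQL version: run the query string as is."""
    return sql_context.sql(PRODUCT_COUNTS_SQL)


def product_counts_dataframe(sql_context):
    """DataFrame API version; the count column is named `count` here."""
    orders = sql_context.table("orders")
    return orders.groupBy("product").count().orderBy("product", ascending=False)


def product_counts_rdd(orders_rdd):
    """RDD version; expects an RDD of (order_id, product) tuples."""
    return (orders_rdd
            .map(lambda order: (order[1], 1))    # (product, 1)
            .reduceByKey(lambda a, b: a + b)     # total per product
            .sortByKey(ascending=False))         # descending by product name
```

The RDD version spells out each step by hand, which is the extra verbosity the results above refer to, while the SQL and DataFrame versions leave the execution strategy to the optimizer.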

answered Oct 18 '22 20:10

answered Oct 18 '22 21:10

G.S.Tomar

Tags: apache-spark, apache-spark-sql, hive, hdfs