 

Writing SQL vs using Dataframe APIs in Spark SQL

I am new to Spark SQL. I am currently migrating my application's ingestion code, which ingests data into the stage, raw, and application layers in HDFS and performs CDC (change data capture). It is currently written as Hive queries and executed via Oozie, and it needs to move into a Spark application (current version 1.6). The rest of the code will be migrated later.

In Spark SQL, I can create DataFrames directly from tables in Hive and simply execute the queries as they are (like sqlContext.sql("my hive hql")). The other way would be to use the DataFrame API and rewrite the HQL that way.

What is the difference in these two approaches?

Is there any performance gain from using the DataFrame API?

Some people have suggested that there is an extra SQL layer the Spark core engine has to go through when "SQL" queries are used directly, which may affect performance to some extent, but I didn't find any material substantiating that statement. I know the code would be much more compact with the DataFrame API, but when I have all my HQL queries at hand, would it really be worth rewriting everything with the DataFrame API?

Thank You.

PPPP asked Aug 01 '17 06:08




4 Answers

Question: What is the difference between these two approaches? Is there any performance gain from using the DataFrame API?


Answer:

There is a comparative study done by Hortonworks. source...

The gist is that each approach is right depending on the situation or scenario. There is no hard and fast rule to decide this. Please go through the points below.

RDDs, DataFrames, and SparkSQL (in fact three approaches, not just two):

At its core, Spark operates on the concept of Resilient Distributed Datasets, or RDDs:

  • Resilient - if data in memory is lost, it can be recreated
  • Distributed - immutable distributed collection of objects in memory partitioned across many data nodes in a cluster
  • Dataset - initial data can come from files, be created programmatically, come from data in memory, or come from another RDD
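The "Resilient" point above can be illustrated with a toy sketch. This is plain Python, not Spark: `ToyRDD`, `from_list`, and `lose_memory` are hypothetical names made up for this illustration. The idea it tries to show is that each dataset remembers how it was derived (its lineage), so a lost in-memory copy can be recomputed instead of restored from a backup.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's class)."""

    def __init__(self, compute):
        self._compute = compute   # function that recreates the data (the lineage)
        self._cache = None        # in-memory copy, which may be lost

    @staticmethod
    def from_list(items):
        data = list(items)
        return ToyRDD(lambda: list(data))

    def map(self, fn):
        # The child records its parent and the transformation, not the data.
        return ToyRDD(lambda: [fn(x) for x in self.collect()])

    def collect(self):
        # Recompute from lineage only when no in-memory copy exists.
        if self._cache is None:
            self._cache = self._compute()
        return list(self._cache)

    def lose_memory(self):
        """Simulate losing the in-memory copy."""
        self._cache = None


base = ToyRDD.from_list([1, 2, 3])
doubled = base.map(lambda x: x * 2)

before = doubled.collect()
doubled.lose_memory()
base.lose_memory()
after = doubled.collect()   # rebuilt by replaying the recorded transformations

print(before == after)
```

Real RDDs track lineage per partition across a cluster; the sketch collapses all of that into a single in-process list.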

DataFrames API is a data abstraction framework that organizes your data into named columns:

  • Create a schema for the data
  • Conceptually equivalent to a table in a relational database
  • Can be constructed from many sources including structured data files, tables in Hive, external databases, or existing RDDs
  • Provides a relational view of the data for easy SQL like data manipulations and aggregations
  • Under the hood, it is an RDD of Row objects

SparkSQL is a Spark module for structured data processing. You can interact with SparkSQL through:

  • SQL
  • DataFrames API
  • Datasets API

Test results:

  • RDDs outperformed DataFrames and SparkSQL for certain types of data processing
  • DataFrames and SparkSQL performed about the same, although with analysis involving aggregation and sorting SparkSQL had a slight advantage

  • Syntactically speaking, DataFrames and SparkSQL are much more intuitive than using RDDs

  • The best of 3 runs was taken for each test

  • Times were consistent, with not much variation between tests

  • Jobs were run individually with no other jobs running

Tests:

  • Random lookup against 1 order ID from 9 million unique order IDs
  • GROUP all the different products with their total COUNTS and SORT DESCENDING by product name
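To make the two test queries described above concrete, here is a small sketch of what each might look like when written once as SQL and once programmatically. It is not the benchmark code and it does not use Spark: Python's built-in sqlite3 stands in for the SQL engine, plain Python collections stand in for the programmatic API, and the `orders` table, its columns, and the sample rows are all assumptions made for illustration.

```python
import sqlite3
from collections import Counter

# Hypothetical sample rows: (order_id, product).
rows = [
    (1, "widget"),
    (2, "gadget"),
    (3, "widget"),
    (4, "sprocket"),
    (5, "gadget"),
    (6, "widget"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, product TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)

# Test 1: random lookup of a single order ID.
lookup_sql = conn.execute(
    "SELECT order_id, product FROM orders WHERE order_id = ?", (4,)
).fetchall()
lookup_api = [row for row in rows if row[0] == 4]

# Test 2: count per product, sorted descending by product name.
grouped_sql = conn.execute(
    "SELECT product, COUNT(*) FROM orders GROUP BY product ORDER BY product DESC"
).fetchall()
counts = Counter(product for _, product in rows)
grouped_api = sorted(counts.items(), key=lambda item: item[0], reverse=True)

# Both styles should describe the same result.
print(lookup_sql == lookup_api, grouped_sql == grouped_api)
conn.close()
```

On six rows this says nothing about performance; it only shows that both styles express the same query, which is why timing differences in the study come down to how each is executed.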


Ram Ghadiyaram answered Oct 18 '22 20:10


With Spark SQL string queries, you won't find out about a syntax error until runtime (which could be costly), whereas with the DataFrame API syntax errors can be caught at compile time.
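A small sketch of the idea, with caveats. Compile-time checking in the strict sense applies to compiled languages such as Scala or Java; here Python's own parser (`ast.parse`) is used as a stand-in for the compiler. The names `df`, `orders`, `amount`, and `order_id` are hypothetical, and nothing in this snippet touches Spark. It only shows that malformed API code is rejected before anything runs, while a malformed SQL string is just a valid string as far as the host language is concerned.

```python
import ast

# DataFrame-style code is ordinary host-language source, so the language's
# own parser rejects a malformed expression before anything runs.
api_source = 'result = df.filter(df.amount > 100).select("order_id"'  # missing ")"

# The same kind of mistake inside a SQL string is invisible to the host
# parser: the line below is a perfectly valid string assignment.
sql_source = 'query = "SELEC order_id FRM orders WHERE amount > 100"'


def parses_cleanly(source: str) -> bool:
    """Return True if `source` is syntactically valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


api_ok = parses_cleanly(api_source)   # the broken API call is caught here
sql_ok = parses_cleanly(sql_source)   # the broken SQL slips through
print(api_ok, sql_ok)
```

Note that this covers syntax only. With the untyped DataFrame API, a misspelled column name is still a string and is resolved when the query is analyzed at runtime, just as it is for a SQL query.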

Arun Sharma answered Oct 18 '22 21:10


A couple more additions: DataFrames use the Tungsten memory representation, and the Catalyst optimizer is used by SQL as well as by DataFrames. With the Dataset API, you have more control over the actual execution plan than with SparkSQL.

Blue Clouds answered Oct 18 '22 22:10


If a query is lengthy, writing and running it efficiently as a single SQL string becomes difficult. On the other hand, the DataFrame API, along with the Column API, helps the developer write compact code, which is ideal for ETL applications.

Also, all operations (e.g. greater than, less than, select, where, etc.) run through a DataFrame build an abstract syntax tree (AST), which is then passed to Catalyst for further optimization. (Source: Spark SQL whitepaper, Section 3.3)
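The point about both front ends feeding one optimizer can be sketched with a toy model. Everything below is hypothetical and written in plain Python: `Scan`, `Filter`, `Project`, `Frame`, and the tiny `sql` parser are invented for illustration and are not Spark's classes. The sketch assumes only what the answers state, namely that a SQL string and a chain of DataFrame calls both end up as a tree describing the query, and it shows that the two routes can produce an identical tree.

```python
import re
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    child: object
    column: str
    op: str
    value: int


@dataclass(frozen=True)
class Project:
    child: object
    columns: Tuple[str, ...]


class Frame:
    """Toy DataFrame: each method call wraps the current plan in a new node."""

    def __init__(self, plan):
        self.plan = plan

    @staticmethod
    def table(name: str) -> "Frame":
        return Frame(Scan(name))

    def where(self, column: str, op: str, value: int) -> "Frame":
        return Frame(Filter(self.plan, column, op, value))

    def select(self, *columns: str) -> "Frame":
        return Frame(Project(self.plan, tuple(columns)))


_QUERY = re.compile(
    r"SELECT\s+(?P<cols>.+?)\s+FROM\s+(?P<table>\w+)"
    r"\s+WHERE\s+(?P<col>\w+)\s*(?P<op>[<>=])\s*(?P<val>\d+)\s*$",
    re.IGNORECASE,
)


def sql(text: str) -> Frame:
    """Toy SQL front end: handles only SELECT ... FROM ... WHERE col op int."""
    match = _QUERY.match(text.strip())
    if match is None:
        raise ValueError(f"unsupported query: {text!r}")
    columns = tuple(c.strip() for c in match["cols"].split(","))
    plan = Filter(Scan(match["table"]), match["col"], match["op"], int(match["val"]))
    return Frame(Project(plan, columns))


from_sql = sql("SELECT order_id, product FROM orders WHERE amount > 100")
from_api = Frame.table("orders").where("amount", ">", 100).select("order_id", "product")

# Both front ends produce the same tree, so a shared optimizer sees no difference.
print(from_sql.plan == from_api.plan)
```

If the trees are identical, whatever optimizes and executes them has no way to tell which front end they came from, which fits the observation that the two approaches perform about the same.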

G.S.Tomar answered Oct 18 '22 21:10