While learning Spark 2 with Scala, I found that there are two ways to query data in Spark SQL:
- spark.sql(SQL_STATEMENT) // variable "spark" is an instance of SparkSession
- DataSet/DataFrame.select/.where/.groupBy....
My question is: what are the differences (functional, performance, etc.) between them? I tried to find the answer on the internet and in the documentation, but failed, so I would like to hear your opinions.
I think queries written with and without SQL strings are equivalent: internally, both go through the same engine. However, I would prefer the API without SQL strings, since it is easier to write and provides some level of type safety.
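A minimal sketch of the "same engine" claim (assuming a local SparkSession and a small hypothetical `people` table): both styles go through the Catalyst optimizer, and `explain()` shows matching physical plans.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical setup: a local session and a small in-memory table.
val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

val people = Seq(("Alice", 34), ("Bob", 21)).toDF("name", "age")
people.createOrReplaceTempView("people")

// 1. SQL string
val viaSql = spark.sql("SELECT name FROM people WHERE age > 30")

// 2. DataFrame API
val viaApi = people.where($"age" > 30).select($"name")

// Both produce the same optimized physical plan.
viaSql.explain()
viaApi.explain()
```

Comparing the two `explain()` outputs is the simplest way to convince yourself there is no inherent performance difference between the two styles.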
Among these:
1. spark.sql(SQL_STATEMENT) // variable "spark" is a SparkSession
2. DataSet/DataFrame.select/.where/.groupBy....
I would choose number 2 in most cases, since it provides some level of type safety.
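To illustrate the type-safety point, here is a sketch (with a hypothetical `Person` case class) contrasting where a typo in a column name is caught in each style:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("typesafe").getOrCreate()
import spark.implicits._

// Hypothetical domain class for a typed Dataset.
case class Person(name: String, age: Int)
val ds = Seq(Person("Alice", 34), Person("Bob", 21)).toDS()

// Typed API: a typo such as p.agee fails at compile time.
val adults = ds.filter(p => p.age > 30)

// SQL string: the same typo is only detected at runtime,
// as an AnalysisException when the query is analyzed.
ds.createOrReplaceTempView("people")
// spark.sql("SELECT * FROM people WHERE agee > 30")  // would throw at runtime
```

Note the safety is only partial: untyped `DataFrame` column references like `$"agee"` are still resolved at runtime, so the full benefit requires the typed `Dataset` lambda style.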
By using the DataFrame API, one can debug a query by breaking it down into simple statements. This helps in understanding what each step does.
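As a sketch of that debugging style (assuming a small hypothetical `people` DataFrame), each step of the query gets its own name and can be inspected in isolation:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("debug").getOrCreate()
import spark.implicits._

val people = Seq(("Alice", 34), ("Bob", 21), ("Eve", 15)).toDF("name", "age")

// Each intermediate step is a named value that can be inspected independently.
val adults = people.where($"age" > 18)
adults.show()            // check the filter before going further

val counts = adults.groupBy($"name").count()
counts.printSchema()     // check the schema produced by the aggregation
counts.show()
```

With a single monolithic SQL string, this kind of step-by-step inspection would require rewriting the query each time.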
The one thing that can make a difference is which underlying algorithm is used for grouping: HashAggregation vs. SortAggregation.

- SortAggregation sorts the rows and then gathers the matching rows together: O(n log n).
- HashAggregation builds a hash map with the grouping columns as keys and the remaining columns as aggregation values: O(n).

Spark SQL uses HashAggregation where possible (when the data for the aggregated values is mutable), so it is generally more efficient than SortAggregation.
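A sketch of how to observe which aggregation strategy Spark picks (assuming a local session; the exact fallback behavior varies by Spark version):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("agg").getOrCreate()
import spark.implicits._

val sales = Seq(("a", 10), ("b", 20), ("a", 5)).toDF("key", "amount")

// sum() uses a mutable fixed-width aggregation buffer,
// so the physical plan should show HashAggregate nodes.
sales.groupBy($"key").sum("amount").explain()

// collect_list() needs a variable-length buffer; depending on the
// Spark version, the plan may show SortAggregate (or ObjectHashAggregate)
// instead of the hash-based path.
sales.groupBy($"key").agg(org.apache.spark.sql.functions.collect_list($"amount")).explain()
```

Since both query styles compile to the same plans, this choice of aggregation algorithm is driven by the aggregate functions and data types involved, not by whether you wrote SQL or used the DataFrame API.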