While learning Spark 2 with Scala, I found that there are two ways to query data in Spark SQL:
- spark.sql(SQL_STATEMENT) // variable "spark" is an instance of SparkSession
- DataSet/DataFrame.select/.where/.groupBy....
My question is: what are the differences (functional, performance, etc.) between them? I tried to find the answer on the internet and in the documentation, but failed, so I would like to hear your opinions.
I think queries written with and without SQL strings are equivalent: internally, both go through the same engine. However, I would prefer the API without SQL strings, since it is easier to write and provides some level of type safety.
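A minimal sketch of the "same engine" claim (assuming a local SparkSession and a small hypothetical `people` table): both styles go through the Catalyst optimizer, and `explain()` shows matching physical plans.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical setup: a local session and a small in-memory table.
val spark = SparkSession.builder().master("local[*]").appName("demo").getOrCreate()
import spark.implicits._

val people = Seq(("Alice", 34), ("Bob", 21)).toDF("name", "age")
people.createOrReplaceTempView("people")

// 1. SQL string
val viaSql = spark.sql("SELECT name FROM people WHERE age > 30")

// 2. DataFrame API
val viaApi = people.where($"age" > 30).select($"name")

// Both produce the same optimized physical plan.
viaSql.explain()
viaApi.explain()
```

Comparing the two `explain()` outputs is the simplest way to convince yourself there is no inherent performance difference between the two styles.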
Among these:
1. spark.sql(SQL_STATEMENT) // variable "spark" is a SparkSession
2. DataSet/DataFrame.select/.where/.groupBy....
I would choose number 2 in most cases, since it provides some level of type safety.
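To illustrate the type-safety point, here is a sketch (with a hypothetical `Person` case class) contrasting where a typo in a column name is caught in each style:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("typesafe").getOrCreate()
import spark.implicits._

// Hypothetical domain class for a typed Dataset.
case class Person(name: String, age: Int)
val ds = Seq(Person("Alice", 34), Person("Bob", 21)).toDS()

// Typed API: a typo such as p.agee fails at compile time.
val adults = ds.filter(p => p.age > 30)

// SQL string: the same typo is only detected at runtime,
// as an AnalysisException when the query is analyzed.
ds.createOrReplaceTempView("people")
// spark.sql("SELECT * FROM people WHERE agee > 30")  // would throw at runtime
```

Note the safety is only partial: untyped `DataFrame` column references like `$"agee"` are still resolved at runtime, so the full benefit requires the typed `Dataset` lambda style.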
By using the DataFrame API, one can debug a query by breaking it down into simple statements. This helps in understanding what each step does.
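As a sketch of that debugging style (assuming a small hypothetical `people` DataFrame), each step of the query gets its own name and can be inspected in isolation:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("debug").getOrCreate()
import spark.implicits._

val people = Seq(("Alice", 34), ("Bob", 21), ("Eve", 15)).toDF("name", "age")

// Each intermediate step is a named value that can be inspected independently.
val adults = people.where($"age" > 18)
adults.show()            // check the filter before going further

val counts = adults.groupBy($"name").count()
counts.printSchema()     // check the schema produced by the aggregation
counts.show()
```

With a single monolithic SQL string, this kind of step-by-step inspection would require rewriting the query each time.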
The one thing that can make a difference is which underlying algorithm is used for grouping: HashAggregation vs. SortAggregation.

- SortAggregation sorts the rows and then gathers the matching rows together: O(n log n).
- HashAggregation builds a hash map with the grouping columns as keys and the remaining columns as aggregation values: O(n).

Spark SQL uses HashAggregation where possible (when the data for the aggregated values is mutable), so it is generally more efficient than SortAggregation.
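A sketch of how to observe which aggregation strategy Spark picks (assuming a local session; the exact fallback behavior varies by Spark version):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("agg").getOrCreate()
import spark.implicits._

val sales = Seq(("a", 10), ("b", 20), ("a", 5)).toDF("key", "amount")

// sum() uses a mutable fixed-width aggregation buffer,
// so the physical plan should show HashAggregate nodes.
sales.groupBy($"key").sum("amount").explain()

// collect_list() needs a variable-length buffer; depending on the
// Spark version, the plan may show SortAggregate (or ObjectHashAggregate)
// instead of the hash-based path.
sales.groupBy($"key").agg(org.apache.spark.sql.functions.collect_list($"amount")).explain()
```

Since both query styles compile to the same plans, this choice of aggregation algorithm is driven by the aggregate functions and data types involved, not by whether you wrote SQL or used the DataFrame API.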