To get good performance with Spark, I'm wondering whether it is better to run SQL queries via SQLContext or to express the queries through DataFrame functions like df.select().
Any idea? :)
Test results: RDDs outperformed DataFrames and SparkSQL for certain types of data processing. DataFrames and SparkSQL performed about the same, although SparkSQL had a slight advantage in analyses involving aggregation and sorting.
SparkSQL is just that: SQL-based transformations. The DataFrame option relies on method calls, which are more familiar to programmers. You can get the same result using either SparkSQL or DataFrames; in fact, most of the built-in Spark functionality is mirrored between the two.
The high-level query language and additional type information make Spark SQL more efficient. Spark SQL uses in-memory columnar storage, a feature that stores data in a columnar format rather than a row format.
Another important advantage of Spark SQL is that it can load and query data from different sources, so data access is unified. It also offers standard connectivity, since Spark SQL can be reached through JDBC or ODBC, and it can be used for faster processing of Hive tables.
There is no performance difference whatsoever. Both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to personal preference.
Arguably, DataFrame queries are much easier to construct programmatically and provide minimal type safety.
Plain SQL queries can be significantly more concise and easier to understand. They are also portable and can be used without any modifications with every supported language. With HiveContext, these can also be used to expose some functionality that can be inaccessible in other ways (for example, UDFs without Spark wrappers).
Ideally, Spark's Catalyst optimizer should optimize both calls to the same execution plan, and the performance should be the same. Which one you call is just a matter of style. In reality, there is a difference according to the report by Hortonworks (https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html ), where SQL outperforms DataFrames in a case where you need grouped records with their total counts, sorted in descending order by record name.