 

Spark sql queries vs dataframe functions

To get good performance with Spark, I'm wondering whether it is better to run SQL queries via SQLContext, or to express queries with DataFrame functions like df.select().

Any idea? :)

Philippe Paulos asked Feb 05 '16 11:02


People also ask

Is Spark SQL faster than Spark DataFrame?

Test results: RDDs outperformed DataFrames and SparkSQL for certain types of data processing. DataFrames and SparkSQL performed about the same, although in analyses involving aggregation and sorting SparkSQL had a slight advantage.

What is difference between Spark SQL and DataFrame?

SparkSQL is just that: SQL-based transformations. The DataFrame option is more method calls, which tends to be more familiar to programmers. You can get the same result using either SparkSQL or the DataFrame API; in fact, most of the built-in Spark functionality mirrors each other between the two.

Is Spark SQL more efficient?

The high-level query language and additional type information make Spark SQL more efficient. Spark SQL uses in-memory columnar storage, which stores data in a columnar format rather than a row format.

What is the advantage of Spark SQL?

Another important advantage of Spark SQL is that loading and querying can be done for data from different sources, so data access is unified. It offers standard connectivity, as Spark SQL can be reached through JDBC or ODBC, and it can be used for faster processing of Hive tables.


2 Answers

There is no performance difference whatsoever. Both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to personal preference.

  • Arguably, DataFrame queries are much easier to construct programmatically and provide minimal type safety.

  • Plain SQL queries can be significantly more concise and easier to understand. They are also portable and can be used without modification in every supported language. With HiveContext, they can also expose functionality that is otherwise inaccessible (for example, Hive UDFs without Spark wrappers).

3 revs, 2 users 75% answered Oct 17 '22 09:10


Ideally, Spark's Catalyst optimizer should compile both calls to the same execution plan, so performance should be the same; which one to call is just a matter of style. In reality, there is a difference according to a report by Hortonworks (https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html), where SQL outperformed DataFrames in a case that groups records, takes their total counts, and sorts descending by record name.

Danylo Zherebetskyy answered Oct 17 '22 10:10