I want to get good performance out of Spark. I'm wondering whether it is better to run SQL queries via <code>SQLContext</code> or to write queries with DataFrame functions like <code>df.select()</code>. Any ideas? :)


Spark sql queries vs dataframe functions

2 Answers

There is no performance difference whatsoever. Both methods use exactly the same execution engine and internal data structures. At the end of the day, it all boils down to personal preference.

Arguably, DataFrame queries are much easier to construct programmatically and provide minimal type safety.
Plain SQL queries can be significantly more concise and easier to understand. They are also portable and can be used without modification from every supported language. With HiveContext, they can also expose functionality that may be inaccessible in other ways (for example, UDFs without Spark wrappers).
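
As a rough illustration of the two styles, here is a minimal PySpark sketch. The `people` view, its `name` and `age` columns, and both helper functions are hypothetical and not taken from the question. The pyspark import is used for type hints only, so the caller supplies their own `SparkSession`.

```python
from __future__ import annotations

from functools import reduce
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only
    from pyspark.sql import DataFrame, SparkSession


def adults_sql(spark: SparkSession) -> DataFrame:
    """SQL style: assumes a temp view named `people` is already registered."""
    return spark.sql(
        "SELECT name FROM people WHERE age >= 18 AND name IS NOT NULL"
    )


def adults_df(people: DataFrame) -> DataFrame:
    """DataFrame style: the same query, with the filter built from a list."""
    conditions = [people["age"] >= 18, people["name"].isNotNull()]
    # AND the conditions together; adding or dropping one is just a list edit.
    combined = reduce(lambda left, right: left & right, conditions)
    return people.filter(combined).select("name")
```

Given a session `spark` and a DataFrame `people` registered with `people.createOrReplaceTempView("people")`, `adults_sql(spark)` and `adults_df(people)` should return the same rows. The SQL version is shorter to read, while the DataFrame version is easier to assemble from code.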

103

answered Oct 17 '22 09:10

3 revs, 2 users 75%

Ideally, Spark's Catalyst optimizer should compile both calls to the same execution plan, so the performance should be the same. Which one you use is just a matter of style. In practice, there can be a difference according to the report by Hortonworks (https://community.hortonworks.com/articles/42027/rdd-vs-dataframe-vs-sparksql.html), where SQL outperforms DataFrames in a case where you need GROUPed records with their total COUNTs, SORTed DESCENDING by record name.
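
One way to check this on your own data is to write the grouped, counted, descending-sorted query in both forms and print the plans Spark produces. The sketch below assumes a hypothetical `records` view with a `name` column; the helper names are made up for illustration. Whether the two plans come out identical depends on the Spark version and the query, so treat this as a way to look rather than a guaranteed result.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only
    from pyspark.sql import DataFrame, SparkSession


def counts_sql(spark: SparkSession) -> DataFrame:
    """SQL form: assumes a temp view named `records` is already registered."""
    return spark.sql(
        "SELECT name, COUNT(*) AS total FROM records "
        "GROUP BY name ORDER BY name DESC"
    )


def counts_df(records: DataFrame) -> DataFrame:
    """DataFrame form of the same grouped count, sorted descending by name."""
    return (
        records.groupBy("name")
        .count()
        .withColumnRenamed("count", "total")
        .orderBy("name", ascending=False)
    )


def print_plans(*frames: DataFrame) -> None:
    """Print the parsed, analyzed, optimized and physical plan of each frame."""
    for frame in frames:
        frame.explain(True)
```

Calling `print_plans(counts_sql(spark), counts_df(records))` lets you compare the optimized and physical plans side by side before deciding that one style is faster for your workload.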

answered Oct 17 '22 10:10

Danylo Zherebetskyy


Tags: performance, sql, dataframe, apache-spark, apache-spark-sql

Philippe Paulos
