Which is more efficient: DataFrame, RDD, or HiveQL?

I am a newbie to Apache Spark.

My job is to read two CSV files, select some specific columns from them, join them, aggregate the result, and write it to a single CSV file.

For example,

CSV1

name,age,department_id

CSV2

department_id,department_name,location

I want to get a third CSV file with

name,age,department_name

I am loading both CSVs into DataFrames, and I am then able to produce the third DataFrame using several DataFrame methods: join, select, filter, and drop (a sketch follows).
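For reference, a minimal sketch of the DataFrame approach, assuming the Spark 2.x+ SparkSession API (the original question predates it and used Spark 1.x DataFrames) and placeholder file paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-join").getOrCreate()

    # Load both CSVs with a header row; paths are placeholders.
    people = spark.read.csv("csv1.csv", header=True, inferSchema=True)
    departments = spark.read.csv("csv2.csv", header=True, inferSchema=True)

    # Join on department_id and keep only the needed columns.
    result = (people.join(departments, "department_id")
                    .select("name", "age", "department_name"))

    # coalesce(1) produces a single output file, at the cost of parallelism.
    result.coalesce(1).write.csv("output", header=True)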

I am also able to do the same using several RDD.map() calls (sketched below).
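A rough equivalent with the RDD API, again with placeholder paths; header-row filtering and type handling are omitted for brevity:

    sc = spark.sparkContext

    # Key each record by department_id so the two RDDs can be joined.
    people_rdd = (sc.textFile("csv1.csv")
                    .map(lambda line: line.split(","))
                    .map(lambda r: (r[2], (r[0], r[1]))))   # (dept_id, (name, age))
    dept_rdd = (sc.textFile("csv2.csv")
                  .map(lambda line: line.split(","))
                  .map(lambda r: (r[0], r[1])))             # (dept_id, dept_name)

    # join yields (dept_id, ((name, age), dept_name)).
    joined = people_rdd.join(dept_rdd)
    rows = joined.map(lambda kv: ",".join([kv[1][0][0], kv[1][0][1], kv[1][1]]))
    rows.saveAsTextFile("output_rdd")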

And I am also able to do the same by executing HiveQL through HiveContext (sketched below).
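The same logic expressed as SQL. In Spark 1.x this would run through HiveContext; the sketch below uses SparkSession, which subsumes it in Spark 2.x+. The view names are placeholders:

    # Register the DataFrames as temporary views and join them in SQL.
    people.createOrReplaceTempView("people")
    departments.createOrReplaceTempView("departments")

    result_sql = spark.sql("""
        SELECT p.name, p.age, d.department_name
        FROM people p
        JOIN departments d ON p.department_id = d.department_id
    """)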

I want to know which is the most efficient way if my CSV files are huge, and why.

asked Jul 16 '15 by sag


People also ask

Which is better RDD or Dataframe?

RDDs are slower than both DataFrames and Datasets at simple operations such as grouping data. DataFrames provide an easy API for aggregation and perform aggregations faster than both RDDs and Datasets. Datasets are faster than RDDs but a bit slower than DataFrames.

Is Spark SQL faster than Dataframe?

Test results: RDDs outperformed DataFrames and Spark SQL for certain types of data processing. DataFrames and Spark SQL performed about the same, although in analyses involving aggregation and sorting Spark SQL had a slight advantage.

Why DataSet is faster than Dataframe?

Datasets can give better performance than DataFrames in some scenarios because they provide Encoders and compile-time type safety. DataFrames remain in wide use, though, mainly for untyped, SQL-style transformations where type safety is not needed.

What is the advantage of using Apache Spark SQL over RDDs?

Apache Spark, an open-source framework for big data, has various advantages over other big data solutions: it is dynamic in nature, supports in-memory computation of RDDs, and provides reusability, fault tolerance, real-time stream processing, and more.


1 Answer

This blog post contains benchmarks; DataFrames are much more efficient than RDDs:

https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

Here is a snippet from the blog:

At a high level, there are two kinds of optimizations. First, Catalyst applies logical optimizations such as predicate pushdown. The optimizer can push filter predicates down into the data source, enabling the physical execution to skip irrelevant data. In the case of Parquet files, entire blocks can be skipped and comparisons on strings can be turned into cheaper integer comparisons via dictionary encoding. In the case of relational databases, predicates are pushed down into the external databases to reduce the amount of data traffic. Second, Catalyst compiles operations into physical plans for execution and generates JVM bytecode for those plans that is often more optimized than hand-written code. For example, it can choose intelligently between broadcast joins and shuffle joins to reduce network traffic. It can also perform lower level optimizations such as eliminating expensive object allocations and reducing virtual function calls. As a result, we expect performance improvements for existing Spark programs when they migrate to DataFrames.
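To see these optimizations against your own data, you can inspect the plan Catalyst produces. A small sketch, reusing the hypothetical DataFrames from the question above:

    # explain(True) prints the logical and physical plans. Look for
    # PushedFilters in the scan node (predicate pushdown) and for
    # BroadcastHashJoin vs. SortMergeJoin in the join node.
    filtered = (people.filter(people.age > 30)
                      .join(departments, "department_id"))
    filtered.explain(True)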

And here is the performance benchmark: https://databricks.com/wp-content/uploads/2015/02/Screen-Shot-2015-02-16-at-9.46.39-AM.png

answered Oct 02 '22 by sag