Which is more efficient: DataFrame, RDD, or HiveQL?

I am a newbie to Apache Spark.

My job is to read two CSV files, select some specific columns from them, join them, aggregate the result, and write it to a single CSV file.

For example,

CSV1

name,age,department_id

CSV2

department_id,department_name,location

I want to get a third CSV file with

name,age,department_name

I am loading both CSVs into DataFrames, and I am then able to produce the third DataFrame using several DataFrame methods: join, select, filter, and drop (a sketch follows).
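For reference, a minimal sketch of the DataFrame approach, assuming the Spark 2.x+ SparkSession API (the original question predates it and used Spark 1.x DataFrames) and placeholder file paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-join").getOrCreate()

    # Load both CSVs with a header row; paths are placeholders.
    people = spark.read.csv("csv1.csv", header=True, inferSchema=True)
    departments = spark.read.csv("csv2.csv", header=True, inferSchema=True)

    # Join on department_id and keep only the needed columns.
    result = (people.join(departments, "department_id")
                    .select("name", "age", "department_name"))

    # coalesce(1) produces a single output file, at the cost of parallelism.
    result.coalesce(1).write.csv("output", header=True)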

I am also able to do the same using several RDD.map() calls (sketched below).
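A rough equivalent with the RDD API, again with placeholder paths; header-row filtering and type handling are omitted for brevity:

    sc = spark.sparkContext

    # Key each record by department_id so the two RDDs can be joined.
    people_rdd = (sc.textFile("csv1.csv")
                    .map(lambda line: line.split(","))
                    .map(lambda r: (r[2], (r[0], r[1]))))   # (dept_id, (name, age))
    dept_rdd = (sc.textFile("csv2.csv")
                  .map(lambda line: line.split(","))
                  .map(lambda r: (r[0], r[1])))             # (dept_id, dept_name)

    # join yields (dept_id, ((name, age), dept_name)).
    joined = people_rdd.join(dept_rdd)
    rows = joined.map(lambda kv: ",".join([kv[1][0][0], kv[1][0][1], kv[1][1]]))
    rows.saveAsTextFile("output_rdd")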

And I am also able to do the same by executing HiveQL through HiveContext (sketched below).
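The same logic expressed as SQL. In Spark 1.x this would run through HiveContext; the sketch below uses SparkSession, which subsumes it in Spark 2.x+. The view names are placeholders:

    # Register the DataFrames as temporary views and join them in SQL.
    people.createOrReplaceTempView("people")
    departments.createOrReplaceTempView("departments")

    result_sql = spark.sql("""
        SELECT p.name, p.age, d.department_name
        FROM people p
        JOIN departments d ON p.department_id = d.department_id
    """)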

I want to know which is the most efficient way if my CSV files are huge, and why.

asked Jul 16 '15 by sag


People also ask

Which is better RDD or Dataframe?

RDDs are slower than both DataFrames and Datasets at simple operations such as grouping data. DataFrames provide an easy API for aggregation and perform aggregations faster than both RDDs and Datasets. Datasets are faster than RDDs but a bit slower than DataFrames.

Is Spark SQL faster than Dataframe?

Test results: RDDs outperformed DataFrames and Spark SQL for certain types of data processing. DataFrames and Spark SQL performed about the same, although in analyses involving aggregation and sorting Spark SQL had a slight advantage.

Why DataSet is faster than Dataframe?

Datasets can give better performance than DataFrames in some scenarios because they provide Encoders and compile-time type safety. DataFrames remain in wide use, though, mainly for untyped, SQL-style transformations where type safety is not needed.

What is the advantage of using Apache Spark SQL over RDDs?

Apache Spark, an open-source framework for big data, has various advantages over other big data solutions: it is dynamic in nature, supports in-memory computation of RDDs, and provides reusability, fault tolerance, real-time stream processing, and more.


1 Answer

This blog post contains benchmarks; DataFrames are much more efficient than RDDs:

https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html

Here is a snippet from the blog:

At a high level, there are two kinds of optimizations. First, Catalyst applies logical optimizations such as predicate pushdown. The optimizer can push filter predicates down into the data source, enabling the physical execution to skip irrelevant data. In the case of Parquet files, entire blocks can be skipped and comparisons on strings can be turned into cheaper integer comparisons via dictionary encoding. In the case of relational databases, predicates are pushed down into the external databases to reduce the amount of data traffic. Second, Catalyst compiles operations into physical plans for execution and generates JVM bytecode for those plans that is often more optimized than hand-written code. For example, it can choose intelligently between broadcast joins and shuffle joins to reduce network traffic. It can also perform lower level optimizations such as eliminating expensive object allocations and reducing virtual function calls. As a result, we expect performance improvements for existing Spark programs when they migrate to DataFrames.
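To see these optimizations against your own data, you can inspect the plan Catalyst produces. A small sketch, reusing the hypothetical DataFrames from the question above:

    # explain(True) prints the logical and physical plans. Look for
    # PushedFilters in the scan node (predicate pushdown) and for
    # BroadcastHashJoin vs. SortMergeJoin in the join node.
    filtered = (people.filter(people.age > 30)
                      .join(departments, "department_id"))
    filtered.explain(True)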

And here is the performance benchmark: https://databricks.com/wp-content/uploads/2015/02/Screen-Shot-2015-02-16-at-9.46.39-AM.png

answered Oct 02 '22 by sag