
Difference between DataSet API and DataFrame API [duplicate]

I'm just wondering what the difference is between an RDD and a DataFrame in Apache Spark (as of Spark 2.0.0, DataFrame is a mere type alias for Dataset[Row]).

Can you convert one to the other?

oikonomiyaki asked Jul 20 '15 02:07


People also ask

What is the difference between Dataset and DataFrame?

DataFrames are a SparkSQL data abstraction and are similar to relational database tables or Python Pandas DataFrames. A Dataset is also a SparkSQL structure and represents an extension of the DataFrame API. The Dataset API combines the performance optimization of DataFrames and the convenience of RDDs.

Is Dataset faster than DataFrame?

RDD: the RDD API is slower for simple grouping and aggregation operations. DataFrame: the DataFrame API is easy to use and is faster for exploratory analysis and for computing aggregated statistics on large data sets. Dataset: the Dataset API is likewise fast at aggregating large amounts of data.

What is a DataFrame API?

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R or in the Python pandas library.

What is a Dataset API?

Dataset API is a set of operators with typed and untyped transformations, and actions to work with a structured query (as a Dataset) as a whole.




1 Answer

A Google search for "DataFrame definition" gives a good definition of a DataFrame:

A data frame is a table, or two-dimensional array-like structure, in which each column contains measurements on one variable, and each row contains one case.

So, a DataFrame has additional metadata due to its tabular format, which allows Spark to run certain optimizations on the finalized query.

An RDD, on the other hand, is merely a Resilient Distributed Dataset: more of a black box of data that Spark cannot optimize, because the operations that can be performed against it are not as constrained.
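To make the contrast above concrete, here is a minimal sketch that runs the same filter once through the DataFrame API and once through the RDD API. It assumes Spark 2.x or later (for SparkSession) running in local mode; the object name, column names, and sample rows are made up for illustration. The DataFrame version prints the plans Spark built from the column expressions, while the RDD version only exposes its lineage, since the lambda is opaque to Spark.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name; the data and column names are illustrative only.
object PlanComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("plan-comparison")
      .master("local[*]") // assumption: local mode, just for experimenting
      .getOrCreate()
    import spark.implicits._

    val people = Seq(("Ann", 34), ("Bob", 27))

    // DataFrame: the filter is a column expression that Spark can inspect.
    val df = people.toDF("name", "age")
    // explain(true) prints the parsed, analyzed, optimized, and physical plans.
    df.filter($"age" > 30).select("name").explain(true)

    // RDD: the filter is an arbitrary Scala function, which Spark runs as-is.
    val rdd = spark.sparkContext.parallelize(people)
    val filtered = rdd.filter { case (_, age) => age > 30 }.map(_._1)
    // toDebugString shows only the chain of RDDs, not a query plan.
    println(filtered.toDebugString)

    spark.stop()
  }
}
```

The exact plan text depends on the Spark version, so no output is shown here.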

However, you can go from a DataFrame to an RDD via its rdd method, and you can go from an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
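The round trip described above might look like the following sketch. It assumes Spark 2.x or later with SparkSession and the implicits import that brings toDF into scope; the Person case class, the object name, and the sample data are hypothetical and only stand in for "an RDD in a tabular format".

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}

// Hypothetical names throughout; only rdd and toDF come from the answer above.
object RddDataFrameRoundTrip {
  // A case class gives the RDD a tabular shape that Spark can turn into a schema.
  case class Person(name: String, age: Int)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-dataframe-round-trip")
      .master("local[*]") // assumption: local mode, just for experimenting
      .getOrCreate()
    import spark.implicits._ // required for toDF on an RDD of case classes

    val peopleRdd: RDD[Person] =
      spark.sparkContext.parallelize(Seq(Person("Ann", 34), Person("Bob", 27)))

    // RDD -> DataFrame: column names and types are derived from Person.
    val peopleDf: DataFrame = peopleRdd.toDF()
    peopleDf.printSchema()

    // DataFrame -> RDD: the result is an RDD of generic Row objects,
    // so the Person type is not recovered automatically.
    val rowsRdd: RDD[Row] = peopleDf.rdd
    val names: Array[String] = rowsRdd.map(_.getAs[String]("name")).collect()
    println(names.mkString(", "))

    spark.stop()
  }
}
```

Note that going back through rdd yields Row objects rather than the original element type, so fields have to be read by name or position.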

In general, it is recommended to use a DataFrame where possible because of the built-in query optimization.

Justin Pihony answered Oct 16 '22 10:10