Spark Datasets - strong typing

What are the strongly typed API and the untyped API with respect to Spark Datasets?

How are Datasets similar or dissimilar to DataFrames?

Arvind Kumar asked Nov 09 '16 12:11


People also ask

What is strongly typed Dataset in Spark?

A Dataset is a strongly typed collection of objects that can be transformed in parallel using functional or relational operations. A Dataset differs from an RDD in that, internally, it is represented by a Catalyst logical plan and its data is stored in an encoded form.

Is Spark strongly typed?

Dataset is Spark SQL's strongly-typed structured query for working with semi- and structured data, i.e. records with a known schema, by means of encoders.

Why is a Dataset type-safe?

A DataFrame is a Dataset[Row], so its column names and types are checked only at runtime; in that sense it is untyped and not type-safe. Datasets, on the other hand, check at compile time that types conform to the specification. That is why Datasets are type-safe.

Why is a Dataset faster than an RDD?

DataFrames provide an API for performing aggregation operations quickly. RDDs are slower than both DataFrames and Datasets at simple operations such as grouping data. Datasets are faster than RDDs but a bit slower than DataFrames. Hence, DataFrames perform aggregation faster than both RDDs and Datasets.


1 Answer

The DataFrame API is untyped: column names and types are resolved only at runtime. The Dataset API is typed: the element type is known at compile time.

df.select("device").where("signal > 10")      // using untyped APIs   
ds.filter(_.signal > 10).map(_.device)         // using typed APIs
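
To make the two one-liners above concrete, here is a minimal, self-contained sketch. The case class `DeviceReading`, its field types, the sample rows, and the local `SparkSession` settings are assumptions made for illustration; only the `device` and `signal` names come from the snippet above.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type; only the field names come from the answer.
case class DeviceReading(device: String, signal: Long)

object TypedVsUntyped {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed settings).
    val spark = SparkSession.builder()
      .appName("typed-vs-untyped")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // encoders for case classes and primitives

    // Made-up sample data.
    val ds: Dataset[DeviceReading] =
      Seq(DeviceReading("a", 5L), DeviceReading("b", 20L)).toDS()
    val df: DataFrame = ds.toDF() // in Scala, DataFrame is an alias for Dataset[Row]

    // Untyped API: columns are referred to by name in strings,
    // so a misspelled name is only detected when the query is analyzed at runtime.
    val untyped: DataFrame = df.where("signal > 10").select("device")

    // Typed API: the lambdas operate on DeviceReading,
    // so a misspelled field is rejected by the Scala compiler.
    val typed: Dataset[String] = ds.filter(_.signal > 10).map(_.device)

    untyped.show()
    typed.show()

    // df.select("devcie")  // compiles, but fails at runtime with an AnalysisException
    // ds.map(_.devcie)     // does not compile: DeviceReading has no member devcie

    spark.stop()
  }
}
```

With the sample rows assumed here, both queries keep only the reading for device "b". The difference is when a mistake surfaces: the commented-out lines at the end show the same typo caught at runtime in the untyped case and at compile time in the typed case.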
Vignesh I answered Nov 10 '22 15:11