What are the strongly typed API and the untyped API with respect to Spark Datasets?
How are Datasets similar or dissimilar to DataFrames?
A Dataset is a strongly typed collection of objects that can be transformed in parallel using functional or relational operations. A Dataset differs from an RDD in that, internally, it is represented by a Catalyst logical plan and its data is stored in encoded form.
A Dataset is Spark SQL's strongly typed structured query abstraction for working with semi-structured and structured data, i.e. records with a known schema, by means of encoders.
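A DataFrame is a Dataset[Row], where Row is a generic, untyped object. Because of that, a DataFrame is untyped and not type-safe: its column names and types are checked only at runtime. Datasets, on the other hand, are checked against their element type at compile time. That is why Datasets are type-safe.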
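To make the compile-time versus runtime distinction concrete, here is a minimal sketch, assuming a local SparkSession and a hypothetical `Reading` case class (neither appears in the original answer). The misspelled column name compiles on the DataFrame and fails only when Spark analyzes the query, while the equivalent typo on the Dataset is rejected by the Scala compiler.

```scala
import org.apache.spark.sql.{AnalysisException, SparkSession}
import scala.util.{Failure, Try}

// Hypothetical record type, used only for illustration.
case class Reading(device: String, signal: Int)

object TypeSafetyDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("type-safety-demo")
      .master("local[*]") // assumption: run locally
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Reading("sensor-a", 5)).toDS()
    val df = ds.toDF()

    // Untyped: "signl" is a typo, but the code compiles.
    // Spark raises an AnalysisException when it resolves the column at runtime.
    Try(df.select("signl")) match {
      case Failure(e: AnalysisException) => println(s"runtime error: ${e.getMessage}")
      case other                         => println(s"unexpected result: $other")
    }

    // Typed: the same typo does not compile, so it never reaches the cluster.
    // ds.map(_.signl) // compile error: value signl is not a member of Reading

    spark.stop()
  }
}
```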
DataFrames provide an API for performing aggregation operations quickly. RDDs are slower than both DataFrames and Datasets at simple operations such as grouping data. Datasets are faster than RDDs but a bit slower than DataFrames. Hence, DataFrames perform aggregations faster than both RDDs and Datasets.
DataFrame APIs are untyped APIs because the types are known only at runtime, whereas Dataset APIs are typed APIs whose types are known at compile time.
df.select("device").where("signal > 10") // using untyped APIs
ds.filter(_.signal > 10).map(_.device) // using typed APIs
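The two one-liners above can be run side by side with a small self-contained program. This is only a sketch: the `DeviceReading` case class, the sample rows, and the local master setting are assumptions made for illustration, and the untyped query filters before selecting so that `signal` is still in scope when the condition is applied.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Hypothetical record type; the field names follow the snippets above.
case class DeviceReading(device: String, signal: Int)

object TypedVsUntyped {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("typed-vs-untyped")
      .master("local[*]") // assumption: run locally
      .getOrCreate()
    import spark.implicits._ // brings the encoders for case classes and String into scope

    val ds: Dataset[DeviceReading] = Seq(
      DeviceReading("sensor-a", 5),
      DeviceReading("sensor-b", 42)
    ).toDS()
    val df: DataFrame = ds.toDF()

    // Untyped API: columns are referenced by name in strings,
    // and the result is a DataFrame of generic Row objects.
    val untyped: DataFrame = df.where("signal > 10").select("device")

    // Typed API: lambdas operate on DeviceReading objects,
    // and the result type Dataset[String] is known to the compiler.
    val typed: Dataset[String] = ds.filter(_.signal > 10).map(_.device)

    untyped.show()
    typed.show()

    spark.stop()
  }
}
```

With this sample data, both queries return the single device `sensor-b`; the difference lies in the static type of the result, not in the rows produced.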