Spark 2.0 Dataset vs DataFrame

Starting out with Spark 2.0.1, I have some questions. I have read a lot of documentation, but so far could not find sufficient answers:

  • What is the difference between
    • df.select("foo")
    • df.select($"foo")
  • Do I understand correctly that:
    • myDataSet.map(foo.someVal) is type-safe and will not convert into an RDD but stay in the Dataset representation, with no additional overhead (performance-wise, for 2.0.0)?
    • All the other commands, e.g. select, are just syntactic sugar? They are not type-safe, and a map could be used instead. How could I make df.select("foo") type-safe without a map statement?
    • Why should I use a UDF / UDAF instead of a map (assuming map stays in the Dataset representation)?
asked Nov 14 '16 by Georg Heiler



1 Answer

  1. The difference between df.select("foo") and df.select($"foo") is the signature. The former takes at least one String, the latter zero or more Columns. There is no practical difference beyond that.
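The signature distinction can be sketched in plain Scala (a toy model, not Spark's actual Dataset class; the names ToyDataset and Column here are illustrative stand-ins):

```scala
// Toy model of the two select signatures (illustrative, not Spark's API):
//   select(col: String, cols: String*)  -- requires at least one String
//   select(cols: Column*)               -- accepts zero or more Columns
case class Column(name: String)

object ToyDataset {
  def select(col: String, cols: String*): Seq[String] = col +: cols
  def select(cols: Column*): Seq[String] = cols.map(_.name)
}

// Both calls resolve to the same columns; only the overload differs.
val fromStrings = ToyDataset.select("foo", "bar")
val fromColumns = ToyDataset.select(Column("foo"), Column("bar"))
```

With zero arguments only the Column* overload applies, which is why `df.select()` is legal but `df.select("foo")` needs at least one column name.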
  2. myDataSet.map(foo.someVal) type-checks, but since any Dataset operation uses an RDD of objects, compared to DataFrame operations there is a significant overhead. Let's take a look at a simple example:

    case class FooBar(foo: Int, bar: String)
    val ds = Seq(FooBar(1, "x")).toDS
    ds.map(_.foo).explain

    == Physical Plan ==
    *SerializeFromObject [input[0, int, true] AS value#123]
    +- *MapElements <function1>, obj#122: int
       +- *DeserializeToObject newInstance(class $line67.$read$$iw$$iw$FooBar), obj#121: $line67.$read$$iw$$iw$FooBar
          +- LocalTableScan [foo#117, bar#118]

    As you can see, this execution plan requires access to all fields and has to DeserializeToObject.
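To make the overhead concrete, here is a plain-Scala sketch (no Spark involved; the columnar arrays below only model how rows are kept internally) of the difference between deserializing whole objects and reading one column directly:

```scala
// Plain-Scala sketch of the plans above (no Spark involved).
case class FooBar(foo: Int, bar: String)

// Model of a columnar layout, roughly how rows are stored internally:
val fooCol = Array(1, 2, 3)
val barCol = Array("x", "y", "z")

// ds.map(_.foo): every row is first rebuilt as a FooBar object
// (DeserializeToObject), only to have one field projected out again.
val viaObjects = fooCol.indices.map(i => FooBar(fooCol(i), barCol(i))).map(_.foo)

// ds.select($"foo"): the needed column is read directly, no objects built.
val viaColumn = fooCol.toSeq
```

Both paths produce the same values; the object-based path simply pays for allocating and tearing down a FooBar per row.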

  3. No. In general, the other methods are not syntactic sugar and generate a significantly different execution plan. For example:

    ds.select($"foo").explain

    == Physical Plan ==
    LocalTableScan [foo#117]

    Compared to the plan shown before, it can access the column directly. This is not so much a limitation of the API as a result of the difference in operational semantics.

  4. How could I make df.select("foo") type-safe without a map statement?

    There is no such option. While typed columns allow you to transform a statically typed Dataset into another statically typed Dataset:

    ds.select($"bar".as[Int]) 

    they are not type-safe. There have been some other attempts to include type-safe optimized operations, like typed aggregations, but this API is experimental.
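Why the check is deferred can be modeled in plain Scala (a toy TypedColumn and select, not Spark's; due to type erasure the cast inside select is unchecked at compile time):

```scala
// Toy model (not Spark's API) of why $"bar".as[Int] is not statically safe:
// the column name and target type are plain values, so the compiler cannot
// verify that "bar" exists or actually holds Ints.
case class TypedColumn[T](name: String)

def select[T](row: Map[String, Any], col: TypedColumn[T]): T =
  row(col.name).asInstanceOf[T] // erased cast, only checked at runtime

val row = Map[String, Any]("bar" -> "not an int")

// Compiles, and even "succeeds", because the cast is erased:
val s = select(row, TypedColumn[String]("bar"))
// select(row, TypedColumn[Int]("bar")) also compiles; a ClassCastException
// only surfaces once the result is actually used as an Int.
```

This mirrors `as[Int]` on a column of the wrong type: the program compiles, and the mismatch only shows up when the query is executed.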

  5. Why should I use a UDF / UDAF instead of a map?

    It is completely up to you. Each distributed data structure in Spark provides its own advantages and disadvantages (see for example Spark UDAF with ArrayType as bufferSchema performance issues).

Personally, I find the statically typed Dataset to be the least useful:

  • They don't provide the same range of optimizations as Dataset[Row] (although they share the storage format and some execution plan optimizations, they don't fully benefit from code generation or off-heap storage), nor access to all the analytical capabilities of the DataFrame.

  • Typed transformations are black boxes and effectively create an analysis barrier for the optimizer. For example, selections (filters) cannot be pushed over a typed transformation:

    ds.groupBy("foo").agg(sum($"bar") as "bar").as[FooBar]
      .filter(x => true)
      .where($"foo" === 1)
      .explain

    == Physical Plan ==
    *Filter (foo#133 = 1)
    +- *Filter <function1>.apply
       +- *HashAggregate(keys=[foo#133], functions=[sum(cast(bar#134 as double))])
          +- Exchange hashpartitioning(foo#133, 200)
             +- *HashAggregate(keys=[foo#133], functions=[partial_sum(cast(bar#134 as double))])
                +- LocalTableScan [foo#133, bar#134]

    Compared to:

    ds.groupBy("foo").agg(sum($"bar") as "bar").as[FooBar]
      .where($"foo" === 1)
      .explain

    == Physical Plan ==
    *HashAggregate(keys=[foo#133], functions=[sum(cast(bar#134 as double))])
    +- Exchange hashpartitioning(foo#133, 200)
       +- *HashAggregate(keys=[foo#133], functions=[partial_sum(cast(bar#134 as double))])
          +- *Filter (foo#133 = 1)
             +- LocalTableScan [foo#133, bar#134]

    This impacts features like predicate pushdown or projection pushdown.

  • They are not as flexible as RDDs, with only a small subset of types supported natively.

  • "Type safety" with Encoders is disputable when a Dataset is converted using the as method. Because the data shape is not encoded in the signature, a compiler can only verify the existence of an Encoder.
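This last point can be illustrated with a toy model in plain Scala (the Enc trait and Tbl class are hypothetical stand-ins for Spark's Encoder and Dataset): the compiler only demands that an implicit encoder instance exists for T, and never compares T against the actual columns.

```scala
// Hypothetical stand-ins for Encoder / Dataset (not Spark's classes):
trait Enc[T]
object Enc {
  implicit val intEnc: Enc[Int] = new Enc[Int] {}
}

case class Tbl(columns: Map[String, Seq[Any]]) {
  // as[T] compiles whenever an Enc[T] is in scope; the data shape
  // (column names and value types) is never inspected.
  def as[T](implicit enc: Enc[T]): Tbl = this
}

val df = Tbl(Map("value" -> Seq("a", "b"))) // Strings, not Ints
val ds = df.as[Int]                         // compiles anyway
```

Just as here, `df.as[SomeCaseClass]` in Spark type-checks against the encoder alone, and a schema mismatch surfaces only at analysis or run time.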

Related questions:

  • Perform a typed join in Scala with Spark Datasets
  • Spark 2.0 DataSets groupByKey and divide operation and type safety
answered Sep 29 '22 by zero323