Spark Dataset select with TypedColumn

Looking at the select() function on the Spark Dataset, there are various generated function signatures:

(c1: TypedColumn[MyClass, U1], c2: TypedColumn[MyClass, U2], ...)

This seems to hint that I should be able to reference the members of MyClass directly and be type safe, but I'm not sure how...

ds.select("member") of course works .. seems like ds.select(_.member) might also work somehow?

asked Jul 28 '16 by Jeremy



2 Answers

In the Scala DSL for select, there are many ways to identify a Column:

  • From a symbol: 'name
  • From a string: $"name" or col("name")
  • From an expression: expr("nvl(name, 'unknown') as renamed")

To get a TypedColumn from Column you simply use myCol.as[T].

For example: ds.select(col("name").as[String])

answered Oct 17 '22 by Sim


If you want the equivalent of ds.select(_.member) just use map:

case class MyClass(member: MyMember, foo: A, bar: B)
val ds: Dataset[MyClass] = ???
val members: Dataset[MyMember] = ds.map(_.member)

Edit: The argument for not using map.

A more performant way of doing the same thing is through a projection, without using map at all. You lose the compile-time type checking, but in exchange you give the Catalyst query engine a chance to do something more optimized. As @Sim alludes to in his comment below, the primary optimization is that the whole contents of MyClass do not need to be deserialized from Tungsten memory space into JVM heap memory just to call the accessor, and the result of _.member does not need to be serialized back into Tungsten.

To make the example more concrete, let's redefine our data model like this:

  // Make sure these are not nested classes
  // (i.e. they are defined in a top-level compilation unit).
  case class MyMember(something: Double)
  case class MyClass(member: MyMember, foo: Int, bar: String)

These need to be case classes so that SQLImplicits.newProductEncoder[T <: Product] can provide us with an implicit Encoder[MyClass], required by the Dataset[T] API.
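As a small sketch of what that looks like in practice (assuming a SparkSession value named spark whose implicits are imported, which the toDS() call below also requires):

  import org.apache.spark.sql.Encoder
  import spark.implicits._  // brings SQLImplicits.newProductEncoder into scope

  // Resolves because MyClass is a top-level case class (a Product).
  val enc: Encoder[MyClass] = implicitly[Encoder[MyClass]]
  println(enc.schema.treeString)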

Now we can make the example above more concrete:

  val ds: Dataset[MyClass] = Seq(MyClass(MyMember(1.0), 2, "three")).toDS()
  val membersMapped: Dataset[Double] = ds.map(_.member.something)
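For reference, printing the schema shows that the case class fields become top-level columns, with member kept as a nested struct (the output below is approximate):

  ds.printSchema()
  // root
  //  |-- member: struct (nullable = true)
  //  |    |-- something: double (nullable = false)
  //  |-- foo: integer (nullable = false)
  //  |-- bar: string (nullable = true)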

To see what's going on behind the scenes we use the explain() method:

membersMapped.explain()

== Physical Plan ==
*(1) SerializeFromObject [input[0, double, false] AS value#19]
+- *(1) MapElements <function1>, obj#18: double
   +- *(1) DeserializeToObject newInstance(class MyClass), obj#17: MyClass
      +- LocalTableScan [member#12, foo#13, bar#14]

This makes the serialization to and from Tungsten explicit in the physical plan.

Let's get to the same value using a projection[^1]:

val ds2: Dataset[Double] = ds.select($"member.something".as[Double])
ds2.explain()

== Physical Plan ==
LocalTableScan [something#25]

That's it! A single step[^2]. No serialization other than the encoding of MyClass into the original Dataset.

[^1]: The reason the projection is defined as $"member.something" rather than $"value.member.something" has to do with Catalyst automatically projecting the members of a single column DataFrame.

[^2]: To be fair, the * next to the steps in the first physical plan indicates that they will be implemented by a WholeStageCodegenExec, whereby those steps become a single, on-the-fly-compiled JVM function that has its own set of runtime optimizations applied to it. So in practice you'd have to test the performance empirically to really assess the benefits of each approach.
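As a rough sketch of such an empirical test (the data size, the timing helper, and the reduce action are all assumptions for illustration, and it relies on the spark session and implicits from the snippets above; a real comparison should use warm-up runs and many iterations):

// Build a larger Dataset so any difference is measurable (the size is arbitrary).
val big: Dataset[MyClass] =
  spark.range(0L, 1000000L).as[Long]
    .map(i => MyClass(MyMember(i.toDouble), i.toInt, i.toString))

// Naive wall-clock timer, good enough for a ballpark comparison only.
def time[A](label: String)(block: => A): A = {
  val start = System.nanoTime()
  val result = block
  println(f"$label: ${(System.nanoTime() - start) / 1e6}%.1f ms")
  result
}

time("map accessor") { big.map(_.member.something).reduce(_ + _) }
time("projection")   { big.select($"member.something".as[Double]).reduce(_ + _) }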

answered Oct 17 '22 by metasim