Dataset filter: eta expansion is not done automatically

If I have a simple Scala collection of Ints and define a method isPositive that returns true when the value is greater than 0, I can pass the method directly to the collection's filter method, as in the example below:

def isPositive(i: Int): Boolean = i > 0

val aList = List(-3, -2, -1, 1, 2, 3)
val newList = aList.filter(isPositive)

> newList: List[Int] = List(1, 2, 3)

As far as I understand, the compiler automatically converts the method into a function instance through eta expansion and then passes that function as the argument.
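
To make that conversion concrete, here is a minimal sketch in plain Scala (no Spark), using the Scala 2 syntax from the question. The object name EtaExpansionDemo and the value names are made up for illustration; the hand-written lambda is only an approximation of what the compiler generates.

```scala
object EtaExpansionDemo {
  // A method: it belongs to the enclosing object and is not itself a value.
  def isPositive(i: Int): Boolean = i > 0

  // Explicit eta expansion: the trailing underscore yields a function value.
  val viaUnderscore: Int => Boolean = isPositive _

  // Roughly what the expansion amounts to: a function literal calling the method.
  val byHand: Int => Boolean = (i: Int) => isPositive(i)

  val aList = List(-3, -2, -1, 1, 2, 3)

  // List.filter has a single signature, so the expected type Int => Boolean
  // is known and the compiler performs the expansion on its own.
  val automatic: List[Int] = aList.filter(isPositive)

  def main(args: Array[String]): Unit = {
    println(automatic)                                // List(1, 2, 3)
    println(aList.filter(viaUnderscore) == automatic) // true
    println(aList.filter(byHand) == automatic)        // true
  }
}
```

All three forms hand filter the same kind of Int => Boolean value; they differ only in who writes the wrapper, the programmer or the compiler.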

However, if I do the same thing with a Spark Dataset:

val aDataset = aList.toDS
val newDataset = aDataset.filter(isPositive)

> error

It fails with the well-known "missing arguments for method" error. To make it work, I have to explicitly convert the method into a function by using "_":

val newDataset = aDataset.filter(isPositive _)

> newDataset: org.apache.spark.sql.Dataset[Int] = [value: int]

With map, however, it works as expected:

val newDataset = aDataset.map(isPositive)

> newDataset: org.apache.spark.sql.Dataset[Boolean] = [value: boolean]

Investigating the signatures, I see that the signature for Dataset's filter is very similar to List's filter:

// Dataset:
def filter(func: T => Boolean): Dataset[T]

// List (Defined in TraversableLike):
def filter(p: A => Boolean): Repr

So, why isn't the compiler doing eta expansion for the Dataset's filter operation?

Daniel de Paula asked Aug 09 '17 13:08


1 Answer

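This is due to the way overloaded methods interact with eta expansion. Unlike List.filter, Dataset.filter is overloaded: besides the T => Boolean variant, it also accepts a Column, a SQL expression String, or a Java FilterFunction[T]. The question "Eta-expansion between methods and functions with overloaded methods in Scala" explains why this fails.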

The gist of it is the following (emphasis mine):

when overloaded, applicability is undermined because there is no expected type (6.26.3, infamously). When not overloaded, 6.26.2 applies (eta expansion) because the type of the parameter determines the expected type. When overloaded, the arg is specifically typed with no expected type, hence 6.26.2 doesn't apply; therefore neither overloaded variant of d is deemed to be applicable.

.....

Candidates for overloading resolution are pre-screened by "shape". The shape test encapsulates the intuition that eta-expansion is never used because args are typed without an expected type. This example shows that eta-expansion is not used even when it is "the only way for the expression to type check."
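
The effect can be reproduced without Spark. The following is a minimal sketch, not Spark's implementation: Box is a hypothetical stand-in for Dataset, and its String overload only imitates the fact that Dataset.filter has several one-argument variants (what it does with the string is made up for the example). The form from the question is left commented out, since whether it compiles depends on the compiler version; the question reports it failing on its setup.

```scala
object OverloadedFilterDemo {
  // Hypothetical stand-in for Dataset[T]: two filter overloads, one argument each.
  final case class Box[T](items: List[T]) {
    def filter(func: T => Boolean): Box[T] = Box(items.filter(func))

    // Toy behaviour, invented for this sketch: keep items whose toString matches.
    def filter(conditionExpr: String): Box[T] =
      Box(items.filter(_.toString == conditionExpr))
  }

  def isPositive(i: Int): Boolean = i > 0

  val box = Box(List(-3, -2, -1, 1, 2, 3))

  // The form from the question. With two one-argument overloads the argument
  // is typed without an expected type, so no eta expansion is attempted:
  // val direct = box.filter(isPositive)

  // Workaround 1: explicit eta expansion, as in the question.
  val viaUnderscore = box.filter(isPositive _)

  // Workaround 2: a function literal with an explicit parameter type.
  val viaLambda = box.filter((x: Int) => isPositive(x))

  // Workaround 3: a value that is already a function.
  val asFunction: Int => Boolean = isPositive
  val viaFunctionValue = box.filter(asFunction)

  def main(args: Array[String]): Unit = {
    println(viaUnderscore.items)    // List(1, 2, 3)
    println(viaLambda.items)        // List(1, 2, 3)
    println(viaFunctionValue.items) // List(1, 2, 3)
  }
}
```

In each workaround the argument already has the type Int => Boolean before overload resolution starts, so only the function-taking variant is applicable.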

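As @DanielDePaula points out, we don't see this effect in Dataset.map because its two overloads take different numbers of arguments: the second one requires an additional Encoder[U] parameter. A call with a single argument can therefore only match the first overload, and its parameter type supplies the expected type that eta expansion needs: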

def map[U : Encoder](func: T => U): Dataset[U] = withTypedPlan {
  MapElements[T, U](func, logicalPlan)
}

def map[U](func: MapFunction[T, U], encoder: Encoder[U]): Dataset[U] = {
  implicit val uEnc = encoder
  withTypedPlan(MapElements[T, U](func, logicalPlan))
}
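
The same arity argument can be checked outside Spark. This is a minimal sketch under stated assumptions: Box and MapFn are hypothetical stand-ins (not Spark types), the implicit Encoder is omitted, and the second parameter of the two-argument overload is a plain String named label purely to give the overloads different arities.

```scala
object ArityOverloadDemo {
  // Hypothetical stand-in for a Java-style MapFunction; not Spark's type.
  trait MapFn[A, B] { def call(value: A): B }

  // Hypothetical stand-in for Dataset[T] with two map overloads of different arity.
  final case class Box[T](items: List[T]) {
    // One explicit argument.
    def map[U](func: T => U): Box[U] = Box(items.map(func))

    // Two explicit arguments, mirroring the shape of the Encoder-taking overload.
    def map[U](func: MapFn[T, U], label: String): Box[U] =
      Box(items.map(x => func.call(x)))
  }

  def isPositive(i: Int): Boolean = i > 0

  val box = Box(List(-3, -2, -1, 1, 2, 3))

  // Only the one-argument overload fits a one-argument call, so the expected
  // type Int => U is available and the method is eta-expanded automatically.
  val flags: Box[Boolean] = box.map(isPositive)

  def main(args: Array[String]): Unit = {
    println(flags.items) // List(false, false, false, true, true, true)
  }
}
```

Here the bare method name is accepted even though map is overloaded, because the number of arguments alone singles out one candidate.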
Yuval Itzchakov answered Oct 11 '22 14:10
