
Define return value in Spark Scala UDF

Imagine the following code:

def myUdf(arg: Int) = udf((vector: MyData) => {
  // complex logic that returns a Double
})

How can I define the return type for myUdf so that people looking at the code will know immediately that it returns a Double?

Marsellus Wallace asked May 31 '17 18:05





1 Answer

I see two ways to do it. Either define a method with an explicit return type first and then lift it to a function:

def myMethod(vector: MyData): Double = {
  // complex logic that returns a Double
}

val myUdf = udf(myMethod _)

or define a function first with an explicit type:

val myFunction: MyData => Double = (vector: MyData) => {
  // complex logic that returns a Double
}

val myUdf = udf(myFunction)

I normally use the first approach for my UDFs.
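Neither snippet above keeps the arg: Int parameter from the question. Here is a minimal sketch of how the first approach might be extended to cover it. The names score and TypedUdfSketch are hypothetical, and Seq[Double] is only a stand-in for MyData, whose definition the question does not show. Note that UserDefinedFunction does not carry the Double in its own type, so the typed method is what documents the return value.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, udf}

object TypedUdfSketch {
  // Plain method: the Double return type is stated in the signature.
  // Seq[Double] is an assumed stand-in for the question's MyData.
  def score(arg: Int, vector: Seq[Double]): Double =
    vector.sum * arg

  // UDF factory with an explicit result type. The wrapped lambda delegates
  // to the typed method, so readers can see that it yields a Double.
  def myUdf(arg: Int): UserDefinedFunction =
    udf((vector: Seq[Double]) => score(arg, vector))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("typed-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    // Small illustrative DataFrame with one array<double> column.
    val df = Seq(Seq(1.0, 2.0), Seq(3.0, 4.5)).toDF("features")
    df.select(myUdf(2)(col("features")).as("score")).show()

    spark.stop()
  }
}
```

Keeping the logic in a plain method also means it can be unit tested without starting a SparkSession.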

Raphael Roth answered Sep 29 '22 22:09
