Spark UDF with varargs

Is the only option to list all the arguments explicitly, up to 22, as shown in the documentation?

https://spark.apache.org/docs/1.5.0/api/scala/index.html#org.apache.spark.sql.UDFRegistration

Has anyone figured out how to do something similar to this?

sc.udf.register("func", (s: String*) => s......

(I'm writing a custom concat function that skips nulls; at the moment it only handles 2 arguments.)

Thanks

asked Oct 15 '15 by devopslife

People also ask

Why are UDFs not recommended in Spark?

When you use a UDF you lose the optimizations Spark applies to your DataFrame/Dataset: the UDF is a black box to Spark's optimizer, so Catalyst cannot inspect or optimize the expression.
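
For illustration (not part of the original page), assuming a hypothetical DataFrame df with a string column x1, comparing the plans of a built-in function with an equivalent UDF shows the difference:

import org.apache.spark.sql.functions.{udf, upper}

// Built-in function: Catalyst sees the expression and can optimize it
val withBuiltin = df.select(upper($"x1"))

// Equivalent UDF: Catalyst only sees an opaque function call
val upperUdf = udf((s: String) => if (s != null) s.toUpperCase else null)
val withUdf = df.select(upperUdf($"x1"))

withBuiltin.explain()
withUdf.explain()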

Can a Spark UDF return multiple columns?

A UDF can return only a single column at a time.
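
A common workaround, sketched here with hypothetical names (a DataFrame people with a string column name), is to return a struct (e.g. a case class) as that single column and expand it afterwards:

import org.apache.spark.sql.functions.udf

case class NameParts(first: String, last: String)

val splitName = udf((full: String) => {
  val parts = full.split(" ", 2)
  NameParts(parts(0), if (parts.length > 1) parts(1) else null)
})

// The UDF returns one struct column, which is then expanded
people.select(splitName($"name").alias("parts"))
  .select($"parts.first", $"parts.last")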

What is the difference between a UDF and a UDAF in Spark SQL?

Spark SQL supports integration of Hive UDFs, UDAFs and UDTFs. Like Spark UDFs and UDAFs, Hive UDFs work on a single row as input and generate a single row as output, while Hive UDAFs operate on multiple rows and return a single aggregated row as a result.
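
For contrast, here is a minimal sketch (not from the original page) of a Spark UDAF, a null-skipping sum over a long column, using the UserDefinedAggregateFunction API available since Spark 1.5:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

object MySum extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", LongType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sum", LongType) :: Nil)
  def dataType: DataType = LongType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit = { buffer(0) = 0L }

  // Called once per input row
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) buffer(0) = buffer.getLong(0) + input.getLong(0)
  }

  // Called when partial aggregates are combined
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
  }

  def evaluate(buffer: Row): Any = buffer.getLong(0)
}

// Hypothetical usage: df.groupBy($"key").agg(MySum($"value"))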

How does a Spark UDF work?

In Spark you create a UDF by writing a function in the language you use with Spark. For example, if you are using Spark with Scala, you write the function in Scala and either wrap it with udf() to use it on a DataFrame or register it to use it from SQL.
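
As a minimal sketch (assuming a hypothetical DataFrame df with a string column x1 that is also registered as a temp table named df):

import org.apache.spark.sql.functions.udf

// A plain Scala function...
val stripSpaces = (s: String) => if (s == null) null else s.replaceAll("\\s+", "")

// ...wrapped with udf() for use on a DataFrame
val stripSpacesUdf = udf(stripSpaces)
df.select(stripSpacesUdf($"x1"))

// ...or registered for use from SQL
sqlContext.udf.register("stripSpaces", stripSpaces)
sqlContext.sql("SELECT stripSpaces(x1) FROM df")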


1 Answer

UDFs don't support varargs*, but you can pass an arbitrary number of columns wrapped with the array function:

import org.apache.spark.sql.functions.{udf, array, lit}

// Skip nulls, then join the remaining values with the separator
val myConcatFunc = (xs: Seq[String], sep: String) =>
  xs.filter(_ != null).mkString(sep)

val myConcat = udf(myConcatFunc)

An example usage:

val df = sc.parallelize(Seq(
  (null, "a", "b", "c"), ("d", null, null, "e")
)).toDF("x1", "x2", "x3", "x4")

val cols = array($"x1", $"x2", $"x3", $"x4")
val sep = lit("-")

df.select(myConcat(cols, sep).alias("concatenated")).show

// +------------+
// |concatenated|
// +------------+
// |       a-b-c|
// |         d-e|
// +------------+

With raw SQL:

df.registerTempTable("df")
sqlContext.udf.register("myConcat", myConcatFunc)

sqlContext.sql(
    "SELECT myConcat(array(x1, x2, x4), '.') AS concatenated FROM df"
).show

// +------------+
// |concatenated|
// +------------+
// |         a.c|
// |         d.e|
// +------------+

A slightly more complicated approach is to not use a UDF at all and compose SQL expressions, with something roughly like this:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.Column

// Fold over the columns, appending only non-null values followed by the
// separator, then strip the trailing separator with a regex
def myConcatExpr(sep: String, cols: Column*) = regexp_replace(concat(
  cols.foldLeft(lit(""))(
    (acc, c) => when(c.isNotNull, concat(acc, c, lit(sep))).otherwise(acc)
  )
), s"($sep)?$$", "")

df.select(
  myConcatExpr("-", $"x1", $"x2", $"x3", $"x4").alias("concatenated")
).show
// +------------+
// |concatenated|
// +------------+
// |       a-b-c|
// |         d-e|
// +------------+

but I doubt it is worth the effort unless you work with PySpark.


* If you pass a function using varargs, it will be stripped of all the syntactic sugar and the resulting UDF will expect an ArrayType. For example:

def f(s: String*) = s.mkString
udf(f _)

will be of type:

UserDefinedFunction(<function1>,StringType,List(ArrayType(StringType,true)))
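
In practice that means the resulting UDF has to be called with an array column rather than with separate arguments, for example:

import org.apache.spark.sql.functions.array

val g = udf(f _)
df.select(g(array($"x1", $"x2", $"x3")))
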
answered Sep 28 '22 by zero323