Defining a UDF that accepts an Array of objects in a Spark DataFrame?

When working with Spark's DataFrames, User Defined Functions (UDFs) are needed for mapping the data in columns. UDFs require that argument types be explicitly specified. In my case, I need to manipulate a column that is made up of arrays of objects, and I do not know what type to use. Here's an example:

import sqlContext.implicits._

// Start with some data. Each row (here, there's only one row) 
// is a topic and a bunch of subjects
val data = sqlContext.read.json(sc.parallelize(Seq(
  """
  |{
  |  "topic" : "pets",
  |  "subjects" : [
  |    {"type" : "cat", "score" : 10},
  |    {"type" : "dog", "score" : 1}
  |  ]
  |}
  """.stripMargin)))

It's relatively straightforward to use the built-in org.apache.spark.sql.functions to perform basic operations on the data in the columns:

import org.apache.spark.sql.functions.size
data.select($"topic", size($"subjects")).show

+-----+--------------+
|topic|size(subjects)|
+-----+--------------+
| pets|             2|
+-----+--------------+

and it's generally easy to write custom UDFs to perform arbitrary operations:

import org.apache.spark.sql.functions.udf
val enhance = udf { topic : String => topic.toUpperCase() }
data.select(enhance($"topic"), size($"subjects")).show 

+----------+--------------+
|UDF(topic)|size(subjects)|
+----------+--------------+
|      PETS|             2|
+----------+--------------+

But what if I want to use a UDF to manipulate the array of objects in the "subjects" column? What type do I use for the argument in the UDF? For example, if I want to reimplement the size function, instead of using the one provided by Spark:

val my_size = udf { subjects: Array[Something] => subjects.size }
data.select($"topic", my_size($"subjects")).show

Clearly Array[Something] does not work... what type should I use!? Should I ditch Array[] altogether? Poking around suggests that scala.collection.mutable.WrappedArray may have something to do with it, but there's still another type I need to provide.

asked Aug 17 '16 by mattsilver

People also ask

Why UDF are not recommended in Spark?

It is well known that the use of UDFs (User Defined Functions) in Apache Spark, especially via the Python API, can compromise application performance. For this reason, at Damavis we try to avoid their use as much as possible in favour of native functions or SQL.
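
As an illustration, the enhance UDF defined above isn't actually necessary: the built-in upper function does the same job and lets Catalyst optimize the query. A minimal sketch against the example data:

import org.apache.spark.sql.functions.{size, upper}

// Native equivalent of the enhance UDF, with no UDF serialization overhead.
data.select(upper($"topic"), size($"subjects")).show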

Can a UDF return multiple columns?

Creating multiple top-level columns from a single UDF call isn't possible, but you can create a new struct and expand it afterwards, as sketched below.
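
A minimal sketch of that struct pattern against the example data; the Stats case class and the column names are illustrative, not part of the original post:

import org.apache.spark.sql.functions.udf

// One UDF call returns two values packed into a struct...
case class Stats(length: Int, upper: String)
val stats = udf { topic: String => Stats(topic.length, topic.toUpperCase) }

// ...which the s.* style selectors expand into top-level columns.
data.select(stats($"topic").as("s")).select($"s.length", $"s.upper").show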

Can PySpark UDF return multiple columns?

The short answer is: No. Using a PySpark UDF requires Spark to serialize the Scala objects, run a Python process, deserialize the data in Python, run the function, serialize the results, and deserialize them in Scala. This causes a considerable performance penalty, so I recommend avoiding UDFs in PySpark.

Can we pass DataFrame to UDF?

A UDF can only work on records, which could, in the broadest case, be an entire DataFrame if the UDF is a user-defined aggregate function (UDAF). If you want to work on more than one DataFrame in a UDF, you have to join the DataFrames first so that the columns you want to use are available in a single DataFrame.


1 Answer

What you're looking for is Seq[o.a.s.sql.Row]:

import org.apache.spark.sql.Row

val my_size = udf { subjects: Seq[Row] => subjects.size }
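
Applied to the example data, it behaves just like the built-in size:

// Should print the same count as size(subjects): pets -> 2.
data.select($"topic", my_size($"subjects")).show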

Explanation:

  • The current representation of ArrayType is, as you already know, WrappedArray, so Array won't work and it is better to stay on the safe side.
  • According to the official specification, the local (external) type for StructType is Row. Unfortunately, this means that access to the individual fields is not type safe; fields have to be read through Row's runtime accessors, as sketched below.
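
A minimal sketch of that runtime access, using the example data; top_type is an illustrative name, and the JSON reader parses score as a Long:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Fields are read by name at runtime; a wrong type surfaces as a
// ClassCastException instead of a compile-time error.
val top_type = udf { subjects: Seq[Row] =>
  subjects.maxBy(_.getAs[Long]("score")).getAs[String]("type")
}

data.select($"topic", top_type($"subjects")).show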

Notes:

  • To create a struct in Spark < 2.3, the function passed to udf has to return a Product type (Tuple* or a case class), not Row. That's because the corresponding udf variants depend on Scala reflection:

    Defines a Scala closure of n arguments as user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature.

  • In Spark >= 2.3 it is possible to return Row directly, as long as the schema is provided (a sketch follows these notes):

    def udf(f: AnyRef, dataType: DataType): UserDefinedFunction

    Defines a deterministic user-defined function (UDF) using a Scala closure. For this variant, the caller must specify the output data type, and there is no automatic input type coercion.

    See for example How to create a Spark UDF in Java / Kotlin which returns a complex type?.
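
A minimal sketch of both variants against the example data; BestSubject, best_subject and outSchema are illustrative names, and the Row-returning form assumes Spark >= 2.3:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types._

// Spark < 2.3: return a Product (here, a case class); the struct schema
// is inferred via Scala reflection.
case class BestSubject(`type`: String, score: Long)
val best_subject = udf { subjects: Seq[Row] =>
  val best = subjects.maxBy(_.getAs[Long]("score"))
  BestSubject(best.getAs[String]("type"), best.getAs[Long]("score"))
}

// Spark >= 2.3: return a Row directly and supply the output schema explicitly.
val outSchema = StructType(Seq(
  StructField("type", StringType),
  StructField("score", LongType)))
val best_subject_row = udf((subjects: Seq[Row]) => {
  val best = subjects.maxBy(_.getAs[Long]("score"))
  Row(best.getAs[String]("type"), best.getAs[Long]("score"))
}, outSchema)

data.select($"topic", best_subject($"subjects")).show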

answered Oct 01 '22 by zero323