Spark Build Custom Column Function, user defined function

I’m using Scala and want to build my own DataFrame function. For example, I want to treat a column like an array, iterate through each element, and make a calculation.

To start off, I’m trying to implement my own getMax method. So column x would have the values [3,8,2,5,9], and the expected output of the method would be 9.

Here is what it looks like in plain Scala:

def getMax(inputArray: Array[Int]): Int = {
   var maxValue = inputArray(0)
   for (i <- 1 until inputArray.length if inputArray(i) > maxValue) {
     maxValue = inputArray(i)
   }
   maxValue
}

This is what I have so far, and I get this error:

"value length is not a member of org.apache.spark.sql.Column"

I don't know how else to iterate through the column.

def getMax(col: Column): Column = {
  var maxValue = col(0)
  for (i <- 1 until col.length if col(i) > maxValue) {
    maxValue = col(i)
  }
  maxValue
}

Once I am able to implement my own method, I will create a column function:

val value_max: org.apache.spark.sql.Column = getMax(df.col("value")).as("value_max")

And then I hope to be able to use this in a SQL statement, for example:

val sample = sqlContext.sql("SELECT value_max(x) FROM table")

The expected output would be 9, given the input column [3,8,2,5,9].

I am following an answer from another thread, "Spark Scala - How do I iterate rows in dataframe, and add calculated values as new columns of the data frame", where they create a private method for standard deviation. The calculations I will do will be more complex than this (e.g. I will be comparing each element in the column). Am I going in the correct direction, or should I be looking more into User Defined Functions?

asked Apr 11 '16 10:04

other15

1 Answer

In a Spark DataFrame, you can't iterate through the elements of a Column using the approaches you thought of because a Column is not an iterable object.
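
To make that concrete, here is a minimal sketch (written as a script, e.g. for spark-shell) showing that a Column is only a description of a computation. As far as I know, building these expressions does not need a SparkSession or any data; the column name "x" is just an illustration.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

// A Column is an unevaluated expression that refers to a column by name.
val x: Column = col("x")

// Operators on a Column build a larger expression; nothing is computed here.
val doubled: Column = x * 2
val isLarge: Column = x > 5

// x(0) is also just an expression (item extraction from an array or map column),
// not "the value in the first row". This is why col(0) compiles but col.length does not.
val firstItem: Column = x(0)

// Printing shows the expression itself, not any column values.
println(doubled)
println(isLarge)
println(firstItem)
```

The values only exist once the expression is evaluated against a DataFrame, for example through select or withColumn.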

However, to process the values of a column, you have some options and the right one depends on your task:

1) Using the existing built-in functions

Spark SQL already has plenty of useful functions for processing columns, including aggregation and transformation functions. Most of them are in the org.apache.spark.sql.functions object (documentation here). Others (binary functions in general) are defined directly on the Column class (documentation here). So, if you can use them, it's usually the best option. Note: don't forget the Window Functions.
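
For the getMax case in the question, a built-in is probably enough. The sketch below assumes Spark 2.4 or later (for SparkSession and array_max) and a local session; the view name "numbers" and the column names are made up for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_max, col, max}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("builtin-max").getOrCreate()
import spark.implicits._

// Case 1: one value per row. max is a built-in aggregate function.
val df = Seq(3, 8, 2, 5, 9).toDF("x")
val maxValue = df.agg(max("x").as("value_max")).first().getInt(0)

// The same aggregate is available from SQL once the DataFrame is a temp view.
df.createOrReplaceTempView("numbers")
val sqlMax = spark.sql("SELECT max(x) AS value_max FROM numbers").first().getInt(0)

// Case 2: the column itself holds arrays. array_max works on each row's array.
val arrDF = Seq(Seq(3, 8, 2, 5, 9)).toDF("xs")
val arrMax = arrDF.select(array_max(col("xs"))).first().getInt(0)

println(s"agg: $maxValue, sql: $sqlMax, array_max: $arrMax")
```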

2) Creating a UDF

If you can't complete your task with the built-in functions, you may consider defining a UDF (User Defined Function). UDFs are useful when you can process each item of a column independently and you expect to produce a new column with the same number of rows as the original one (not an aggregated column). This approach is quite simple: first you define a plain Scala function, then you register it as a UDF, then you use it. Example:

def myFunc: (String => String) = { s => s.toLowerCase }

import org.apache.spark.sql.functions.udf
val myUDF = udf(myFunc)

val newDF = df.withColumn("newCol", myUDF(df("oldCol")))

For more information, here's a nice article.
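
Applied to the question, a UDF fits when each row of the column holds an array and you want one result per row. The following is a sketch under that assumption, using SparkSession (Spark 2.0+); the view name "arrays" is hypothetical, and "value_max" is the SQL name the question asked for.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("udf-sketch").getOrCreate()
import spark.implicits._

// Each row holds one array of integers.
val df = Seq(Seq(3, 8, 2, 5, 9), Seq(1, 4)).toDF("x")

// Plain Scala function. Spark hands array columns to Scala UDFs as Seq.
// Note: this throws on an empty or null array; guard for that in real code.
val getMax: Seq[Int] => Int = xs => xs.max

// Wrap it for use in the DataFrame API.
val getMaxUDF = udf(getMax)
val withMax = df.withColumn("value_max", getMaxUDF(col("x")))

// Register it under a name so it can be called from SQL.
spark.udf.register("value_max", getMax)
df.createOrReplaceTempView("arrays")
val viaSql = spark.sql("SELECT value_max(x) AS value_max FROM arrays")

withMax.show()
viaSql.show()
```

The function runs once per row, so it cannot see values from other rows. If the maximum has to be taken across rows, an aggregate is the better fit.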

3) Using a UDAF

If your task is to create aggregated data, you can define a UDAF (User Defined Aggregation Function). I don't have a lot of experience with this, but I can point you to a nice tutorial:

https://ragrawal.wordpress.com/2015/11/03/spark-custom-udaf-example/
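
As a rough sketch of what a custom aggregate can look like, the example below uses the Aggregator class together with functions.udaf, which assumes Spark 3.0 or later (the linked tutorial uses the older UserDefinedAggregateFunction API instead). MaxAgg and the view name "numbers" are names invented for this example.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// Aggregator[IN, BUF, OUT]: input type, intermediate buffer type, result type.
object MaxAgg extends Aggregator[Int, Int, Int] {
  // Starting buffer; an empty input therefore yields Int.MinValue.
  def zero: Int = Int.MinValue
  // Fold one input value into the buffer.
  def reduce(buffer: Int, value: Int): Int = math.max(buffer, value)
  // Combine buffers computed on different partitions.
  def merge(b1: Int, b2: Int): Int = math.max(b1, b2)
  // Turn the final buffer into the result.
  def finish(reduction: Int): Int = reduction
  def bufferEncoder: Encoder[Int] = Encoders.scalaInt
  def outputEncoder: Encoder[Int] = Encoders.scalaInt
}

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("udaf-sketch").getOrCreate()
import spark.implicits._

// Register the aggregate under the SQL name used in the question.
spark.udf.register("value_max", udaf(MaxAgg))

val df = Seq(3, 8, 2, 5, 9).toDF("x")
df.createOrReplaceTempView("numbers")

val result = spark.sql("SELECT value_max(x) AS value_max FROM numbers").first().getInt(0)
println(result)
```

The same four-method shape (zero, reduce, merge, finish) carries over to more complex calculations; only the buffer type usually needs to grow.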

4) Fall back to RDD processing

If you really can't use the options above, or if processing one row depends on other rows and the task is not an aggregation, then I think you would have to select the column you want and process it using the corresponding RDD. Example:

// df("column") alone is only a Column; select gives a single-column DataFrame
val singleColumnDF = df.select("column")

val myRDD = singleColumnDF.rdd

// process myRDD
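
To illustrate what "process myRDD" might involve, here is a small sketch that computes the maximum and then a calculation that needs neighbouring rows (the difference between each element and the previous one). It assumes a local SparkSession and an integer column "x"; it also relies on the row order of this small local DataFrame, which is not guaranteed for DataFrames in general.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("rdd-sketch").getOrCreate()
import spark.implicits._

val df = Seq(3, 8, 2, 5, 9).toDF("x")

// Single-column DataFrame -> RDD[Row] -> RDD[Int]
val ints = df.select("x").rdd.map(row => row.getInt(0))

// Simple reduction across all elements.
val maxValue = ints.reduce((a, b) => math.max(a, b))

// Cross-row calculation: key every element by its position,
// then join each position with the element that precedes it.
val indexed = ints.zipWithIndex().map { case (v, i) => (i, v) }
val previous = indexed.map { case (i, v) => (i + 1, v) }
val diffs = indexed.join(previous)
  .map { case (i, (current, prev)) => (i, current - prev) }
  .sortByKey()
  .values
  .collect()

println(s"max = $maxValue")
println(diffs.mkString(", "))
```

For this particular neighbour comparison the lag window function would likely be simpler; the RDD route is for logic that neither built-ins nor UDFs can express.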

So, those are the options I could think of. I hope it helps.

answered Oct 16 '22 05:10

Daniel de Paula

Tags: scala, apache-spark, apache-spark-sql
