I have a "StructType" column in a Spark DataFrame that has an array and a string as sub-fields. I'd like to modify the array and return a new column of the same type. Can I process it with a UDF? Or what are the alternatives?
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

val sub_schema = StructType(StructField("col1", ArrayType(IntegerType, false), true) ::
  StructField("col2", StringType, true) :: Nil)
val schema = StructType(StructField("subtable", sub_schema, true) :: Nil)
val data = Seq(Row(Row(Array(1, 2), "eb")), Row(Row(Array(3, 2, 1), "dsf")))
val rd = sc.parallelize(data)
val df = spark.createDataFrame(rd, schema)
df.printSchema

root
 |-- subtable: struct (nullable = true)
 |    |-- col1: array (nullable = true)
 |    |    |-- element: integer (containsNull = false)
 |    |-- col2: string (nullable = true)
It seems that I need a UDF of the type Row, something like
val u = udf((x: Row) => x)
// fails with: Schema for type org.apache.spark.sql.Row is not supported
This makes sense, since Spark does not know the schema for the return type. Unfortunately, udf.register fails too:
spark.udf.register("foo", (x: Row) => Row, sub_schema)
<console>:30: error: overloaded method value register with alternatives: ...
It turns out you can pass the result schema as a second UDF parameter:
val u = udf((x:Row) => x, sub_schema)
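For completeness, here is a minimal sketch of that schema-typed UDF actually transforming the struct, assuming the `df` and `sub_schema` from the question are in scope. The array reversal is just an arbitrary example transformation, and `reverseCol1` is a made-up name:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Build a new Row with the array reversed and col2 untouched;
// sub_schema tells Spark the type of the returned Row.
val reverseCol1 = udf((r: Row) => Row(
  r.getAs[Seq[Int]]("col1").reverse,
  r.getAs[String]("col2")
), sub_schema)

// The column keeps exactly the same struct type as before.
val reversed = df.withColumn("subtable", reverseCol1(df("subtable")))
reversed.printSchema()
```

Because the schema is supplied explicitly, the output struct is identical to the input one, which is what the question asks for.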
You are on the right track. In this scenario a UDF will make your life easy, but as you have already encountered, a UDF cannot return types that Spark does not know about. So you need to return something Spark can easily serialize: either a case class or a tuple such as (Seq[Int], String). Here is a modified version of your code:
def main(args: Array[String]): Unit = {
  import org.apache.spark.sql.Row
  import org.apache.spark.sql.functions._
  import org.apache.spark.sql.types._

  val sub_schema = StructType(StructField("col1", ArrayType(IntegerType, false), true) ::
    StructField("col2", StringType, true) :: Nil)
  val schema = StructType(StructField("subtable", sub_schema, true) :: Nil)
  val data = Seq(Row(Row(Array(1, 2), "eb")), Row(Row(Array(3, 2, 1), "dsf")))
  val rd = spark.sparkContext.parallelize(data)
  val df = spark.createDataFrame(rd, schema)
  df.printSchema()
  df.show(false)

  val mapArray = (subRows: Row) => {
    // I prefer reading values from the Row by column name; an index works too
    val col1 = subRows.getAs[Seq[Int]]("col1")
    val mappedCol1 = col1.map(x => x * x) // use whatever mapping your requirements call for
    (mappedCol1, subRows.getAs[String]("col2")) // col2 is passed through unchanged
  }
  val mapUdf = udf(mapArray)

  val newDf = df.withColumn("col1_mapped", mapUdf(df("subtable")))
  newDf.show(false)
  newDf.printSchema()
}
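The case-class route mentioned above could look like the following sketch. The `SubTable` name and the squaring transformation are illustrative only, and `df` is the DataFrame built above; Spark derives the result schema from the case class fields, so no explicit schema argument is needed:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Hypothetical case class mirroring the struct; Spark infers
// col1: array<int>, col2: string from its fields.
case class SubTable(col1: Seq[Int], col2: String)

val mapUdfCc = udf((subRow: Row) =>
  SubTable(
    subRow.getAs[Seq[Int]]("col1").map(x => x * x), // same squaring as above
    subRow.getAs[String]("col2")                    // col2 passed through
  ))

val newDfCc = df.withColumn("col1_mapped", mapUdfCc(df("subtable")))
newDfCc.printSchema()
```

In the spark-shell, define the case class at the top level so the encoder can be derived.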
Please take a look at these links; they may give you more insight.