 

Spark Scala DataFrame UDF returning Rows

Say I have a DataFrame which contains a column (called colA) that is a Seq of Row. I want to append a new field to each record of colA. (The new field is derived from the existing record, so I have to write a UDF.) How should I write this UDF?

I have tried to write a UDF which takes colA as input and outputs a Seq[Row] where each record contains the new field. But the problem is that the UDF cannot return Seq[Row]. The exception is 'Schema for type org.apache.spark.sql.Row is not supported'. What should I do?

The UDF that I wrote:

val convert = udf[Seq[Row], Seq[Row]](blablabla...)

and the exception is:

java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Row is not supported

asked Apr 08 '18 by wttttt

1 Answer

Since Spark 2.0 you can create UDFs which return Row / Seq[Row], but you have to provide the schema for the return type, e.g. if you work with an array of structs that each hold a double:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.{ArrayType, DoubleType, StructField, StructType}

// schema of the array the UDF returns; it must match the Rows you hand back
val schema = ArrayType(StructType(Seq(StructField("value", DoubleType))))

val myUDF = udf((s: Seq[Row]) => {
  s // just pass the data through without modification
}, schema)

But I can't really imagine where this is useful; I would rather return tuples or case classes (or a Seq thereof) from the UDF, as sketched below.
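
A minimal sketch of that case-class alternative, assuming (purely for illustration) that each element of colA has two double fields a and b and the appended field is their sum:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Hypothetical element type: Spark derives the array-of-struct schema for the
// return value from the case class, so no explicit schema is needed here.
case class Enriched(a: Double, b: Double, c: Double)

val convertTyped = udf((xs: Seq[Row]) => xs.map { r =>
  val a = r.getAs[Double]("a")
  val b = r.getAs[Double]("b")
  Enriched(a, b, a + b) // append the derived field
})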

EDIT: It could be useful if your row contains more than 22 fields (the limit on fields for tuples/case classes).
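
Applied to the original question, a rough sketch of the explicit-schema variant could look like this (df and the element fields a and b are assumptions for illustration; c is the appended field):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{ArrayType, DoubleType, StructField, StructType}

// schema of the returned array elements, including the appended field c
val elementSchema = StructType(Seq(
  StructField("a", DoubleType),
  StructField("b", DoubleType),
  StructField("c", DoubleType)
))

val appendField = udf((xs: Seq[Row]) => xs.map { r =>
  val a = r.getAs[Double]("a")
  val b = r.getAs[Double]("b")
  Row(a, b, a + b) // returned Rows must line up with elementSchema
}, ArrayType(elementSchema))

val result = df.withColumn("colA", appendField(col("colA")))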

answered Nov 08 '22 by Raphael Roth