Say I have an dataframe which contains a column (called colA) which is a seq of row. I want to to append a new field to each record of colA. (And the new filed is associated with the former record, so I have to write an udf.) How should I write this udf?
I have tried to write a udf, which takes colA as input, and output Seq[Row] where each record contains the new filed. But the problem is the udf cannot return Seq[Row]/ The exception is 'Schema for type org.apache.spark.sql.Row is not supported'. What should I do?
The udf that I wrote:
val convert = udf[Seq[Row], Seq[Row]](blablabla...)
And the exception is java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.Row is not supported
since spark 2.0 you can create UDFs which return Row
/ Seq[Row]
, but you must provide the schema for the return type, e.g. if you work with an Array of Doubles :
val schema = ArrayType(DoubleType)
val myUDF = udf((s: Seq[Row]) => {
s // just pass data without modification
}, schema)
But I cant really imagine where this is useful, I would rather return tuples or case classes (or Seq thereof) from the UDFs.
EDIT : It could be useful if your row contains more than 22 fields (limit of fields for tuples/case classes)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With