I have a UDF which returns a dataframe. Something like the one below
scala> predict_churn(Vectors.dense(2.0,1.0,0.0,3.0,4.0,4.0,0.0,4.0,5.0,2.0))
res3: org.apache.spark.sql.DataFrame = [noprob: string, yesprob: string, pred: string]
scala> predict_churn(Vectors.dense(2.0,1.0,0.0,3.0,4.0,4.0,0.0,4.0,5.0,2.0)).show
+------------------+------------------+----+
| noprob| yesprob|pred|
+------------------+------------------+----+
|0.3619977592578127|0.6380022407421874| 1.0|
+------------------+------------------+----+
however when i try to register this as a UDF using the command
hiveContext.udf.register("predict_churn", outerpredict _)
i get an error like
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.sql.DataFrame is not supported
at org.apache.spark.sql.catalyst.ScalaReflection$class.schemaFor(ScalaReflection.scala:715)
Is returning dataframe not supported. I am using Spark 1.6.1 and Scala 2.10. If this is not supported how can i return multiple columns to an external program please.
Thanks
Bala
Is returning dataframe not supported
Correct - you can't return a DataFrame from a UDF. UDFs should return types that are convertable into the supported column types:
In your case, you can use a case class:
case class Record(noprob: Double, yesprob: Double, pred: Double)
And have your UDF (predict_churn) return Record.
Then, when applied to a single record (as UDFs are), this case class will be converted into columns named as its members (and with the correct types), resulting in a DataFrame similar to the one currently returned by your function.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With