I am writing a UDAF to be applied to a Spark DataFrame column of type Vector (spark.ml.linalg.Vector). I rely on the spark.ml.linalg package so that I do not have to convert back and forth between DataFrames and RDDs.
Inside the UDAF, I have to specify data types for the input, buffer, and output schemas:
def inputSchema: StructType = new StructType().add("features", new VectorUDT())

def bufferSchema: StructType =
  StructType(StructField("list_of_similarities", ArrayType(new VectorUDT(), true), true) :: Nil)

override def dataType: DataType = ArrayType(DoubleType, true)
VectorUDT is what I would use with spark.mllib.linalg.Vector: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala
However, when I try to import it from spark.ml instead:
import org.apache.spark.ml.linalg.VectorUDT
I get a runtime error (there are no errors at build time):
class VectorUDT in package linalg cannot be accessed in package org.apache.spark.ml.linalg
Is this expected? Can you suggest a workaround?
I am using Spark 2.0.0.
In Spark 2.0.0, the proper way to go is to use org.apache.spark.ml.linalg.SQLDataTypes.VectorType instead of VectorUDT. It was introduced in this issue.
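As a concrete illustration, here is a minimal sketch of how the three schema definitions from the question could be rewritten against the public VectorType constant. It assumes Spark 2.0.0 is on the classpath; the field names "features" and "list_of_similarities" are simply taken from the question's code.

```scala
// SQLDataTypes.VectorType is the public alias for the (package-private)
// ml.linalg.VectorUDT, so it can be used anywhere a DataType is expected.
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.sql.types._

// Input schema: a single Vector column named "features"
val inputSchema: StructType = new StructType().add("features", VectorType)

// Buffer schema: a nullable array of Vectors
val bufferSchema: StructType =
  StructType(StructField("list_of_similarities", ArrayType(VectorType, true), true) :: Nil)

// Output type: a nullable array of doubles
val dataType: DataType = ArrayType(DoubleType, true)
```

Inside an actual UDAF these would be the bodies of inputSchema, bufferSchema, and dataType; the only change from the question's code is replacing each new VectorUDT() with VectorType.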