Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Issue with VectorUDT when using Spark ML

I am writing an UDAF to be applied to a Spark data frame column of type Vector (spark.ml.linalg.Vector). I rely on spark.ml.linalg package so that I do not have to go back and forth between dataframe and RDD.

Inside the UDAF, I have to specify a data type for the input, buffer, and output schemas:

def inputSchema = new StructType().add("features", new VectorUDT())
def bufferSchema: StructType =
    StructType(StructField("list_of_similarities", ArrayType(new VectorUDT(), true), true) :: Nil)

override def dataType: DataType = ArrayType(DoubleType,true) 

VectorUDT is what I would use with spark.mllib.linalg.Vector: https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/linalg/Vectors.scala

However, when I try to import it from spark.ml instead: import org.apache.spark.ml.linalg.VectorUDT I get a runtime error (no errors during the build):

class VectorUDT in package linalg cannot be accessed in package org.apache.spark.ml.linalg 

Is it expected/can you suggest a workaround?

I am using Spark 2.0.0

like image 283
Alexey Svyatkovskiy Avatar asked Aug 16 '16 17:08

Alexey Svyatkovskiy


People also ask

What is VectorUDT?

public class VectorUDT extends UserDefinedType<Vector> User-defined type for Vector which allows easy interaction with SQL via DataFrame .

How does spark ml work?

ML Dataset: Spark ML uses the SchemaRDD from Spark SQL as a dataset which can hold a variety of data types. E.g., a dataset could have different columns storing text, feature vectors, true labels, and predictions. Transformer : A Transformer is an algorithm which can transform one SchemaRDD into another SchemaRDD .


1 Answers

In Spark 2.0.0, the proper way to go is to use org.apache.spark.ml.linalg.SQLDataTypes.VectorType instead of VectorUDT. It was introduced in this issue.

like image 127
nedim Avatar answered Nov 01 '22 12:11

nedim