I have an RDD in Spark where the objects are based on a case class:
case class ExampleCaseClass(user: User, stuff: Stuff)
I want to use Spark's ML pipeline, so I convert this to a Spark data frame. As part of the pipeline, I want to transform one of the columns into a column whose entries are vectors. Since I want the length of that vector to vary with the model, it should be built into the pipeline as part of the feature transformation.
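For context, the conversion itself is just the standard toDF on an RDD of case-class instances. A minimal, self-contained sketch (the fields of User and Stuff below are placeholders, not the real ones):

import org.apache.spark.sql.SparkSession

// Placeholder case classes; the actual fields don't matter for the conversion.
case class User(name: String)
case class Stuff(value: Double)
case class ExampleCaseClass(user: User, stuff: Stuff)

val spark = SparkSession.builder().master("local[*]").appName("example").getOrCreate()
import spark.implicits._

val rdd = spark.sparkContext.parallelize(Seq(
  ExampleCaseClass(User("a"), Stuff(1.0)),
  ExampleCaseClass(User("b"), Stuff(2.0))
))

// Nested case classes become struct columns "user" and "stuff".
val df = rdd.toDF()
df.printSchema()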
So I attempted to define a Transformer as follows:
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.{IntParam, ParamMap}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.{StructField, StructType}

class MyTransformer extends Transformer {

  val uid = ""
  val num: IntParam = new IntParam(this, "", "")

  def setNum(value: Int): this.type = set(num, value)
  setDefault(num -> 50)

  def transform(df: DataFrame): DataFrame = {
    ...
  }

  def transformSchema(schema: StructType): StructType = {
    val inputFields = schema.fields
    StructType(inputFields :+ StructField("colName", ???, true))
  }

  def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}
How do I specify the DataType of the resulting field (i.e. fill in the ???)? It will be a Vector of some simple type (Boolean, Int, Double, etc.). It seems VectorUDT might have worked, but that's private to Spark. Since any RDD can be converted to a DataFrame, it seems any case class should be convertible to a custom DataType, but I can't figure out how to do that conversion manually; otherwise I could apply it to a simple case class wrapping the vector.
Furthermore, if I specify a vector type for the column, will VectorAssembler correctly process the vector into separate features when I go to fit the model?
I'm still new to Spark and especially to the ML Pipeline, so I'd appreciate any advice.
Use org.apache.spark.ml.linalg.SQLDataTypes.VectorType as the DataType for the new column:

import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
def transformSchema(schema: StructType): StructType = {
  val inputFields = schema.fields
  StructType(inputFields :+ StructField("colName", VectorType, true))
}
In Spark 2.x (available since 2.0.0), SQLDataTypes.VectorType exposes the otherwise-private VectorUDT:
package org.apache.spark.ml.linalg

import org.apache.spark.annotation.{DeveloperApi, Since}
import org.apache.spark.sql.types.DataType

/**
 * :: DeveloperApi ::
 * SQL data types for vectors and matrices.
 */
@Since("2.0.0")
@DeveloperApi
object SQLDataTypes {

  /** Data type for [[Vector]]. */
  val VectorType: DataType = new VectorUDT

  /** Data type for [[Matrix]]. */
  val MatrixType: DataType = new MatrixUDT
}
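Putting the pieces together, here is a minimal sketch of such a transformer against the Spark 2.x API (where Transformer.transform takes a Dataset[_]). The input and output column names, the hashing UDF, and the use of Identifiable.randomUID are illustrative assumptions, not anything prescribed by the question:

import org.apache.spark.ml.Transformer
import org.apache.spark.ml.linalg.{SQLDataTypes, Vectors}
import org.apache.spark.ml.param.{IntParam, ParamMap}
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{StructField, StructType}

class MyTransformer(override val uid: String) extends Transformer {

  def this() = this(Identifiable.randomUID("myTransformer"))

  // Length of the output vector; can vary per model via setNum.
  val num: IntParam = new IntParam(this, "num", "length of the output vector")
  def setNum(value: Int): this.type = set(num, value)
  setDefault(num -> 50)

  override def transform(dataset: Dataset[_]): DataFrame = {
    val n = $(num)
    // Hypothetical feature computation: hash a string column into a
    // fixed-length 0/1 indicator vector of size num.
    val toVec = udf { (s: String) =>
      val arr = new Array[Double](n)
      if (s != null) arr(((s.hashCode % n) + n) % n) = 1.0
      Vectors.dense(arr)
    }
    dataset.withColumn("colName", toVec(col("inputCol")))
  }

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField("colName", SQLDataTypes.VectorType, nullable = true))

  override def copy(extra: ParamMap): MyTransformer = defaultCopy(extra)
}

transformSchema is where VectorType answers the original question; the transform body is just one possible way to produce a vector whose length is controlled by the num param.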
A case class wrapping a Vector also converts directly to a DataFrame, and the resulting column carries this UDT (this example uses the older org.apache.spark.mllib.linalg.Vector):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

case class MyVector(vector: Vector)

val vectorDF = Seq(
  MyVector(Vectors.dense(1.0, 3.4, 4.4)),
  MyVector(Vectors.dense(5.5, 6.7))
).toDF
vectorDF.printSchema
root
|-- vector: vector (nullable = true)
println(vectorDF.schema.fields(0).dataType.prettyJson)
{
  "type" : "udt",
  "class" : "org.apache.spark.mllib.linalg.VectorUDT",
  "pyClass" : "pyspark.mllib.linalg.VectorUDT",
  "sqlType" : {
    "type" : "struct",
    "fields" : [ {
      "name" : "type",
      "type" : "byte",
      "nullable" : false,
      "metadata" : { }
    }, {
      "name" : "size",
      "type" : "integer",
      "nullable" : true,
      "metadata" : { }
    }, {
      "name" : "indices",
      "type" : {
        "type" : "array",
        "elementType" : "integer",
        "containsNull" : false
      },
      "nullable" : true,
      "metadata" : { }
    }, {
      "name" : "values",
      "type" : {
        "type" : "array",
        "elementType" : "double",
        "containsNull" : false
      },
      "nullable" : true,
      "metadata" : { }
    } ]
  }
}
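As for the second question: VectorAssembler accepts numeric, boolean and vector columns as inputs and flattens any vector input into the assembled output vector, so the vector column produced by the transformer can simply be listed among its input columns alongside ordinary feature columns. A minimal sketch (the column names and transformedDF are hypothetical):

import org.apache.spark.ml.feature.VectorAssembler

// "colName" is the vector column produced by the custom transformer,
// "age" is an ordinary numeric feature column.
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "colName"))
  .setOutputCol("features")

// Each element of "colName" contributes one slot of the assembled "features" vector.
val assembled = assembler.transform(transformedDF)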