What Type should the dense vector be, when using UDF function in Pyspark? [duplicate]

I want to convert a list column to a Vector in PySpark, and then use that column to train a machine learning model. But my Spark version is 1.6.0, which does not have VectorUDT(). So what type should I return from my udf function?

from pyspark.sql import SQLContext
from pyspark import SparkContext, SparkConf
from pyspark.sql.functions import *
from pyspark.mllib.linalg import DenseVector
from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import *


conf = SparkConf().setAppName('rank_test')
sc = SparkContext(conf=conf)
spark = SQLContext(sc)


df = spark.createDataFrame([[[0.1,0.2,0.3,0.4,0.5]]],['a'])
print '???'
df.show()
def list2vec(column):
    print '?????',column
    return Vectors.dense(column)
getVector = udf(lambda y: list2vec(y),DenseVector() )
df.withColumn('b',getVector(col('a'))).show()

I have tried many types, and DenseVector() gives me this error:

Traceback (most recent call last):
  File "t.py", line 21, in <module>
    getVector = udf(lambda y: list2vec(y),DenseVector() )
TypeError: __init__() takes exactly 2 arguments (1 given)

Help me, please.

nick_liu asked Apr 03 '18
People also ask

What is dense vector in Python?

A dense vector is represented by a value array. A NumPy array is used for storage, and arithmetic is delegated to the underlying NumPy array.
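To illustrate that delegation, here is a minimal sketch in plain NumPy (this is the concept, not PySpark's actual DenseVector implementation):

```python
import numpy as np

# A dense vector is conceptually a thin wrapper over a NumPy array:
# construction stores the values, and all arithmetic is delegated
# to the underlying array.
values = np.array([0.1, 0.2, 0.3, 0.4, 0.5])

# Elementwise arithmetic is handled by NumPy broadcasting.
doubled = values * 2

# Reductions are likewise delegated to the array.
total = values.sum()
```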

How does UDF Pyspark work?

A PySpark UDF is a User Defined Function used to create reusable functions in Spark. Once a UDF is created, it can be re-used on multiple DataFrames and in SQL (after registering). The default return type of udf() is StringType. You need to handle nulls explicitly, otherwise you will see side effects.
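The null-handling point matters because a UDF body receives a plain Python None for a SQL NULL. A minimal sketch of a guarded function (the guard itself is plain Python; the list conversion stands in for Vectors.dense, and the function name is hypothetical):

```python
# Hypothetical helper: check for SQL NULL (Python None) before touching
# the value, so the UDF returns NULL instead of raising a TypeError.
def list2vec_safe(xs):
    if xs is None:
        return None  # propagate NULL
    # Stand-in for Vectors.dense(xs): coerce each element to float.
    return [float(x) for x in xs]
```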

What is VectorUDT?

public class VectorUDT extends UserDefinedType<Vector> is a user-defined type for Vector which allows easy interaction with SQL via DataFrame.


1 Answer

You can use Vectors and VectorUDT with a UDF:

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F

ud_f = F.udf(lambda r: Vectors.dense(r), VectorUDT())
df = df.withColumn('b', ud_f('a'))
df.show()
+-------------------------+---------------------+
|a                        |b                    |
+-------------------------+---------------------+
|[0.1, 0.2, 0.3, 0.4, 0.5]|[0.1,0.2,0.3,0.4,0.5]|
+-------------------------+---------------------+

df.printSchema()
root
  |-- a: array (nullable = true)
  |    |-- element: double (containsNull = true)
  |-- b: vector (nullable = true)

For more on VectorUDT, see http://spark.apache.org/docs/2.2.0/api/python/_modules/pyspark/ml/linalg.html

Suresh answered Sep 24 '22