I want to change List to Vector in pySpark, and then use this column to Machine Learning model for training. But my spark version is 1.6.0, which does not have VectorUDT(). So what type should I return in my udf function?
from pyspark.sql import SQLContext
from pyspark import SparkContext, SparkConf
from pyspark.sql.functions import *
from pyspark.mllib.linalg import DenseVector
from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import *
conf = SparkConf().setAppName('rank_test')
sc = SparkContext(conf=conf)
spark = SQLContext(sc)
df = spark.createDataFrame([[[0.1,0.2,0.3,0.4,0.5]]],['a'])
print '???'
df.show()
def list2vec(column):
print '?????',column
return Vectors.dense(column)
getVector = udf(lambda y: list2vec(y),DenseVector() )
df.withColumn('b',getVector(col('a'))).show()
I have tried many Types , and this DenseVector() give me error:
Traceback (most recent call last):
File "t.py", line 21, in <module>
getVector = udf(lambda y: list2vec(y),DenseVector() )
TypeError: __init__() takes exactly 2 arguments (1 given)
Help me, please.
A dense vector represented by a value array. We use numpy array for storage and arithmetics will be delegated to the underlying numpy array.
PySpark UDF is a User Defined Function that is used to create a reusable function in Spark. Once UDF created, that can be re-used on multiple DataFrames and SQL (after registering). The default type of the udf() is StringType. You need to handle nulls explicitly otherwise you will see side-effects.
public class VectorUDT extends UserDefinedType<Vector> User-defined type for Vector which allows easy interaction with SQL via DataFrame .
You can use vectors and VectorUDT with UDF,
from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F
ud_f = F.udf(lambda r : Vectors.dense(r),VectorUDT())
df = df.withColumn('b',ud_f('a'))
df.show()
+-------------------------+---------------------+
|a |b |
+-------------------------+---------------------+
|[0.1, 0.2, 0.3, 0.4, 0.5]|[0.1,0.2,0.3,0.4,0.5]|
+-------------------------+---------------------+
df.printSchema()
root
|-- a: array (nullable = true)
| |-- element: double (containsNull = true)
|-- b: vector (nullable = true)
About VectorUDT, http://spark.apache.org/docs/2.2.0/api/python/_modules/pyspark/ml/linalg.html
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With