What type should a dense vector be when using a UDF in PySpark? [duplicate]

I want to convert a list column to a vector column in PySpark and then use that column to train a machine learning model. My Spark version is 1.6.0, which does not have VectorUDT(). What return type should I declare for my UDF?

from pyspark.sql import SQLContext
from pyspark import SparkContext, SparkConf
from pyspark.sql.functions import *
from pyspark.mllib.linalg import DenseVector
from pyspark.mllib.linalg import Vectors
from pyspark.sql.types import *


conf = SparkConf().setAppName('rank_test')
sc = SparkContext(conf=conf)
spark = SQLContext(sc)


df = spark.createDataFrame([[[0.1,0.2,0.3,0.4,0.5]]],['a'])
print '???'
df.show()
def list2vec(column):
    print '?????',column
    return Vectors.dense(column)
getVector = udf(lambda y: list2vec(y),DenseVector() )
df.withColumn('b',getVector(col('a'))).show()

I have tried many types, and DenseVector() gives me this error:

Traceback (most recent call last):
  File "t.py", line 21, in <module>
    getVector = udf(lambda y: list2vec(y),DenseVector() )
TypeError: __init__() takes exactly 2 arguments (1 given)

Help me, please.

asked Apr 03 '18 06:04

nick_liu

1 Answer

You can use Vectors and VectorUDT with a UDF:

from pyspark.ml.linalg import Vectors, VectorUDT
from pyspark.sql import functions as F

ud_f = F.udf(lambda r : Vectors.dense(r),VectorUDT())
df = df.withColumn('b',ud_f('a'))
df.show()
+-------------------------+---------------------+
|a                        |b                    |
+-------------------------+---------------------+
|[0.1, 0.2, 0.3, 0.4, 0.5]|[0.1,0.2,0.3,0.4,0.5]|
+-------------------------+---------------------+

df.printSchema()
root
  |-- a: array (nullable = true)
  |    |-- element: double (containsNull = true)
  |-- b: vector (nullable = true)

For more on VectorUDT, see http://spark.apache.org/docs/2.2.0/api/python/_modules/pyspark/ml/linalg.html

answered Sep 24 '22 12:09

Suresh

Tags:

python

machine-learning

apache-spark

pyspark

apache-spark-mllib
