
VectorAssembler output only to DenseVector?

One behavior of VectorAssembler is causing me trouble. I am combining a set of columns into a single vector column and then applying StandardScaler to scale the assembled features. However, it seems that Spark, for memory reasons, decides whether to represent each row of features as a DenseVector or a SparseVector. When I then use StandardScaler, SparseVector input is rejected; only DenseVectors are accepted. Does anybody know a solution to that?

Edit: I decided to just use a UDF instead, which converts each sparse vector into a dense vector. Kind of silly, but it works.

ml_0x asked Mar 07 '16 12:03


1 Answer

You're right that VectorAssembler chooses dense vs sparse output format based on whichever one uses less memory.

You don't need a UDF to convert a SparseVector to a DenseVector; just use the toArray() method:

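from pyspark.ml.linalg import SparseVector, DenseVector
# Sparse vector of size 4 with non-zero values at indices 1 and 3
a = SparseVector(4, [1, 3], [3.0, 4.0])
# toArray() returns all 4 components as a NumPy array, zeros included
b = DenseVector(a.toArray())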

Also, StandardScaler accepts SparseVector input unless you set withMean=True at creation. If you do need to center the data, a (presumably non-zero) mean has to be subtracted from every component, so the sparse vector won't be sparse any more.
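
To make the last point concrete, here is a minimal plain-Python sketch (no Spark involved) of what centering does to a sparse vector. The `center` and `count_nonzero` helpers and the per-feature `means` values are hypothetical, made up purely for illustration; this is not how Spark implements StandardScaler internally.

```python
# Plain-Python illustration: a sparse vector is stored as its size plus
# only the non-zero entries (index -> value).

def center(size, entries, means):
    """Subtract the per-feature mean from every component, implicit zeros included."""
    dense = [0.0] * size
    for index, value in entries.items():
        dense[index] = value
    return [x - m for x, m in zip(dense, means)]

def count_nonzero(values):
    """Number of components that would have to be stored explicitly."""
    return sum(1 for x in values if x != 0.0)

# Same vector as above: size 4, non-zeros at indices 1 and 3.
size, entries = 4, {1: 3.0, 3: 4.0}

# Hypothetical per-feature means (assumed values, not computed from real data).
means = [0.5, 1.5, 0.25, 2.0]

centered = center(size, entries, means)
print(count_nonzero([0.0, 3.0, 0.0, 4.0]))  # non-zeros before centering
print(centered)
print(count_nonzero(centered))              # non-zeros after centering
```

Before centering only two of the four components need to be stored; afterwards every component is generally non-zero, so a sparse representation no longer saves anything.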

max answered Nov 04 '22 09:11