VectorAssembler has one very annoying behavior. I am combining a set of columns into a single vector column and then applying StandardScaler to scale the features. However, it seems that Spark, for memory reasons, decides on its own whether to represent each row of features as a DenseVector or a SparseVector. But StandardScaler seems to reject SparseVector input and accept only DenseVectors. Does anybody know a solution to that?
Edit: I decided to just use a UDF instead, which turns the sparse vector into a dense vector. Kind of silly, but it works.
You're right that VectorAssembler chooses a dense or sparse output format based on whichever one uses less memory.
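To make the "whichever uses less memory" rule concrete, here is a rough sketch in plain Python. The byte counts are an assumption based on my recollection of Spark's vector compression logic (one 8-byte double per slot for dense, a 4-byte index plus an 8-byte value per non-zero entry for sparse, plus a small fixed overhead each). The function name `smaller_representation` is made up for illustration and is not part of Spark.

```python
def smaller_representation(size: int, nnz: int) -> str:
    """Return which layout needs fewer bytes under the assumed costs.

    size: total number of components in the vector
    nnz:  number of non-zero components
    """
    # Assumed cost model (approximate, not taken from official docs):
    # dense  = 8 bytes per component + 8 bytes overhead
    # sparse = 12 bytes per non-zero (index + value) + 20 bytes overhead
    dense_bytes = 8 * size + 8
    sparse_bytes = 12 * nnz + 20
    return "sparse" if sparse_bytes < dense_bytes else "dense"


# A 4-component row with 2 non-zeros: 40 bytes dense vs 44 bytes sparse.
print(smaller_representation(4, 2))
# A 100-component row with 3 non-zeros: 808 bytes dense vs 56 bytes sparse.
print(smaller_representation(100, 3))
```

Under this model the choice is made per row, which would explain why a single assembled column can end up containing a mix of both vector types.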
You don't need a UDF to convert from SparseVector to DenseVector; just use the toArray() method:
from pyspark.ml.linalg import SparseVector, DenseVector
a = SparseVector(4, [1, 3], [3.0, 4.0])
b = DenseVector(a.toArray())
Also, StandardScaler accepts SparseVector unless you set withMean=True at creation. If you do need to de-mean, you have to subtract a (presumably non-zero) number from every component, so the sparse vector won't be sparse any more.
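The sparsity argument can be checked with a few lines of plain Python. This is only an illustration of the arithmetic, not Spark code: the sample `column` is made up, and the two list comprehensions stand in for what scaling without and with centering does to one feature.

```python
from statistics import mean, stdev

# One hypothetical feature column across five rows; most entries are zero.
column = [0.0, 0.0, 3.0, 0.0, 4.0]


def count_nonzero(values):
    """Count the entries that would have to be stored in a sparse layout."""
    return sum(1 for v in values if v != 0.0)


mu = mean(column)
sigma = stdev(column)

# Scaling only (analogous to withMean=False): zeros stay zero.
scaled_only = [v / sigma for v in column]

# Centering then scaling (analogous to withMean=True): every zero
# becomes -mu / sigma, which is non-zero whenever the mean is non-zero.
centered = [(v - mu) / sigma for v in column]

print(count_nonzero(column), count_nonzero(scaled_only), count_nonzero(centered))
```

Dividing by the standard deviation keeps the two non-zero entries and nothing else, while subtracting the mean first leaves all five entries non-zero, so nothing is gained by keeping the result sparse.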