Given my pyspark Row object:
>>> row
Row(clicked=0, features=SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}))
>>> row.clicked
0
>>> row.features
SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752})
>>> type(row.features)
<class 'pyspark.ml.linalg.SparseVector'>
However, row.features fails the isinstance(row.features, Vector) test:
>>> isinstance(SparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752}), Vector)
True
>>> isinstance(row.features, Vector)
False
>>> isinstance(deepcopy(row.features), Vector)
False
This strange behavior is causing me real trouble: without passing isinstance(row.features, Vector), I cannot generate a LabeledPoint using a map function. I would be really grateful if anyone can solve this problem.
It is unlikely to be an error. You didn't provide the code required to reproduce the issue, but most likely you are using Spark 2.0 with ML transformers and comparing the wrong entities.
Let's illustrate that with an example. Start with some simple data:
from pyspark.ml.feature import OneHotEncoder
row = OneHotEncoder(inputCol="x", outputCol="features").transform(
    sc.parallelize([(1.0, )]).toDF(["x"])
).first()
Now let's import the different vector classes:
from pyspark.ml.linalg import Vector as MLVector, Vectors as MLVectors
from pyspark.mllib.linalg import Vector as MLLibVector, Vectors as MLLibVectors
from pyspark.mllib.regression import LabeledPoint
and make tests:
isinstance(row.features, MLLibVector)
False
isinstance(row.features, MLVector)
True
As you can see, what we have is a pyspark.ml.linalg.Vector, not a pyspark.mllib.linalg.Vector, and the former is not compatible with the old API:
LabeledPoint(0.0, row.features)
TypeError Traceback (most recent call last)
...
TypeError: Cannot convert type <class 'pyspark.ml.linalg.SparseVector'> into Vector
You could convert ML object to MLLib one:
from pyspark.ml import linalg as ml_linalg
def as_mllib(v):
    if isinstance(v, ml_linalg.SparseVector):
        return MLLibVectors.sparse(v.size, v.indices, v.values)
    elif isinstance(v, ml_linalg.DenseVector):
        return MLLibVectors.dense(v.toArray())
    else:
        raise TypeError("Unsupported type: {0}".format(type(v)))
LabeledPoint(0, as_mllib(row.features))
LabeledPoint(0.0, (1,[],[]))
or simply:
LabeledPoint(0, MLLibVectors.fromML(row.features))
LabeledPoint(0.0, (1,[],[]))
but generally speaking you should avoid situations when it is necessary.
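The conversion works because both vector classes carry the same underlying data: a size plus parallel index/value arrays for the sparse case, so converting is just re-wrapping, not recomputing. Here is a minimal plain-Python sketch of that bookkeeping (no Spark required; MLSparseVector, MLLibSparseVector, and as_sketch_mllib are hypothetical stand-ins for the real pyspark classes):

```python
# Hypothetical stand-ins illustrating the shared size/indices/values layout;
# the real classes live in pyspark.ml.linalg and pyspark.mllib.linalg.
class MLSparseVector:
    def __init__(self, size, index_value_map):
        self.size = size
        self.indices = sorted(index_value_map)
        self.values = [index_value_map[i] for i in self.indices]

class MLLibSparseVector(MLSparseVector):
    pass  # same layout, but a separate (incompatible) class hierarchy

def as_sketch_mllib(v):
    # A structural copy: no numeric work, just re-wrapping the same triple.
    return MLLibSparseVector(v.size, dict(zip(v.indices, v.values)))

v = MLSparseVector(7, {0: 1.0, 3: 1.0, 6: 0.752})
converted = as_sketch_mllib(v)
print(converted.size, converted.indices, converted.values)
# 7 [0, 3, 6] [1.0, 1.0, 0.752]
```

This is why isinstance checks fail across the two APIs even though the data is identical: the check tests class hierarchy, not content.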
If you just want to convert the SparseVectors in a column from pyspark.ml to pyspark.mllib SparseVectors, you can use MLUtils. Say df is your DataFrame and the column with the SparseVectors is named "features". Then the following few lines accomplish this:
from pyspark.mllib.util import MLUtils
df = MLUtils.convertVectorColumnsFromML(df, "features")
This problem occurred for me when using CountVectorizer from pyspark.ml.feature: I could not create LabeledPoints because of the incompatibility with the SparseVector from pyspark.ml.
I wonder why the latest documentation for CountVectorizer does not use the "new" SparseVector class. Since the classification algorithms need LabeledPoints, this makes no sense to me...
UPDATE: I had misunderstood that the ml library is designed for DataFrame objects, while the mllib library is for RDD objects. The DataFrame data structure is recommended since Spark 2.0, because SparkSession is more convenient than the SparkContext (though it stores a SparkContext object internally) and delivers DataFrames instead of RDDs. I found this post that gave me the "aha" effect: mllib and ml. Thanks Alberto Bonsanto :).
To use e.g. NaiveBayes from mllib, I had to transform my DataFrame into LabeledPoint objects.
But it is much easier to use NaiveBayes from ml, because there you don't need LabeledPoints; you can simply specify the feature and label columns of your DataFrame.
PS: I was struggling with this problem for hours, so I felt I had to post it here :)