I have the following code for linear regression using the pyspark.ml package. However, I get this error message on the last line, when the model is being fit:
IllegalArgumentException: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.
Does anyone have an idea what is missing?
Is there any replacement in pyspark.ml for LabeledPoint in pyspark.mllib?
from pyspark import SparkContext
from pyspark.ml.regression import LinearRegression
from pyspark.mllib.regression import LabeledPoint
import numpy as np
from pandas import *

data = sc.textFile("/FileStore/tables/w7baik1x1487076820914/randomTableSmall.csv")

def parsePoint(line):
    values = [float(x) for x in line.split(',')]
    return LabeledPoint(values[1], [values[0]])

points_df = data.map(parsePoint).toDF()
lr = LinearRegression()
model = lr.fit(points_df, {lr.regParam:0.0})
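The problem is that Spark 2.0 and later ship their own vector classes in pyspark.ml.linalg, and the estimators in pyspark.ml no longer accept the older pyspark.mllib.linalg.VectorUDT. LabeledPoint stores its features as an mllib vector, which is why the fit fails. Build the DataFrame from (label, vector) tuples using pyspark.ml.linalg.Vectors instead of importing anything from mllib.linalg. Here is code that should work for you: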
from pyspark import SparkContext
from pyspark.ml.regression import LinearRegression
from pyspark.ml.linalg import Vectors
import numpy as np

data = sc.textFile("/FileStore/tables/w7baik1x1487076820914/randomTableSmall.csv")

def parsePoint(line):
    values = [float(x) for x in line.split(',')]
    # Return a plain (label, features) tuple; the features are a pyspark.ml vector
    return (values[1], Vectors.dense([values[0]]))

# Name the columns to match LinearRegression's defaults: 'label' and 'features'
points_df = data.map(parsePoint).toDF(['label','features'])
lr = LinearRegression()
model = lr.fit(points_df)
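If the DataFrame has already been built from LabeledPoint, another option is to convert the vector column instead of rewriting the parser. The sketch below assumes Spark 2.0 or later, an existing SparkContext passed in as sc (as in the code above), and the same two-column CSV layout. The names parse_line and fit_from_labeled_points are made up for illustration. It relies on MLUtils.convertVectorColumnsToML from pyspark.mllib.util, which turns mllib vector columns into their pyspark.ml equivalents.

```python
def parse_line(line):
    """Split a CSV line into (label, [feature]), using the column order from above."""
    values = [float(x) for x in line.split(',')]
    return values[1], [values[0]]


def fit_from_labeled_points(sc, path):
    """Hypothetical helper: keep LabeledPoint, then convert the vector column."""
    # Spark imports are kept inside the function so that parse_line can be
    # used and checked without a Spark installation.
    from pyspark.ml.regression import LinearRegression
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.util import MLUtils

    def to_labeled_point(line):
        label, features = parse_line(line)
        return LabeledPoint(label, features)

    # The 'features' column holds mllib vectors at this point
    points_df = sc.textFile(path).map(to_labeled_point).toDF()

    # Convert 'features' from mllib.linalg vectors to ml.linalg vectors
    converted_df = MLUtils.convertVectorColumnsToML(points_df, "features")

    return LinearRegression().fit(converted_df)
```

On a cluster this would be called as fit_from_labeled_points(sc, path) with the path to the CSV file. Building pyspark.ml vectors directly, as in the code above, avoids the extra conversion step and is probably the simpler choice for new code.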
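Newer Spark versions don't accept pyspark.mllib.linalg.VectorUDT in pyspark.ml, and you no longer need to get vectors from mllib.linalg.
Try replacing
from pyspark.mllib.regression import LabeledPoint
with:
from pyspark.ml.linalg import Vectors
and have parsePoint return a (label, Vectors.dense(...)) tuple instead of a LabeledPoint.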
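On the question of a replacement for LabeledPoint: as far as I know, pyspark.ml has no direct equivalent. It works on DataFrames with a label column and a features vector column. A minimal sketch of that approach follows. It assumes a SparkSession named spark is available (Databricks notebooks provide one), that the CSV has no header row so the columns come out as _c0 and _c1, and that the first column is the feature and the second the label, as in the question. fit_from_csv, FEATURE_COLS and LABEL_COL are hypothetical names.

```python
# Assumed layout: no header, first column is the feature, second is the label
FEATURE_COLS = ["_c0"]
LABEL_COL = "_c1"


def fit_from_csv(spark, path):
    """Hypothetical helper: read the CSV as a DataFrame and fit with pyspark.ml only."""
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    # inferSchema=True makes Spark parse the numeric columns as doubles
    raw_df = spark.read.csv(path, inferSchema=True)

    # VectorAssembler packs the feature columns into one pyspark.ml vector column
    assembler = VectorAssembler(inputCols=FEATURE_COLS, outputCol="features")
    points_df = assembler.transform(raw_df).withColumnRenamed(LABEL_COL, "label")

    # LinearRegression reads 'features' and 'label' by default
    return LinearRegression().fit(points_df)
```

This skips the RDD and the hand-written parser entirely, so the mllib vector type never enters the pipeline.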