
Type conversion error from LabeledPoint in pyspark.mllib, for using linear regression model in pyspark.ml

I have the following code for linear regression using the pyspark.ml package. However, I get this error message on the last line, when the model is being fit:

IllegalArgumentException: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.

Does anyone have an idea what is missing? Is there a replacement in pyspark.ml for LabeledPoint from pyspark.mllib?

from pyspark import SparkContext
from pyspark.ml.regression import LinearRegression
from pyspark.mllib.regression import LabeledPoint
import numpy as np
from pandas import *


data = sc.textFile("/FileStore/tables/w7baik1x1487076820914/randomTableSmall.csv")

def parsePoint(line):
    values = [float(x) for x in line.split(',')]
    return LabeledPoint(values[1], [values[0]])


points_df = data.map(parsePoint).toDF()

lr = LinearRegression()

model = lr.fit(points_df, {lr.regParam:0.0})
Hamed asked Feb 14 '17 16:02



2 Answers

The problem is that Spark 2.0 and later ship their own vector classes in pyspark.ml.linalg, so you no longer need the ones from pyspark.mllib.linalg. The estimators in pyspark.ml also reject columns of type org.apache.spark.mllib.linalg.VectorUDT, and that is the type LabeledPoint produces. Here is code that should work for you:

from pyspark.ml.regression import LinearRegression
from pyspark.ml.linalg import Vectors

# sc is the existing SparkContext, as in the question
data = sc.textFile("/FileStore/tables/w7baik1x1487076820914/randomTableSmall.csv")

def parsePoint(line):
    # Column 1 is the label, column 0 is the single feature
    values = [float(x) for x in line.split(',')]
    return (values[1], Vectors.dense([values[0]]))

# Name the columns explicitly: LinearRegression reads 'label' and 'features' by default
points_df = data.map(parsePoint).toDF(['label','features'])

lr = LinearRegression()

model = lr.fit(points_df)
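
If the data does not have to pass through an RDD at all, the same label/features DataFrame can probably be built with the DataFrame CSV reader and VectorAssembler, which emits pyspark.ml.linalg vectors. This is only a sketch: it assumes Spark 2.x or later, an existing SparkSession, and a header-less numeric CSV laid out like the one in the question (feature first, label second). The helper names feature_columns and build_training_df are made up for illustration and are not part of Spark.

```python
def feature_columns(columns, label_col):
    """Return every column name except the label column, preserving order."""
    return [c for c in columns if c != label_col]


def build_training_df(spark, path, label_col="_c1"):
    """Read a header-less numeric CSV and return a (label, features) DataFrame.

    `spark` is an existing SparkSession and `path` is the CSV location.
    Spark names header-less CSV columns _c0, _c1, ... by default.
    """
    # Imported here so feature_columns can be used without Spark installed.
    from pyspark.ml.feature import VectorAssembler
    from pyspark.sql.functions import col

    raw = spark.read.csv(path, inferSchema=True)
    assembler = VectorAssembler(
        inputCols=feature_columns(raw.columns, label_col),
        outputCol="features",
    )
    # VectorAssembler writes pyspark.ml.linalg vectors into 'features'.
    assembled = assembler.transform(raw)
    return assembled.select(col(label_col).cast("double").alias("label"), "features")
```

With that in place, fitting would look like LinearRegression().fit(build_training_df(spark, path)), where path is the CSV file from the question.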
Gaurav Dhama answered Nov 13 '22 11:11



Newer versions of Spark don't accept org.apache.spark.mllib.linalg.VectorUDT in pyspark.ml, and you no longer need to take the vector classes from pyspark.mllib.linalg.

Try replacing

from pyspark.mllib.regression import LabeledPoint

with:

from pyspark.ml.linalg import Vectors
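
Changing the import alone still leaves parsePoint to be rewritten so that it stops returning a LabeledPoint. If the LabeledPoint-based parsing has to stay, one possible alternative is to convert the old-style vector column after the DataFrame is built, using MLUtils.convertVectorColumnsToML from pyspark.mllib.util (available since Spark 2.0). A sketch, assuming the RDD of CSV lines and the column order from the question; parse_values and to_ml_dataframe are hypothetical names.

```python
def parse_values(line):
    """Split one CSV line into (label, [feature]) using the question's column order."""
    values = [float(x) for x in line.split(',')]
    return values[1], [values[0]]


def to_ml_dataframe(lines_rdd):
    """Build a DataFrame from LabeledPoint rows, then convert its vectors for pyspark.ml."""
    # Imported here so parse_values can be used without Spark installed.
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.util import MLUtils

    def to_labeled_point(line):
        label, features = parse_values(line)
        return LabeledPoint(label, features)

    # At this point 'features' still has the mllib VectorUDT type.
    old_df = lines_rdd.map(to_labeled_point).toDF()
    # Rewrite the 'features' column as pyspark.ml.linalg vectors.
    return MLUtils.convertVectorColumnsToML(old_df, "features")
```

The returned DataFrame keeps the label and features column names, so it could be passed to lr.fit in place of points_df.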

HISI answered Nov 13 '22 11:11
