Type conversion error from LabeledPoint in pyspark.mllib, for using linear regression model in pyspark.ml

Question

I have the following linear regression code using the pyspark.ml package. However, I get this error message on the last line, when the model is being fit:

IllegalArgumentException: u'requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7 but was actually org.apache.spark.mllib.linalg.VectorUDT@f71b0bce.

Does anyone have an idea what is missing? Is there a replacement in pyspark.ml for LabeledPoint from pyspark.mllib?

from pyspark import SparkContext
from pyspark.ml.regression import LinearRegression
from pyspark.mllib.regression import LabeledPoint
import numpy as np
from pandas import *


data = sc.textFile("/FileStore/tables/w7baik1x1487076820914/randomTableSmall.csv")

def parsePoint(line):
    values = [float(x) for x in line.split(',')]
    return LabeledPoint(values[1], [values[0]])


points_df = data.map(parsePoint).toDF()

lr = LinearRegression()

model = lr.fit(points_df, {lr.regParam:0.0})

Gaurav Dhama · Accepted Answer

The problem is that newer versions of Spark (2.0 and later) have their own Vectors class in the pyspark.ml.linalg module, so you do not need to get it from pyspark.mllib.linalg. The pyspark.ml estimators in these versions also do not accept columns of type spark.mllib.linalg.VectorUDT. Here is code that should work for you:

from pyspark import SparkContext
from pyspark.ml.regression import LinearRegression
from pyspark.ml.linalg import Vectors
import numpy as np


data = sc.textFile("/FileStore/tables/w7baik1x1487076820914/randomTableSmall.csv")

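def parsePoint(line):
    values = [float(x) for x in line.split(',')]
    # Return a (label, features) tuple; features must be a pyspark.ml.linalg vector
    return (values[1], Vectors.dense([values[0]]))


# Name the columns 'label' and 'features', the defaults LinearRegression looks for
points_df = data.map(parsePoint).toDF(['label','features'])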

lr = LinearRegression()

model = lr.fit(points_df)
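
If you already have an RDD of LabeledPoint objects and would rather not rewrite the parsing step, another option is to convert the old-style vector column after calling toDF(). The sketch below assumes Spark 2.0 or later, where MLUtils.convertVectorColumnsToML is available in pyspark.mllib.util. The helper names parse_line and labeled_points_to_ml_df are made up for illustration, and the Spark imports are deferred into the function only so the parsing helper can be checked without a Spark installation.

```python
def parse_line(line):
    """Split one CSV line into (label, features).

    As in the question, column 1 is the label and column 0 is the only feature.
    """
    values = [float(x) for x in line.split(',')]
    return values[1], [values[0]]


def labeled_points_to_ml_df(lines_rdd):
    """Build a DataFrame whose 'features' column uses pyspark.ml vectors.

    Hypothetical helper: lines_rdd is assumed to be an RDD of CSV strings,
    for example the result of sc.textFile(...), with an active SparkSession.
    """
    # Deferred imports so parse_line stays usable without Spark installed.
    from pyspark.mllib.regression import LabeledPoint
    from pyspark.mllib.util import MLUtils

    points = lines_rdd.map(lambda line: LabeledPoint(*parse_line(line)))
    # toDF() yields a spark.mllib vector column; convert it to the spark.ml type.
    return MLUtils.convertVectorColumnsToML(points.toDF())
```

With the data RDD from the question, this would be used as model = LinearRegression().fit(labeled_points_to_ml_df(data)). Building the rows with pyspark.ml.linalg.Vectors directly, as above, is simpler when you control the parsing code.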

HISI · Answer

Newer versions of Spark don't accept spark.mllib.linalg.VectorUDT in pyspark.ml, and you do not need to get vectors from mllib.linalg.

Try replacing

from pyspark.mllib.regression import LabeledPoint

with:

from pyspark.ml.linalg import Vectors
