I'm just starting out with ML and Apache Spark, so I've been trying out Linear Regression based on the Spark examples. I can't seem to generate a proper model for any data except the sample in the example, and the intercept is always 0.0, regardless of the input data.
I've prepared a simple training data-set based on the function:
y = (2*x1) + (3*x2) + 4
i.e. I would expect the intercept to be 4 and the weights to be (2, 3).
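For reference, a data set like this could be generated with a few lines of plain Scala. This is only a sketch: the object name DummyData and the grid of x1, x2 values are assumptions for illustration, not necessarily how the actual file was produced.

```scala
// Hypothetical generator for the dummy data; the grid of inputs is an assumption.
object DummyData {
  // The target function from the question.
  def label(x1: Double, x2: Double): Double = 2 * x1 + 3 * x2 + 4

  // One CSV line per grid point, in the order y,x1,x2.
  def rows(maxX1: Int, maxX2: Int): Seq[String] =
    for {
      x1 <- 1 to maxX1
      x2 <- 1 to maxX2
    } yield s"${label(x1, x2)},${x1.toDouble},${x2.toDouble}"
}
```

Each line has the label first and the two features after it, which is the order the parsing code further down expects.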
If I run LinearRegressionWithSGD.train(...) on the raw data, the model is:
Model intercept: 0.0, weights: [NaN,NaN]
And the predictions are all NaN:
Features: [1.0,1.0], Predicted: NaN, Actual: 9.0
Features: [1.0,2.0], Predicted: NaN, Actual: 12.0
etc
If I scale the data first, I get:
Model intercept: 0.0, weights: [17.407863391511754,2.463212481736855]
Features: [1.0,1.0], Predicted: 19.871075873248607, Actual: 9.0
Features: [1.0,2.0], Predicted: 22.334288354985464, Actual: 12.0
Features: [1.0,3.0], Predicted: 24.797500836722318, Actual: 15.0
etc
Either I'm doing something wrong, or I don't understand what the output from this model should be, so can anybody suggest where I might be going wrong here?
My code is below:
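import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

// Load and parse the dummy data (y, x1, x2) for y = (2*x1) + (3*x2) + 4
// i.e. intercept should be 4, weights (2, 3)?
// sc is the SparkContext (predefined in spark-shell)
val data = sc.textFile("data/dummydata.txt")
// LabeledPoint is (label, [features])
val parsedData = data.map { line =>
  val parts = line.split(',')
  val label = parts(0).toDouble
  val features = Array(parts(1), parts(2)).map(_.toDouble)
  LabeledPoint(label, Vectors.dense(features))
}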
// Scale the features
val scaler = new StandardScaler(withMean = true, withStd = true)
  .fit(parsedData.map(x => x.features))
val scaledData = parsedData
  .map(x => LabeledPoint(x.label, scaler.transform(x.features)))
// Building the model: SGD = stochastic gradient descent
val numIterations = 1000
val step = 0.2
val model = LinearRegressionWithSGD.train(scaledData, numIterations, step)
println(s">>>> Model intercept: ${model.intercept}, weights: ${model.weights}")
// Evaluate model on training examples
val valuesAndPreds = scaledData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, point.features, prediction)
}
// Print out features, actual and predicted values...
valuesAndPreds.take(10).foreach { case (v, f, p) =>
  println(s"Features: ${f}, Predicted: ${p}, Actual: ${v}")
}
@Noah: Thanks - your advice prompted me to look at this again, and I found some example code here that allows you to generate the intercept and also set other params, such as number of iterations, via the optimiser.
Here's my revised model generation code, which seems to work OK on my dummy data:
// Building the model: SGD = stochastic gradient descent:
// Need to setIntercept = true, and seems only to work with scaled data
val numIterations = 600
val stepSize = 0.1
val algorithm = new LinearRegressionWithSGD()
algorithm.setIntercept(true)
algorithm.optimizer
  .setNumIterations(numIterations)
  .setStepSize(stepSize)
val model = algorithm.run(scaledData)
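Because the model is trained on standardized features, its intercept and weights refer to the scaled inputs rather than to x1 and x2 directly. Here is a minimal sketch of mapping them back, assuming each feature was scaled as (x - mean) / std. The helper name toOriginalScale is hypothetical, and the means and standard deviations would have to be taken from the fitted scaler (I believe StandardScalerModel exposes them as mean and std, but check this against your Spark version).

```scala
object Unscale {
  // Converts a model fitted on (x - mean) / std back to the original feature scale.
  // Returns (intercept, weights) expressed in terms of the unscaled features.
  def toOriginalScale(intercept: Double,
                      weights: Array[Double],
                      mean: Array[Double],
                      std: Array[Double]): (Double, Array[Double]) = {
    // w_j = w'_j / std_j
    val w = weights.zip(std).map { case (wj, sj) => wj / sj }
    // b = b' - sum_j(w_j * mean_j)
    val b = intercept - w.zip(mean).map { case (wj, mj) => wj * mj }.sum
    (b, w)
  }
}
```

If the fit has converged, the values returned for the dummy data should come out close to an intercept of 4 and weights of (2, 3).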
It still seems to need scaled data, rather than the raw data, as input, but that's OK for my purposes here.
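A likely reason the raw data fails is that gradient descent diverges when the step size is too large for the magnitude of the features, so the weights overflow and end up as NaN. The plain-Scala sketch below illustrates the effect with a simplified full-batch gradient descent and a constant step. It is not Spark's implementation, and the object name, the 10 x 10 grid of inputs and the step sizes are all assumptions chosen for illustration.

```scala
object StepSizeDemo {
  // Grid of points for y = 2*x1 + 3*x2 + 4 (the grid itself is an assumption).
  val points: Seq[(Double, Array[Double])] =
    for {
      x1 <- 1 to 10
      x2 <- 1 to 10
    } yield (2.0 * x1 + 3.0 * x2 + 4.0, Array(x1.toDouble, x2.toDouble))

  // Full-batch gradient descent on mean squared error with a constant step.
  // Returns (intercept, weights); the intercept only moves when fitIntercept is true.
  def fit(data: Seq[(Double, Array[Double])], step: Double, iterations: Int,
          fitIntercept: Boolean): (Double, Array[Double]) = {
    val n = data.size
    val dim = data.head._2.length
    var b = 0.0
    var w = Array.fill(dim)(0.0)
    for (_ <- 1 to iterations) {
      // Prediction error for every point under the current parameters.
      val errors = data.map { case (y, x) =>
        b + w.zip(x).map { case (wj, xj) => wj * xj }.sum - y
      }
      val gradW = Array.tabulate(dim) { j =>
        errors.zip(data).map { case (e, (_, x)) => e * x(j) }.sum / n
      }
      val gradB = errors.sum / n
      w = Array.tabulate(dim)(j => w(j) - step * gradW(j))
      if (fitIntercept) b -= step * gradB
    }
    (b, w)
  }

  // Standardize each feature to zero mean and unit (population) standard deviation.
  def standardize(data: Seq[(Double, Array[Double])]): Seq[(Double, Array[Double])] = {
    val n = data.size
    val dim = data.head._2.length
    val mean = Array.tabulate(dim)(j => data.map { case (_, x) => x(j) }.sum / n)
    val std = Array.tabulate(dim) { j =>
      math.sqrt(data.map { case (_, x) => math.pow(x(j) - mean(j), 2) }.sum / n)
    }
    data.map { case (y, x) => (y, Array.tabulate(dim)(j => (x(j) - mean(j)) / std(j))) }
  }
}
```

Under these assumptions, fit(points, 1.0, 1000, fitIntercept = false) ends with NaN weights and an intercept of 0.0, while fit(standardize(points), 0.1, 600, fitIntercept = true) settles at an intercept of about 31.5 (the mean label), with weights expressed in scaled units.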