I was trying to predict a label for every row in a DataFrame, but without using the LogisticRegressionModel's transform method (for reasons of my own). Instead, I tried to compute it manually using the classic hypothesis hθ(x) = 1 / (1 + e^(-θᵀx)).
Note that I took the code from Apache Spark's repository, copying almost everything from the private object BLAS into a public version of it.
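For reference, a minimal dense-only stand-in for that object might look like this (only dot is needed below; Spark's real implementation is optimized and also handles sparse vectors):
import org.apache.spark.ml.linalg.Vector

// Minimal public stand-in for the private BLAS.dot used below.
// Naive element-wise loop; handles the dense case only.
object BLAS {
  def dot(x: Vector, y: Vector): Double = {
    require(x.size == y.size, "vector sizes must match")
    var sum = 0.0
    var i = 0
    while (i < x.size) {
      sum += x(i) * y(i)
      i += 1
    }
    sum
  }
}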
P.S.: I don't use any regParam; I just fitted the model.
// Note that I had to obtain the intercept and coefficients from my model
val intercept = model.intercept
val coefficients = model.coefficients
val margin: Vector => Double = (features) => {
BLAS.dot(features, coefficients) + intercept
}
val score: Vector => Double = (features) => {
val m = margin(features)
1.0 / (1.0 + math.exp(-m))
}
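One way to sanity-check these functions: with an all-zero feature vector the dot product vanishes, so the margin reduces to the intercept and the score should be exactly its sigmoid:
// Sanity check: margin(zeros) == intercept, so score(zeros) == sigmoid(intercept)
val zeros = new DenseVector(Array(0.0, 0.0))
assert(math.abs(score(zeros) - 1.0 / (1.0 + math.exp(-intercept))) < 1e-12)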
After defining those functions and obtaining the model's parameters, I created a UDF to compute the prediction (it receives the same features as a DenseVector). Later I compared my predictions to the real model's ones, and they are very different! So what did I miss? What am I doing wrong?
val predict = udf((v: DenseVector) => {
val recency = v(0)
val frequency = v(1)
val tp = score(new DenseVector(Array(recency, frequency)))
new DenseVector(Array(tp, 1 - tp))
})
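An equivalent, slightly simpler form (hypothetical name predictSimple) would skip the unpacking, since the incoming DenseVector already holds the features:
// Same computation without unpacking and repacking the feature vector
val predictSimple = udf((v: DenseVector) => {
  val tp = score(v)
  new DenseVector(Array(tp, 1 - tp))
})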
// model's predictions
val xf = model.transform(df)
df.select(col("id"), predict(col("features")).as("myprediction"))
.join(xf, df("id") === xf("id"), "inner")
.select(df("id"), col("probability"), col("myprediction"))
.show
+----+--------------------+--------------------+
| id| probability| myprediction|
+----+--------------------+--------------------+
| 31|[0.97579780436514...|[0.98855386037790...|
| 231|[0.97579780436514...|[0.98855386037790...|
| 431|[0.69794428333266...| [1.0,0.0]|
| 631|[0.97579780436514...|[0.98855386037790...|
| 831|[0.97579780436514...|[0.98855386037790...|
|1031|[0.96509616791398...|[0.99917463322937...|
|1231|[0.96509616791398...|[0.99917463322937...|
|1431|[0.96509616791398...|[0.99917463322937...|
|1631|[0.94231815700848...|[0.99999999999999...|
|1831|[0.96509616791398...|[0.99917463322937...|
|2031|[0.96509616791398...|[0.99917463322937...|
|2231|[0.96509616791398...|[0.99917463322937...|
|2431|[0.95353743438055...| [1.0,0.0]|
|2631|[0.94646924057674...| [1.0,0.0]|
|2831|[0.96509616791398...|[0.99917463322937...|
|3031|[0.96509616791398...|[0.99917463322937...|
|3231|[0.95971207153567...|[0.99999999999996...|
|3431|[0.96509616791398...|[0.99917463322937...|
|3631|[0.96509616791398...|[0.99917463322937...|
|3831|[0.96509616791398...|[0.99917463322937...|
+----+--------------------+--------------------+
I even tried defining those functions inside the udf, and it didn't work.
def predict(coefficients: Vector, intercept: Double) = {
udf((v: DenseVector) => {
def margin(features: Vector, coefficients: Vector, intercept: Double): Double = {
BLAS.dot(features, coefficients) + intercept
}
def score(features: Vector, coefficients: Vector, intercept: Double): Double = {
val m = margin(features, coefficients, intercept)
1.0 / (1.0 + math.exp(-m))
}
val recency = v(0)
val frequency = v(1)
val tp = score(new DenseVector(Array(recency, frequency)), coefficients, intercept)
new DenseVector(Array(tp, 1 - tp))
})
}
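This parameterized version is applied by binding the model's parameters first:
// Usage: bind the model parameters once, then apply the resulting UDF
df.select(col("id"), predict(coefficients, intercept)(col("features")).as("myprediction"))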
It's very embarrassing, but the problem was that I used a Pipeline with a MinMaxScaler as a stage, so the dataset was scaled before the model's training. Both coefficients and intercept were therefore tied to that scaled data, and when I computed the prediction with them on raw features, the result was totally biased. To solve this, I simply trained on the unnormalized dataset, so the coefficients and the intercept ended up in the original feature space; after re-executing the code, I got the same result as Spark. I also followed @zero323's advice and moved the margin and score definitions inside the udf's lambda.
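An alternative to refitting on raw data would be to replay the scaler inside the manual prediction. A sketch, assuming the MinMaxScalerModel sits at stage 0 of the fitted PipelineModel (names here are illustrative):
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.feature.MinMaxScalerModel
import org.apache.spark.ml.linalg.DenseVector

// Replay MinMaxScaler's transformation on a raw feature vector so it lands
// in the same scaled space the coefficients were learned in.
def scaleLikePipeline(pipelineModel: PipelineModel, v: DenseVector): DenseVector = {
  val scaler = pipelineModel.stages(0).asInstanceOf[MinMaxScalerModel]
  val lo = scaler.getMin // target range lower bound, 0.0 by default
  val hi = scaler.getMax // target range upper bound, 1.0 by default
  val vals = Array.tabulate(v.size) { i =>
    val range = scaler.originalMax(i) - scaler.originalMin(i)
    // Spark maps constant features to the midpoint of the target range
    val unit = if (range == 0.0) 0.5 else (v(i) - scaler.originalMin(i)) / range
    unit * (hi - lo) + lo
  }
  new DenseVector(vals)
}
The rescaled vector would then be fed to score instead of the raw one.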