I am trying to use Spark MLlib ALS with implicit feedback for collaborative filtering. The input data has only two fields: userId and productId. I have no product ratings, only information on which products users have bought. To train ALS I use:
def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int): MatrixFactorizationModel
(http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$)
This API requires a Rating object:
Rating(user: Int, product: Int, rating: Double)
On the other hand, the documentation for trainImplicit says: Train a matrix factorization model given an RDD of 'implicit preferences' ratings given by users to some products, in the form of (userID, productID, preference) pairs.
When I set the rating/preference to 1, as in:
val ratings = sc.textFile(new File(dir, file).toString).map { line =>
  val fields = line.split(",")
  // format: (randomNumber, Rating(userId, productId, rating))
  (rnd.nextInt(100), Rating(fields(0).toInt, fields(1).toInt, 1.0))
}
val training = ratings.filter(x => x._1 < 60)
  .values
  .repartition(numPartitions)
  .cache()
val validation = ratings.filter(x => x._1 >= 60 && x._1 < 80)
  .values
  .repartition(numPartitions)
  .cache()
val test = ratings.filter(x => x._1 >= 80).values.cache()
And then train ALS:
val model = ALS.trainImplicit(training, rank, numIter)
I get an RMSE of 0.9, which is a large error when preferences only take the values 0 or 1:
val validationRmse = computeRmse(model, validation, numValidation)
/** Compute RMSE (Root Mean Squared Error). */
def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
  val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
  val predictionsAndRatings = predictions.map(x => ((x.user, x.product), x.rating))
    .join(data.map(x => ((x.user, x.product), x.rating)))
    .values
  math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
}
So my question is: to what value should I set rating in:
Rating(user: Int, product: Int, rating: Double)
for implicit training (in the ALS.trainImplicit method)?
Update
With:
val alpha = 40
val lambda = 0.01
I get:
Got 1895593 ratings from 17471 users on 462685 products.
Training: 1136079, validation: 380495, test: 379019
RMSE (validation) = 0.7537217888106758 for the model trained with rank = 8 and numIter = 10.
RMSE (validation) = 0.7489005441881798 for the model trained with rank = 8 and numIter = 20.
RMSE (validation) = 0.7387672873747732 for the model trained with rank = 12 and numIter = 10.
RMSE (validation) = 0.7310003522283959 for the model trained with rank = 12 and numIter = 20.
The best model was trained with rank = 12, and numIter = 20, and its RMSE on the test set is 0.7302343904091481.
baselineRmse: 0.0 testRmse: 0.7302343904091481
The best model improves the baseline by -Infinity%.
That is still a large error, I guess. I also get a strange baseline improvement, where the baseline model simply predicts the mean (1).
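The -Infinity in the output looks like a division by zero rather than a modelling problem. The post does not show how the percentage is computed, so the following is only a minimal sketch, assuming the formula (baselineRmse - testRmse) / baselineRmse * 100; the object and method names are hypothetical.

```scala
object BaselineImprovement {
  // Assumed formula: the original post does not show how the percentage is computed.
  def improvementPercent(baselineRmse: Double, testRmse: Double): Double =
    (baselineRmse - testRmse) / baselineRmse * 100

  // RMSE of a constant prediction against a sequence of observed values.
  def constantRmse(prediction: Double, observed: Seq[Double]): Double =
    math.sqrt(observed.map(r => (prediction - r) * (prediction - r)).sum / observed.size)
}
```

Because every rating in the data is 1.0, the mean is 1.0 and the mean-predicting baseline has an RMSE of exactly 0.0. Under the assumed formula, dividing a negative number by 0.0 yields -Infinity for Double values, so the baseline comparison carries no information here.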
Apache Spark ML implements alternating least squares (ALS) for collaborative filtering, a very popular algorithm for making recommendations. The ALS recommender is a matrix factorization algorithm that uses Alternating Least Squares with Weighted-Lambda-Regularization (ALS-WR).
The most important hyperparameters in Alternating Least Squares (ALS) are:
maxIter: the maximum number of iterations to run (defaults to 10)
rank: the number of latent factors in the model (defaults to 10)
regParam: the regularization parameter in ALS (defaults to 0.1)
lambda specifies the regularization parameter in ALS. implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data. alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations.
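For the implicit variant, Spark's documentation points to the formulation of Hu, Koren and Volinsky. There the value passed as rating is not a score to be reproduced but an observation strength r: the model fits a binary preference p (1 if r > 0, otherwise 0) and weights each observation with a confidence c = 1 + alpha * r. A minimal sketch of that mapping follows; the object and method names are hypothetical and this is not Spark's internal code.

```scala
object ImplicitFeedback {
  // Binary preference the model tries to fit: 1 if any interaction was observed.
  def preference(r: Double): Double = if (r > 0) 1.0 else 0.0

  // Confidence weight attached to an observation of strength r.
  def confidence(r: Double, alpha: Double): Double = 1.0 + alpha * r
}
```

Under this reading, setting rating to 1.0 for every purchase (or to a purchase count, if one is available) is consistent with the model. With alpha = 40 a single purchase gets confidence 41, while with alpha = 0.01 it gets about 1.01, so alpha mainly controls how much more observed pairs count than unobserved ones.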
The alternating least squares (ALS) algorithm factorizes a given matrix R into two factors U and V such that R ≈ U^T V. The unknown row dimension is given as a parameter to the algorithm and is called latent factors.
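To make the factorization concrete: each user and each product gets a factor vector whose length is the number of latent factors, and the reconstructed entry of R for a (user, product) pair is the dot product of the two vectors. A minimal sketch using plain Scala arrays rather than Spark types (the names are hypothetical):

```scala
object Factorization {
  // Predicted entry R(i, j) as the dot product of user factor u_i and product factor v_j.
  def predict(userFactors: Array[Double], productFactors: Array[Double]): Double = {
    require(userFactors.length == productFactors.length, "both vectors must have the same length")
    userFactors.zip(productFactors).map { case (u, v) => u * v }.sum
  }
}
```

For example, Factorization.predict(Array(1.0, 2.0), Array(3.0, 4.0)) multiplies the vectors element-wise and sums the products.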
You can specify the alpha confidence parameter. The default is 1.0, but try a lower value:
val alpha = 0.01
val lambda = 0.01 // regularization parameter; trainImplicit takes it before alpha
val model = ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha)
Let us know how that goes.