How to set preferences for ALS implicit feedback in Collaborative Filtering?

Tags:

machine-learning

scala

apache-spark

collaborative-filtering

I am trying to use Spark MLlib ALS with implicit feedback for collaborative filtering. The input data has only two fields, userId and productId. I have no product ratings, only a record of which products each user has bought. To train ALS I use:

def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int): MatrixFactorizationModel

(http://spark.apache.org/docs/1.0.0/api/scala/index.html#org.apache.spark.mllib.recommendation.ALS$)

This API requires a Rating object:

Rating(user: Int, product: Int, rating: Double)

On the other hand, the documentation for trainImplicit says: Train a matrix factorization model given an RDD of 'implicit preferences' ratings given by users to some products, in the form of (userID, productID, preference) pairs.

When I set the rating / preference to 1, as in:

val ratings = sc.textFile(new File(dir, file).toString).map { line =>
  val fields = line.split(",")
  // format: (randomNumber, Rating(userId, productId, rating))
  (rnd.nextInt(100), Rating(fields(0).toInt, fields(1).toInt, 1.0))
}

val training = ratings.filter(x => x._1 < 60)
  .values
  .repartition(numPartitions)
  .cache()
val validation = ratings.filter(x => x._1 >= 60 && x._1 < 80)
  .values
  .repartition(numPartitions)
  .cache()
val test = ratings.filter(x => x._1 >= 80).values.cache()

And then train ALS:

val model = ALS.trainImplicit(training, rank, numIter)

I get an RMSE of 0.9, which is a large error when preferences take only the values 0 or 1:

val validationRmse = computeRmse(model, validation, numValidation)

/** Compute RMSE (Root Mean Squared Error). */
def computeRmse(model: MatrixFactorizationModel, data: RDD[Rating], n: Long): Double = {
  val predictions: RDD[Rating] = model.predict(data.map(x => (x.user, x.product)))
  val predictionsAndRatings = predictions.map(x => ((x.user, x.product), x.rating))
    .join(data.map(x => ((x.user, x.product), x.rating)))
    .values
  math.sqrt(predictionsAndRatings.map(x => (x._1 - x._2) * (x._1 - x._2)).reduce(_ + _) / n)
}
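
One thing worth checking in this helper: it divides by n, the size of the validation set, while the sum runs only over pairs that survive the join. As far as I can tell, model.predict returns no row for a user or product that was absent from training, so the two counts can differ. A minimal plain-Scala sketch (no Spark; `RmseSketch` and `rmse` are hypothetical names for illustration) that divides by the number of matched pairs instead:

```scala
// Illustrative only: RMSE over the pairs that were actually matched,
// rather than over an externally supplied count.
object RmseSketch {
  /** Each pair is (prediction, actual). */
  def rmse(pairs: Seq[(Double, Double)]): Double = {
    require(pairs.nonEmpty, "need at least one matched pair")
    val sumSq = pairs.map { case (pred, actual) =>
      (pred - actual) * (pred - actual)
    }.sum
    math.sqrt(sumSq / pairs.size)
  }
}
```

In the Spark version, the equivalent change would be to divide by predictionsAndRatings.count() rather than n.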

So my question is: to what value should I set rating in:

Rating(user: Int, product: Int, rating: Double)

for implicit training (in the ALS.trainImplicit method)?

Update

With:

  val alpha = 40
  val lambda = 0.01

I get:

Got 1895593 ratings from 17471 users on 462685 products.
Training: 1136079, validation: 380495, test: 379019
RMSE (validation) = 0.7537217888106758 for the model trained with rank = 8 and numIter = 10.
RMSE (validation) = 0.7489005441881798 for the model trained with rank = 8 and numIter = 20.
RMSE (validation) = 0.7387672873747732 for the model trained with rank = 12 and numIter = 10.
RMSE (validation) = 0.7310003522283959 for the model trained with rank = 12 and numIter = 20.
The best model was trained with rank = 12, and numIter = 20, and its RMSE on the test set is 0.7302343904091481.
baselineRmse: 0.0 testRmse: 0.7302343904091481
The best model improves the baseline by -Infinity%.

Which is still a large error, I guess. I also get a strange baseline improvement, where the baseline model simply predicts the mean rating (1).
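
The -Infinity figure is probably arithmetic rather than a modelling problem. Every rating is 1, so a baseline that predicts the mean has an RMSE of exactly 0, and a relative improvement computed against it divides by zero. The sketch below assumes the improvement is computed as (baseline - model) / baseline * 100, which the output above does not show; `BaselineSketch` and its methods are hypothetical names.

```scala
// Illustrative only: why a zero baseline RMSE yields -Infinity.
object BaselineSketch {
  /** Percentage improvement of the model RMSE over the baseline RMSE. */
  def improvement(baselineRmse: Double, modelRmse: Double): Double =
    (baselineRmse - modelRmse) / baselineRmse * 100

  /** Variant that returns None when the baseline RMSE is 0. */
  def safeImprovement(baselineRmse: Double, modelRmse: Double): Option[Double] =
    if (baselineRmse == 0.0) None else Some(improvement(baselineRmse, modelRmse))
}
```

With baselineRmse = 0.0 and any positive test RMSE, improvement evaluates to negative infinity under Double arithmetic, so a mean baseline carries no information when all ratings are identical.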


asked Dec 26 '14 15:12

zork

1 Answer

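You can specify the alpha confidence parameter. The default is 1.0, but try a lower value. Note that the trainImplicit overload that accepts alpha also takes lambda, and lambda comes first:

val alpha = 0.01
val lambda = 0.01
val model = ALS.trainImplicit(ratings, rank, numIterations, lambda, alpha)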

Let us know how that goes.
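
For context on why alpha matters when every rating is 1: MLlib's implicit ALS is based on the formulation of Hu, Koren and Volinsky, in which the raw value r is not fitted directly. As I understand it, r is mapped to a binary preference (1 if r > 0, otherwise 0) and to a confidence weight of 1 + alpha * |r|. The sketch below is plain Scala, not Spark code; `ImplicitFeedback`, `preference` and `confidence` are hypothetical names used only to illustrate that mapping.

```scala
// Illustrative only: the preference/confidence mapping described by
// Hu, Koren and Volinsky. This is not code taken from Spark.
object ImplicitFeedback {
  /** Binary preference: 1.0 if a positive interaction was observed. */
  def preference(r: Double): Double = if (r > 0) 1.0 else 0.0

  /** Confidence in that preference; grows linearly with the raw value. */
  def confidence(r: Double, alpha: Double): Double = 1.0 + alpha * math.abs(r)
}

object ImplicitFeedbackDemo {
  def main(args: Array[String]): Unit = {
    // A single purchase encoded as rating = 1.0, under two alpha values.
    println(ImplicitFeedback.preference(1.0))
    println(ImplicitFeedback.confidence(1.0, 0.01))
    println(ImplicitFeedback.confidence(1.0, 40.0))
  }
}
```

Under this reading, setting rating to 1.0 for each purchase is a reasonable encoding, and alpha is the knob that controls how heavily observed purchases are weighted against unobserved user/product pairs.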

answered Sep 22 '22 15:09

WestCoastProjects
