
Apache Spark ALS recommendations approach

I'm trying to build a recommendation system using Spark MLlib's ALS.

Currently, we're trying to pre-build recommendations for all users on a daily basis. We're using simple implicit feedback and ALS.

The problem is, we have 20M users and 30M products, and to call the main predict() method we need to build the cartesian join of users and products, which is huge: generating the join alone may take days. Is there a way to avoid the cartesian join and make the process faster?

Currently we have 8 nodes with 64 GB of RAM each, which I think should be enough for the data.

val users: RDD[Int] = ???           // RDD with 20M userIds
val products: RDD[Int] = ???        // RDD with 30M productIds
val ratings : RDD[Rating] = ???     // RDD with all user->product feedbacks

val model = new ALS().setRank(10).setIterations(10)
  .setLambda(0.0001).setImplicitPrefs(true)
  .setAlpha(40).run(ratings)

val usersProducts = users.cartesian(products)
val recommendations = model.predict(usersProducts)
asked Mar 18 '15 by Aram Mkrtchyan



1 Answer

Not sure if you really need the whole 20M x 30M matrix. If you just want to pre-build recommendations of products per user, try calling recommendProducts(user: Int, num: Int) for each user, limiting yourself to the num strongest recommendations. There is also recommendUsers() for the reverse direction.
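For example, a minimal sketch of that approach, assuming `model` is the MatrixFactorizationModel returned by `ALS.run(ratings)` in the question (the batch method `recommendProductsForUsers` is only available in Spark 1.4+, so it's an assumption if you're on an older version):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}

// `model` comes from the question: new ALS()...run(ratings)
val model: MatrixFactorizationModel = ???

// Keep only the N strongest recommendations instead of scoring
// the full 20M x 30M cartesian product.
val topN = 10

// Per-user, driver-side call (the method named in the answer);
// 42 is just an example userId:
val forOneUser: Array[Rating] = model.recommendProducts(42, topN)

// Distributed batch variant over all users (Spark 1.4+),
// yielding the top-N products for every user in one job:
val topPerUser: RDD[(Int, Array[Rating])] =
  model.recommendProductsForUsers(topN)
```

This keeps the output at 20M users x N rows instead of 20M x 30M predictions, which is what makes the daily pre-build feasible.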

answered Oct 20 '22 by stholzm
