I'm trying to build a recommendation system using Spark MLlib's ALS.
Currently, we pre-build recommendations for all users on a daily basis, using simple implicit feedback and ALS.
The problem is that we have 20M users and 30M products, and to call the main predict() method we need the Cartesian join of users and products, which is huge; generating the join alone may take days. Is there a way to avoid the Cartesian join and make the process faster?
Currently we have 8 nodes with 64GB of RAM each; I think that should be enough for the data.
val users: RDD[Int] = ???       // RDD with 20M userIds
val products: RDD[Int] = ???    // RDD with 30M productIds
val ratings: RDD[Rating] = ???  // RDD with all user->product feedback

val model = new ALS()
  .setRank(10)
  .setIterations(10)
  .setLambda(0.0001)
  .setImplicitPrefs(true)
  .setAlpha(40)
  .run(ratings)

// This Cartesian join of 20M x 30M pairs is the bottleneck.
val usersProducts = users.cartesian(products)
val recommendations = model.predict(usersProducts)
Not sure you really need the whole 20M x 30M matrix. If you just want to pre-build the top recommendations per user, try recommendProducts(user: Int, num: Int) for each user, limiting yourself to the num strongest recommendations. There is also recommendUsers(product: Int, num: Int) for the reverse direction.
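If you are on Spark 1.4 or later, MatrixFactorizationModel also offers a distributed batch variant, recommendProductsForUsers(num: Int), which computes the top-N products for every user in one job without materializing the full Cartesian join. A minimal sketch, assuming `model` is the trained model from your question and that top 20 per user is enough:

```scala
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

val topN = 20  // assumption: 20 recommendations per user suffice

// One distributed call: multiplies the user and product factor blocks
// and keeps only the topN highest-scoring products per user.
val perUser: RDD[(Int, Array[Rating])] = model.recommendProductsForUsers(topN)

// Flatten to plain Ratings if downstream code expects that shape.
val recommendations: RDD[Rating] = perUser.flatMap { case (_, recs) => recs }
```

This keeps everything on the cluster (unlike calling recommendProducts per user on the driver) and bounds the output to 20M x topN rows instead of 20M x 30M.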