Given a MatrixFactorizationModel, what would be the most efficient way to return the full matrix of user-product predictions (in practice, filtered by some threshold to maintain sparsity)?
Via the current API, one could pass a cartesian product of users and products to the predict function, but it seems to me that this will do a lot of extra processing.
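For concreteness, a minimal sketch of that cartesian-plus-predict approach (the helper name and `threshold` are placeholders, and `model` is assumed to be an already-trained MatrixFactorizationModel):

```scala
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// Naive approach: build every (user, product) pair and push it through predict.
// `threshold` is the sparsity cutoff; the cartesian shuffle is the costly part.
def predictAllViaCartesian(model: MatrixFactorizationModel,
                           threshold: Double): RDD[Rating] = {
  val userIds    = model.userFeatures.map { case (id, _) => id }
  val productIds = model.productFeatures.map { case (id, _) => id }

  val allPairs: RDD[(Int, Int)] = userIds.cartesian(productIds)

  model.predict(allPairs).filter(_.rating >= threshold)
}
```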
Would accessing the private userFeatures and productFeatures be the correct approach, and if so, is there a good way to take advantage of other aspects of the framework to distribute this computation efficiently? Specifically, is there an easy way to do better than multiplying all pairs of userFeature and productFeature vectors "by hand"?
Spark 1.1 has a recommendProducts method that can be mapped to each user ID. This is better than nothing, but not really optimized for recommending to all users.
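Roughly, that per-user mapping looks like this (the helper name and top-K size are arbitrary; `model` is assumed to be a trained MatrixFactorizationModel). Note that each recommendProducts call is a driver-side operation, which is why this loops over users one at a time rather than running as a single distributed job:

```scala
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}

// Call recommendProducts once per user ID collected to the driver.
def recommendForAllUsers(model: MatrixFactorizationModel,
                         k: Int): Map[Int, Array[Rating]] = {
  val userIds: Array[Int] = model.userFeatures.map { case (id, _) => id }.collect()
  userIds.map(id => id -> model.recommendProducts(id, k)).toMap
}
```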
I would double-check that you really mean to make recommendations for everyone; at scale, this is inherently a big slow operation. Consider predicting for users that have been recently active only.
Otherwise, yes, your best bet is to create your own method. The cartesian join of the feature RDDs is probably too slow, as it shuffles so many copies of the feature vectors. Choose the larger of the user / product feature sets and map over that; in each worker, hold the other feature set in memory. If that isn't feasible, you can make this more complex and map several times against subsets of the smaller RDD held in memory. A sketch of this approach follows below.
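Something along these lines, assuming the product features are the smaller side and fit in memory on each worker (`threshold` and the method name are placeholders, not part of the MLlib API):

```scala
import org.apache.spark.mllib.recommendation.{MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

// Score every user-product pair by broadcasting the smaller feature set
// and mapping over the larger one, keeping only scores above `threshold`.
def scoreAllPairs(model: MatrixFactorizationModel, threshold: Double): RDD[Rating] = {
  val sc = model.productFeatures.sparkContext

  // Collect the smaller feature set once and broadcast it, instead of
  // shuffling copies of it around in a cartesian join.
  val productFeatures = sc.broadcast(model.productFeatures.collect())

  // Map over the larger (user) feature set; each worker scores every product
  // against its users using the in-memory broadcast copy.
  model.userFeatures.flatMap { case (userId, userVec) =>
    productFeatures.value.iterator.flatMap { case (productId, productVec) =>
      var dot = 0.0
      var i = 0
      while (i < userVec.length) { dot += userVec(i) * productVec(i); i += 1 }
      if (dot >= threshold) Some(Rating(userId, productId, dot)) else None
    }
  }
}
```

If neither feature set fits in memory, the same idea applies to slices of the smaller RDD: broadcast one slice at a time, map the larger RDD against it, and union the results.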