I have run into the "all-pairs similarity" problem in my recommendation system. Thanks to this Databricks blog, it seems RowMatrix may help.
However, RowMatrix is a matrix type without meaningful row indices, so I don't know how to retrieve the similarity result for specific items i and j after invoking columnSimilarities(threshold).
Below are some details about what I am doing:
1) My data file comes from MovieLens, in a format like this:
user::item::rating
2) I build a RowMatrix in which sparse vector i holds the ratings that all users gave to item i:
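import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

val dataPath = ...
val ratings: RDD[Rating] = sc.textFile(dataPath).map(_.split("::") match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})
// One sparse vector per item; entries are indexed by the 0-based user id.
// userAmount is the total number of users (defined elsewhere).
val rows = ratings.map(rating => (rating.product, (rating.user, rating.rating)))
  .groupByKey()
  .map(p => Vectors.sparse(userAmount, p._2.map(r => (r._1 - 1, r._2)).toSeq))
val mat = new RowMatrix(rows)
val similarities = mat.columnSimilarities(0.5)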
Now I get a CoordinateMatrix, similarities. How can I get the similarity of specific items i and j? Although it can be used to retrieve an RDD[MatrixEntry], I am not sure whether row i and column j correspond to items i and j.
I ran into the same problem and solved it as follows.
Note that columnSimilarities() computes the similarities between the column vectors of the matrix, whereas your "rows" holds one row vector per item. So you need the transpose of "rows"; let's call it "tran_rows". Then compute tran_rows.columnSimilarities().
The rest is easy. In the result of columnSimilarities(), the indices i and j refer to columns i and j of tran_rows, so they correspond exactly to item i and item j, as long as each item sits in the column that matches its index.
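A minimal sketch of this idea, assuming the same user::item::rating lines and 1-based ids. Rather than transposing an existing RowMatrix, it builds the users x items matrix directly from MatrixEntry values, so the columns are items from the start. The names ItemSimilaritySketch, itemSimilarities and similarityOf are made up for illustration; CoordinateMatrix, MatrixEntry, toRowMatrix and columnSimilarities come from spark.mllib.

```scala
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}
import org.apache.spark.rdd.RDD

object ItemSimilaritySketch {
  // Hypothetical helper: rows are users, columns are items.
  // Assumes ids in the input are 1-based, so they are shifted to 0-based indices.
  def itemSimilarities(lines: RDD[String], threshold: Double): CoordinateMatrix = {
    val entries: RDD[MatrixEntry] = lines.map { line =>
      // take(3) tolerates a trailing field such as a timestamp
      val Array(user, item, rate) = line.split("::").take(3)
      MatrixEntry(user.toLong - 1, item.toLong - 1, rate.toDouble)
    }
    // toRowMatrix() drops the row (user) indices, but column (item) indices are kept,
    // and only the columns matter for columnSimilarities.
    new CoordinateMatrix(entries).toRowMatrix().columnSimilarities(threshold)
  }

  // Hypothetical lookup: the result is upper triangular, so only (smaller, larger) is stored.
  // Returns None for i == j and for pairs that have no entry in the result.
  def similarityOf(sims: CoordinateMatrix, itemI: Int, itemJ: Int): Option[Double] = {
    val lo = math.min(itemI, itemJ) - 1L
    val hi = math.max(itemI, itemJ) - 1L
    sims.entries.filter(e => e.i == lo && e.j == hi).map(_.value).collect().headOption
  }
}
```

Usage would look like ItemSimilaritySketch.similarityOf(sims, 3, 7). With a non-zero threshold the values are estimates, and pairs with low similarity may be missing from the result, which is why the lookup returns an Option.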
If the threshold variant does not suit your case, you can use columnSimilarities() on an IndexedRowMatrix. That works very well for me, and it gives you a better way to manage the row indices.
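A rough sketch of the IndexedRowMatrix variant, assuming an RDD[Rating] like the one in the question, 1-based ids, and at most one rating per user and item. Each row is a user and each column is an item, so the result is indexed by item. IndexedSimilaritySketch, exactItemSimilarities and itemAmount (the total number of items) are names invented here; IndexedRow, IndexedRowMatrix and its no-argument columnSimilarities() are from spark.mllib.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, IndexedRow, IndexedRowMatrix}
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

object IndexedSimilaritySketch {
  // Hypothetical helper: one IndexedRow per user, holding that user's ratings by 0-based item index.
  def exactItemSimilarities(ratings: RDD[Rating], itemAmount: Int): CoordinateMatrix = {
    val rows = ratings
      .map(r => (r.user, (r.product - 1, r.rating)))
      .groupByKey()
      .map { case (user, itemRatings) =>
        // The row index keeps the user id attached to the vector.
        IndexedRow(user.toLong - 1, Vectors.sparse(itemAmount, itemRatings.toSeq))
      }
    // No threshold here; entry (i, j) with i < j is the similarity of columns i and j.
    new IndexedRowMatrix(rows).columnSimilarities()
  }
}
```

As with RowMatrix, the returned CoordinateMatrix is upper triangular, so look up a pair with the smaller index first.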