Spark: How can I retrieve an item pair after calculating similarity using RowMatrix?

Question

I have run into the "all-pairs similarity" problem in my recommendation system. According to this Databricks blog, RowMatrix looks like it can help.

However, RowMatrix is a matrix type without meaningful row indices, so I don't know how to retrieve the similarity of a specific item pair i and j after invoking columnSimilarities(threshold).

Below are some details about what I am doing:

1) My data file comes from MovieLens, in this format:

user::item::rating

2) I build a RowMatrix in which each sparse vector i holds the ratings of all users for item i:

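import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

val dataPath = ...
// Parse each "user::item::rating" line into a Rating
val ratings: RDD[Rating] = sc.textFile(dataPath).map(_.split("::") match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})
// One sparse vector per item, indexed by user id - 1 (userAmount = number of users)
val rows = ratings.map(rating => (rating.product, (rating.user, rating.rating)))
  .groupByKey()
  .map(p => Vectors.sparse(userAmount, p._2.map(r => (r._1 - 1, r._2)).toSeq))

val mat = new RowMatrix(rows)

val similarities = mat.columnSimilarities(0.5)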

Now I have a CoordinateMatrix, similarities. How can I get the similarity of specific items i and j? Although I can retrieve an RDD[MatrixEntry] from it, I am not sure whether row i and column j correspond to items i and j.

Echo · Accepted Answer

I ran into the same problem and solved it as follows.

Note that columnSimilarities() computes the similarities between the column vectors of the matrix, whereas your "rows" stores each item as a row vector. So you need the transpose of "rows"; call it "tran_rows". Then compute tran_rows.columnSimilarities().
After that it is easy: in the result of columnSimilarities(), indices i and j are the column indices of the transposed matrix, so they correspond exactly to item i and item j.
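
One way to get the transposed layout is to build the matrix with one row per user from the start, so that items are already the columns. The following is a minimal sketch of that idea, not necessarily what I ran at the time. The names ItemSimilaritySketch, itemSimilarities and similarity are made up for illustration, the toy ratings stand in for the MovieLens file, and item ids are assumed to be 1-based. Since columnSimilarities() returns an upper-triangular matrix, the lookup orders the two item ids first.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{MatrixEntry, RowMatrix}
import org.apache.spark.rdd.RDD

object ItemSimilaritySketch {
  // Hypothetical helper: cosine similarities between items, keyed by
  // (smaller column index, larger column index).
  def itemSimilarities(ratings: RDD[(Int, Int, Double)],
                       numItems: Int): scala.collection.Map[(Long, Long), Double] = {
    // One row per user; column index = item id - 1 (ids assumed 1-based).
    val rows = ratings
      .map { case (user, item, rate) => (user, (item - 1, rate)) }
      .groupByKey()
      .map { case (_, itemRatings) => Vectors.sparse(numItems, itemRatings.toSeq) }

    // Columns are items, so entry (i, j) compares item i + 1 with item j + 1.
    new RowMatrix(rows).columnSimilarities().entries
      .map { case MatrixEntry(i, j, sim) => ((i, j), sim) }
      .collectAsMap()
  }

  // Only entries with i < j are emitted, so order the item ids before the lookup.
  // Pairs that are absent from the result are reported as 0.0 here.
  def similarity(sims: scala.collection.Map[(Long, Long), Double],
                 itemA: Int, itemB: Int): Double = {
    val key = (math.min(itemA, itemB) - 1L, math.max(itemA, itemB) - 1L)
    sims.getOrElse(key, 0.0)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("item-similarity-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    // Toy (user, item, rating) triples standing in for the MovieLens file.
    val ratings = sc.parallelize(Seq(
      (1, 1, 5.0), (1, 2, 5.0),
      (2, 1, 3.0), (2, 2, 3.0), (2, 3, 1.0),
      (3, 3, 4.0)))
    val sims = itemSimilarities(ratings, numItems = 3)
    println(s"sim(item 1, item 2) = ${similarity(sims, 1, 2)}")
    println(s"sim(item 3, item 1) = ${similarity(sims, 3, 1)}")
    sc.stop()
  }
}
```

Collecting the entries into a map only makes sense for a small number of item pairs; for a large catalogue you would keep the RDD[MatrixEntry] and filter or join on it instead. The diagonal (an item with itself) is not part of the result, so this lookup returns 0.0 for it.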

Dung · Answer

If you do not need a threshold, you can call columnSimilarities on an IndexedRowMatrix instead. That works very well for me, and it gives you a better way to manage the row indices.
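
A rough sketch of how this could look, assuming a Spark version in which IndexedRowMatrix has the no-argument columnSimilarities() (1.6 or later, as far as I know). The object name IndexedRowSketch and the toy data are made up for illustration. Each item is stored as an IndexedRow whose index is the item id minus 1, and the matrix is transposed through a CoordinateMatrix so that items become the columns before the similarities are computed.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

object IndexedRowSketch {
  // Hypothetical helper: item-item cosine similarities keyed by (i, j) with i < j.
  def itemSimilarities(sc: SparkContext): scala.collection.Map[(Long, Long), Double] = {
    val numUsers = 3
    // One IndexedRow per item: row index = item id - 1, vector = ratings by user.
    val itemRows = sc.parallelize(Seq(
      IndexedRow(0L, Vectors.sparse(numUsers, Seq((0, 5.0), (1, 3.0)))),
      IndexedRow(1L, Vectors.sparse(numUsers, Seq((0, 5.0), (1, 3.0)))),
      IndexedRow(2L, Vectors.sparse(numUsers, Seq((1, 1.0), (2, 4.0))))))

    // Transpose so that items become columns; the explicit row indices
    // are what keeps the item ids attached through this step.
    val userByItem = new IndexedRowMatrix(itemRows)
      .toCoordinateMatrix()
      .transpose()
      .toIndexedRowMatrix()

    userByItem.columnSimilarities().entries
      .map(e => ((e.i, e.j), e.value))
      .collectAsMap()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("indexed-row-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    itemSimilarities(sc).foreach { case ((i, j), sim) =>
      println(s"item ${i + 1} vs item ${j + 1}: $sim")
    }
    sc.stop()
  }
}
```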

Tags:

apache-spark

apache-spark-mllib

Eric Zheng
