
Spark: How can I retrieve an item pair after calculating similarity using RowMatrix?

I have run into the "all-pairs similarity" problem in my recommendation system. According to this Databricks blog post, RowMatrix may help.

However, RowMatrix is a matrix type without meaningful row indices, so I don't know how to retrieve the similarity for a specific pair of items i and j after invoking columnSimilarities(threshold).

Below are some details about what I am doing:

1) My data file comes from Movielens with format like this:

user::item::rating

2) I build a RowMatrix in which sparse vector i holds the ratings that all users gave to item i:

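import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.recommendation.Rating
import org.apache.spark.rdd.RDD

val dataPath = ...
// Parse each "user::item::rating" line into a Rating
val ratings: RDD[Rating] = sc.textFile(dataPath).map(_.split("::") match {
  case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble)
})
// One sparse vector per item, indexed by (user id - 1); userAmount (defined elsewhere) is the vector size
val rows = ratings.map(rating => (rating.product, (rating.user, rating.rating)))
  .groupByKey()
  .map(p => Vectors.sparse(userAmount, p._2.map(r => (r._1 - 1, r._2)).toSeq))

val mat = new RowMatrix(rows)

val similarities = mat.columnSimilarities(0.5)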

Now I have a CoordinateMatrix, similarities. How can I get the similarity of a specific pair of items i and j? Although it can be converted to an RDD[MatrixEntry], I am not sure whether row i and column j correspond to items i and j.

Eric Zheng asked Apr 25 '15 02:04


2 Answers

I ran into the same problem and solved it as follows.

  1. Note that columnSimilarities() computes the similarity between column vectors, whereas "rows" is made up of row vectors (one per item). So you need the transpose of "rows", call it "tran_rows", in which each item is a column. Then calculate tran_rows.columnSimilarities().

  2. The rest is easy. In the result of columnSimilarities(), the indices i and j correspond to columns i and j of the transposed matrix, that is, to item i and item j.
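The two steps above can be sketched roughly as follows. As far as I know RowMatrix has no transpose method, so this sketch builds the transposed layout directly: one row per user, one column per item. The toy ratings, the local SparkContext settings, and the names itemAmount, userRows and similarity are assumptions made for illustration, and item ids are assumed to be contiguous and 1-based as in the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.mllib.recommendation.Rating

// Hypothetical local context and toy data, for illustration only.
val conf = new SparkConf().setMaster("local[2]").setAppName("item-similarity-sketch")
val sc = SparkContext.getOrCreate(conf)
val ratings = sc.parallelize(Seq(
  Rating(1, 1, 5.0), Rating(1, 2, 3.0),
  Rating(2, 1, 4.0), Rating(2, 2, 2.0), Rating(2, 3, 1.0),
  Rating(3, 3, 5.0)
))

// Assumption: item ids are contiguous and start at 1, so the largest id is the column count.
val itemAmount = ratings.map(_.product).max()

// One sparse row per USER; column (item id - 1) holds that user's rating of the item.
val userRows = ratings
  .map(r => (r.user, (r.product - 1, r.rating)))
  .groupByKey()
  .map { case (_, itemRatings) => Vectors.sparse(itemAmount, itemRatings.toSeq) }

// Columns are items now, so column similarities are item-item similarities.
// No threshold is passed here, which requests the exact (brute-force) computation.
val similarities = new RowMatrix(userRows).columnSimilarities()

// The result is upper triangular: only entries with i < j are stored,
// and i, j are 0-based column positions, hence the "- 1" below.
def similarity(itemA: Int, itemB: Int): Option[Double] = {
  val i = math.min(itemA, itemB) - 1
  val j = math.max(itemA, itemB) - 1
  similarities.entries
    .filter(e => e.i == i && e.j == j)
    .map(_.value)
    .collect()
    .headOption
}
```

With this layout the row order no longer matters, because a cosine similarity between two columns does not depend on how the rows are ordered. A call such as similarity(1, 3) returns None when the pair has no stored entry, for example when a threshold has pruned it.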

Echo answered Nov 11 '22 01:11


If you don't need a threshold, you can call columnSimilarities on an IndexedRowMatrix. That works very well for me, and it gives you a better way to manage the row indices.
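A minimal sketch of that idea, assuming a Spark version in which IndexedRowMatrix provides columnSimilarities() (it takes no threshold argument). The toy data, the local SparkContext settings, and the names indexedRows and byItemPair are made up for illustration. Each row carries its user id explicitly, while the indices in the result are still 0-based column (item) positions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.mllib.recommendation.Rating

// Hypothetical local context and toy data, for illustration only.
val conf = new SparkConf().setMaster("local[2]").setAppName("indexed-similarity-sketch")
val sc = SparkContext.getOrCreate(conf)
val ratings = sc.parallelize(Seq(
  Rating(1, 1, 5.0), Rating(1, 2, 3.0),
  Rating(2, 1, 4.0), Rating(2, 2, 2.0), Rating(2, 3, 1.0),
  Rating(3, 3, 5.0)
))

// Assumption: item ids are contiguous and start at 1.
val itemAmount = ratings.map(_.product).max()

// One IndexedRow per user: the row index is (user id - 1), the columns are items.
val indexedRows = ratings
  .map(r => (r.user, (r.product - 1, r.rating)))
  .groupByKey()
  .map { case (user, itemRatings) =>
    IndexedRow(user.toLong - 1, Vectors.sparse(itemAmount, itemRatings.toSeq))
  }

// Exact cosine similarities between columns (items); no threshold parameter.
val sims = new IndexedRowMatrix(indexedRows).columnSimilarities()

// Collect into a lookup keyed by the original 1-based item ids (only i < j is stored).
// Collecting is only reasonable here because the toy result is tiny.
val byItemPair: Map[(Int, Int), Double] = sims.entries
  .map(e => ((e.i.toInt + 1, e.j.toInt + 1), e.value))
  .collect()
  .toMap
```

Here byItemPair.get((1, 3)) looks up the similarity of items 1 and 3. For a real data set it is probably better to keep the entries as an RDD and filter or join on them instead of collecting.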

Dung answered Nov 11 '22 02:11