
Spark - How to create a sparse matrix from item ratings

My question is equivalent to R-related post Create Sparse Matrix from a data frame, except that I would like to perform the same thing on Spark (preferably in Scala).

Sample of data in the data.txt file from which the sparse matrix is being created:

UserID MovieID  Rating
2      1       1
3      2       1
4      2       1
6      2       1
7      2       1

So in the end the columns are the movie IDs and the rows are the user IDs

    1   2   3   4   5   6   7
1   0   0   0   0   0   0   0
2   1   0   0   0   0   0   0
3   0   1   0   0   0   0   0
4   0   1   0   0   0   0   0
5   0   0   0   0   0   0   0
6   0   1   0   0   0   0   0
7   0   1   0   0   0   0   0

I've actually started by doing a map RDD transformation on the data.txt file (without the header) to convert the values to integers, but then I could not find a function for creating a sparse matrix.

import org.apache.spark.mllib.recommendation.Rating

val data = sc.textFile("/data/data.txt")
// The sample data is whitespace-separated, so split on whitespace
val ratings = data.map(_.split("\\s+") match { case Array(user, item, rate) =>
    Rating(user.toInt, item.toInt, rate.toDouble)
  })
...?
asked Sep 04 '15 by guzu92

1 Answer

The simplest way is to map the Ratings to MatrixEntries and create a CoordinateMatrix:

import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

val mat = new CoordinateMatrix(ratings.map {
    case Rating(user, movie, rating) => MatrixEntry(user, movie, rating)
})

A CoordinateMatrix can be further converted to a BlockMatrix, IndexedRowMatrix, or RowMatrix using toBlockMatrix, toIndexedRowMatrix, and toRowMatrix respectively.
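For intuition, here is a minimal pure-Scala sketch (no Spark required) of the coordinate (COO) layout that a CoordinateMatrix stores: each rating becomes a (row, column, value) entry, and only non-zero cells are kept. The `Entry` case class and `toDense` helper are illustrative names, not part of the Spark API.

```scala
// One non-zero cell of the sparse matrix: (row index, column index, value).
case class Entry(i: Int, j: Int, value: Double)

object CooDemo {
  // Sample ratings from the question: (userId, movieId, rating).
  val ratings = Seq((2, 1, 1.0), (3, 2, 1.0), (4, 2, 1.0), (6, 2, 1.0), (7, 2, 1.0))

  // Sparse (COO) representation: only the five non-zero cells are stored.
  val entries: Seq[Entry] = ratings.map { case (u, m, r) => Entry(u, m, r) }

  // Materialize a small dense view for inspection
  // (rows = user IDs, columns = movie IDs, both 1-based as in the question).
  def toDense(nRows: Int, nCols: Int): Array[Array[Double]] = {
    val dense = Array.fill(nRows, nCols)(0.0)
    entries.foreach(e => dense(e.i - 1)(e.j - 1) = e.value)
    dense
  }
}
```

Calling `CooDemo.toDense(7, 7)` reproduces the 7x7 dense matrix shown in the question, which is why COO storage scales: the dense view needs nRows x nCols cells, while the sparse form needs only one entry per rating.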

answered Sep 28 '22 by zero323