 

Recommendation system with matrix factorization for huge data gives MemoryError

I have three DB models (from Django) that can be used as the input for building a recommendation system:

  • Users List - with userId, username, email etc
  • Movies List - with movieId, movieTitle, Topics etc
  • Saves List - with userId, movieId and timestamp (the current recommendation system will be a little bit simpler than the usual approaches found online in the sense that there is no rating score, just the fact that the user has saved a certain movie, and this model contains all the movie saves)

I should still be able to use matrix factorization (MF) for building a recommendation system, even though the rating of a certain item will just be in the form of 1 and 0 (saved or not saved).

In order to use all the MF algorithms found in either scipy or surprise, I have to create a pandas DataFrame and pivot it such that all userIds will be the rows (indexes) and all movieIds will be the columns.

A snippet code for doing this is:

# usersSet and moviesSet contain only ids of users or movies

zeros = numpy.zeros(shape=(len(usersSet), len(moviesSet)), dtype=numpy.int8)

saves_df = pandas.DataFrame(zeros, index=list(usersSet), columns=list(moviesSet))

for save in savesFromDb.iterator(chunk_size=50000):
    userId = save['user__id']
    movieId = save['movie__id']

    saves_df.at[userId, movieId] = 1

Problems so far:

  • using DataFrame.loc from pandas to assign values to multiple columns instead of DataFrame.at gives MemoryError. This is why I went for the latter method.
  • using svds from scipy for MF requires floats or doubles as the values of the DataFrame, and as soon as I do DataFrame.asfptype() I get a MemoryError

Questions:

  1. Given that there are ~100k users, ~120k movies and ~450k saves, what's the best approach to model this in order to use recommendation algorithms but still not get MemoryError?
  2. I also tried using DataFrame.pivot(), but is there a way to build it from 3 different DataFrames? i.e. indexes will be from list(usersSet), columns from list(moviesList) and values by iterating over savesFromDb and seeing where there is a userId -> movieId relationship and adding 1 in the pivot.
  3. Aside from surprise's rating_scale parameter where you can define the rating (in my case would be (0, 1)), is there any other way in terms of algorithm approach or data model structure to leverage the fact that the rating in my case is only 1 or 0 (saved or not saved)?
Asked Aug 06 '19 by Vlad




1 Answer

If there is an option to use sparse matrices with algorithms that accept them, I would highly recommend doing so to get rid of the memory issues. scipy.sparse.linalg.svds works directly on scipy sparse matrices.

This is how you can go about creating sparse matrices for your case:

Let's say we have 3 users ('a', 'b', 'c') and 3 movies ('aa', 'bb', 'cc'). The save history is as follows:

  • a saves aa

  • b saves bb

  • c saves cc

  • a saves bb

We need to create a csr_matrix A_sparse such that users represent rows, movies represent columns, and if user i has saved movie j, then A[i, j] = 1

import numpy as np
from scipy.sparse import csr_matrix

# full id lists; in your case these would be list(usersSet) and list(moviesSet)
users = ["a", "b", "c"]
movies = ["aa", "bb", "cc"]

# index users and movies by integers
user2int = {u: i for i, u in enumerate(np.unique(users))}
movies2int = {m: i for i, m in enumerate(np.unique(movies))}

# the save history as parallel lists of user ids and movie ids
saved_users = ["a", "b", "c", "a"]
saved_movies = ["aa", "bb", "cc", "bb"]

# row and column indices where we need to populate 1's
usersidx = [user2int[u] for u in saved_users]
moviesidx = [movies2int[m] for m in saved_movies]

# Here, we only use a binary flag for the data: 1 for all saved instances.
# Instead, one could also use something like a count of saves.
data = np.ones(len(saved_users))

# create the csr matrix; passing shape explicitly keeps users/movies
# without any saves as empty rows/columns
A_sparse = csr_matrix((data, (usersidx, moviesidx)),
                      shape=(len(user2int), len(movies2int)))

print(repr(A_sparse))

# <3x3 sparse matrix of type '<class 'numpy.float64'>'
#   with 4 stored elements in Compressed Sparse Row format>

print(A_sparse.data.nbytes)
# 32

print("Dense array\n", A_sparse.A)

# [[1. 1. 0.]
#  [0. 1. 0.]
#  [0. 0. 1.]]

print(A_sparse.A.nbytes)
# 72

Note that since roughly half of the entries in this toy matrix are zeros, the sparse representation already takes less than half the memory of the dense ndarray (32 vs. 72 bytes). The savings grow with the fraction of zeros: with ~450k saves in a ~100k × ~120k matrix, only about 0.004% of the entries are non-zero, so the dense pivoted DataFrame would hold ~12 billion cells while the sparse matrix stores only the ~450k saves.
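To connect this back to the original question: the sparse matrix can be passed to scipy.sparse.linalg.svds without ever materializing a dense array. A minimal sketch on the same toy matrix (k=2 is an arbitrary choice here; for the real ~100k × ~120k matrix something on the order of 20–100 latent factors would be more typical):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

# the same toy save matrix as above (users as rows, movies as columns);
# svds needs a float dtype, which np.ones already provides
A_sparse = csr_matrix(
    (np.ones(4), ([0, 1, 2, 0], [0, 1, 2, 1])), shape=(3, 3)
)

# k (number of latent factors) must be strictly less than min(A.shape)
U, sigma, Vt = svds(A_sparse, k=2)

# low-rank reconstruction: entries at unsaved (user, movie) positions can
# be read as predicted affinity scores for recommendation
A_approx = U @ np.diag(sigma) @ Vt
print(A_approx.shape)  # (3, 3)
```

The factor shapes are U: (n_users, k), sigma: (k,), Vt: (k, n_movies), so for the full data you would hold roughly (100k + 120k) × k floats instead of the 12-billion-cell dense matrix.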

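For the real data, the manual user2int/movies2int dicts can also be built directly from a saves DataFrame with pandas categoricals — a sketch, where the saves DataFrame and its "userId"/"movieId" column names are assumptions standing in for the Django Saves model:

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# hypothetical saves table: one row per (user, movie) save event
saves = pd.DataFrame({
    "userId": ["a", "b", "c", "a"],
    "movieId": ["aa", "bb", "cc", "bb"],
})

# categoricals assign each distinct id a stable integer code,
# replacing the manual id->int dictionaries at scale
user_cat = saves["userId"].astype("category")
movie_cat = saves["movieId"].astype("category")

A_sparse = csr_matrix(
    (np.ones(len(saves)), (user_cat.cat.codes, movie_cat.cat.codes)),
    shape=(user_cat.cat.categories.size, movie_cat.cat.categories.size),
)

print(A_sparse.shape)  # (3, 3)
print(A_sparse.nnz)    # 4
```

The `.cat.categories` index also maps integer rows/columns back to the original userIds/movieIds when you want to turn predicted scores into actual recommendations.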
Answered Oct 18 '22 by Mohsin hasan