
How to create a similarity matrix in NumPy (Python)?

I have data in a file in the following form:

user_id, item_id, rating
1, abc,5
1, abcd,3
2, abc, 3
2, fgh, 5

The matrix I want to form from the above data is the following:

#   item_ids
# abc  abcd  fgh
[[5,    3,    0]  # user_id 1
 [3,    0,    5]] # user_id 2

where missing data is replaced by 0.
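A minimal sketch of how such a matrix might be built with NumPy, assuming the rows have already been parsed into `(user_id, item_id, rating)` tuples. The `records` list and the variable names are illustrative, not part of the original data file:

```python
import numpy as np

# Hypothetical in-memory copy of the rows shown above (header skipped).
records = [
    (1, "abc", 5),
    (1, "abcd", 3),
    (2, "abc", 3),
    (2, "fgh", 5),
]

user_ids, item_ids, ratings = zip(*records)

# np.unique returns the sorted unique labels and, with return_inverse=True,
# the position of each record's label within that sorted list.
users, user_idx = np.unique(user_ids, return_inverse=True)
items, item_idx = np.unique(item_ids, return_inverse=True)

# Start from zeros so that missing (user, item) pairs stay 0.
matrix = np.zeros((len(users), len(items)), dtype=int)
matrix[user_idx, item_idx] = ratings

print(items)   # column labels
print(matrix)  # one row per user
```

If the same (user, item) pair appears more than once in the file, this sketch keeps only one of the ratings, so such duplicates would need to be aggregated first.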

From this, I want to create both a user-to-user similarity matrix and an item-to-item similarity matrix.

How do I do that?

frazman asked Aug 25 '13 19:08



2 Answers

Technically, this is a math problem rather than a programming problem. I think you are better off using a variance-covariance matrix, or a correlation matrix if the scales of the values are very different. Say, instead of having:

>>> x
array([[5, 3, 0],
       [3, 0, 5],
       [5, 5, 0],
       [1, 1, 7]])

You have:

>>> x
array([[5, 300, 0],
       [3, 0, 5],
       [5, 500, 0],
       [1, 100, 7]])

To get the variance-covariance matrix (the output below is for the first array):

>>> np.cov(x)
array([[  6.33333333,  -3.16666667,   6.66666667,  -8.        ],
       [ -3.16666667,   6.33333333,  -5.83333333,   7.        ],
       [  6.66666667,  -5.83333333,   8.33333333, -10.        ],
       [ -8.        ,   7.        , -10.        ,  12.        ]])

Or the correlation matrix:

>>> np.corrcoef(x)
array([[ 1.        , -0.5       ,  0.91766294, -0.91766294],
       [-0.5       ,  1.        , -0.80295507,  0.80295507],
       [ 0.91766294, -0.80295507,  1.        , -1.        ],
       [-0.91766294,  0.80295507, -1.        ,  1.        ]])

Here is how to read it: each diagonal cell, e.g. the (0,0) cell, is the correlation of a vector in x with itself, so it is 1. Each off-diagonal cell, e.g. the (0,1) cell, is the correlation between two vectors: here the 1st and 2nd vectors in x, which are negatively correlated. Similarly, the 1st and 3rd vectors are positively correlated.

A covariance matrix or correlation matrix avoids the zero problem pointed out by @Akavall.
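Note that `np.cov` and `np.corrcoef` treat each row as a variable by default, so on a users-by-items matrix the calls above compare users. A minimal sketch of how both matrices asked for in the question might be obtained, assuming rows are users and columns are items (the names `user_sim` and `item_sim` are illustrative):

```python
import numpy as np

# Rows are users, columns are items (the first example array above).
x = np.array([[5, 3, 0],
              [3, 0, 5],
              [5, 5, 0],
              [1, 1, 7]])

# Default rowvar=True: each row is a variable -> user-to-user matrix.
user_sim = np.corrcoef(x)

# rowvar=False: each column is a variable -> item-to-item matrix.
item_sim = np.corrcoef(x, rowvar=False)

print(user_sim.shape)  # (4, 4)
print(item_sim.shape)  # (3, 3)
```

A row or column with zero variance (for example, an item that every user rated identically) produces NaN entries, because the correlation is undefined there.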

CT Zhu answered Nov 14 '22 21:11


See this question: What's the fastest way in Python to calculate cosine similarity given sparse matrix data?

Having:

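import numpy as np
from sklearn.metrics import pairwise_distances

A = np.array(
    [[0, 1, 0, 0, 1],
     [0, 0, 1, 1, 1],
     [1, 1, 0, 1, 0]])

# Cosine similarity = 1 - cosine distance
dist_out = 1 - pairwise_distances(A, metric="cosine")
dist_out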

This results in:

array([[ 1.        ,  0.40824829,  0.40824829],
       [ 0.40824829,  1.        ,  0.33333333],
       [ 0.40824829,  0.33333333,  1.        ]])

But that works for dense matrices. For sparse matrices you have to develop your own solution.
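If scikit-learn is not available, the same cosine similarity can be sketched in plain NumPy by normalizing each row to unit length and taking dot products. The function name `cosine_similarity_matrix` and the handling of all-zero rows are assumptions made for this illustration:

```python
import numpy as np

def cosine_similarity_matrix(m):
    """Pairwise cosine similarity between the rows of a 2-D array."""
    m = np.asarray(m, dtype=float)
    norms = np.linalg.norm(m, axis=1, keepdims=True)
    # Assumption: an all-zero row is left as zeros, so its similarity to
    # every row (including itself) comes out as 0 instead of NaN.
    norms[norms == 0] = 1.0
    unit = m / norms
    return unit @ unit.T

# The user-by-item matrix from the question.
ratings = np.array([[5, 3, 0],
                    [3, 0, 5]])

user_sim = cosine_similarity_matrix(ratings)    # rows are users -> (2, 2)
item_sim = cosine_similarity_matrix(ratings.T)  # rows are items -> (3, 3)

print(user_sim)
print(item_sim)
```

Transposing the matrix is what switches between user-to-user and item-to-item similarity. Applied to the array A above, the function should give the same values as the `pairwise_distances` result.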

Medeiros answered Nov 14 '22 22:11