Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hierarchical clustering on sparse observation matrix

I'm trying to perform hierarchical clustering on large sparse observation matrix. The matrix represents movie ratings for a number of users. My goal is to cluster similar users based on their movie preferences. However, I need a dendrogram, rather than single division. In order to do this, I tried to use SciPy:

R = dok_matrix((nrows, ncols), dtype=np.float32)

for user in ratings:
    for item in ratings[user]:
        R[item, user] = ratings[user][item]

Z = hierarchy.linkage(R.transpose().toarray(), method='ward')

This works fine on a small data-set:

enter image description here

However, I (obviously) get memory problems when scaling up. If there any way I can feed sparse matrix to the algorithm?

like image 686
Siegmeyer Avatar asked Nov 07 '22 20:11

Siegmeyer


1 Answers

From scipy/cluster/hierarchy.py linkage processes the y argument as:

y = _convert_to_double(np.asarray(y, order='c'))

if y.ndim == 1:
    distance.is_valid_y(y, throw=True, name='y')
    [y] = _copy_arrays_if_base_present([y])
elif y.ndim == 2:
    if method in _EUCLIDEAN_METHODS and metric != 'euclidean':
        raise ValueError("Method '{0}' requires the distance metric "
                         "to be Euclidean".format(method))
    y = distance.pdist(y, metric)
else:
    raise ValueError("`y` must be 1 or 2 dimensional.")

When I apply asarray to a dok I get a 0d object array. It just wraps the dictionary in an array.

In [905]: M=sparse.dok_matrix([[1,0,0,2,3],[0,0,0,0,1]])
In [906]: M
Out[906]: 
<2x5 sparse matrix of type '<class 'numpy.int32'>'
    with 4 stored elements in Dictionary Of Keys format>
In [908]: m = np.asarray(M)
In [909]: m
Out[909]: 
array(<2x5 sparse matrix of type '<class 'numpy.int32'>'
    with 4 stored elements in Dictionary Of Keys format>, dtype=object)
In [910]: m.shape
Out[910]: ()

linkage accepts a 1d compressed style distance matrix, or the equivalent 2d one.

Looking further in linkage I deduce that ward uses nn_chain, which is in the compiled scipy/cluster/_hierarchy.cpython-35m-i386-linux-gnu.so file. That puts the working part of the method even further out of reach of the casual Python programmer.

like image 170
hpaulj Avatar answered Nov 14 '22 22:11

hpaulj