I'm trying to perform hierarchical clustering on a large, sparse observation matrix. The matrix represents movie ratings for a number of users, and my goal is to cluster similar users based on their movie preferences. However, I need a dendrogram rather than a single flat division into clusters. To do this, I tried to use SciPy:
import numpy as np
from scipy.sparse import dok_matrix
from scipy.cluster import hierarchy

R = dok_matrix((nrows, ncols), dtype=np.float32)
for user in ratings:
    for item in ratings[user]:
        R[item, user] = ratings[user][item]  # rows are items, columns are users
# cluster users: transpose to users x items and densify
Z = hierarchy.linkage(R.transpose().toarray(), method='ward')
This works fine on a small dataset. However, I (obviously) run into memory problems when scaling up. Is there any way I can feed a sparse matrix to the algorithm?
From scipy/cluster/hierarchy.py, linkage processes the y argument as:
y = _convert_to_double(np.asarray(y, order='c'))

if y.ndim == 1:
    distance.is_valid_y(y, throw=True, name='y')
    [y] = _copy_arrays_if_base_present([y])
elif y.ndim == 2:
    if method in _EUCLIDEAN_METHODS and metric != 'euclidean':
        raise ValueError("Method '{0}' requires the distance metric "
                         "to be Euclidean".format(method))
    y = distance.pdist(y, metric)
else:
    raise ValueError("`y` must be 1 or 2 dimensional.")
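So for a 2d input, all linkage does before the real work is call pdist to build a condensed distance array. A minimal sketch (with made-up random data, not the OP's ratings) of what that step produces:

import numpy as np
from scipy.spatial import distance

X = np.random.rand(6, 4)            # 6 observations with 4 features each
d = distance.pdist(X, 'euclidean')  # condensed pairwise distances
print(d.shape)                      # (15,) i.e. 6*5/2 entries

Note that this condensed array has n*(n-1)/2 entries for n observations, so it is dense regardless of how sparse the observations themselves are.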
When I apply asarray to a dok matrix, I get a 0d object array. It just wraps the dictionary in an array.
In [905]: M=sparse.dok_matrix([[1,0,0,2,3],[0,0,0,0,1]])
In [906]: M
Out[906]:
<2x5 sparse matrix of type '<class 'numpy.int32'>'
with 4 stored elements in Dictionary Of Keys format>
In [908]: m = np.asarray(M)
In [909]: m
Out[909]:
array(<2x5 sparse matrix of type '<class 'numpy.int32'>'
with 4 stored elements in Dictionary Of Keys format>, dtype=object)
In [910]: m.shape
Out[910]: ()
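If I densify the sparse matrix first with toarray (as the question's code already does), asarray sees an ordinary 2d ndarray. Continuing the same session, the output would look roughly like:

In [911]: np.asarray(M.toarray())
Out[911]:
array([[1, 0, 0, 2, 3],
       [0, 0, 0, 0, 1]])

In [912]: _.shape
Out[912]: (2, 5)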
linkage accepts either a 1d condensed-style distance matrix or a 2d array of observation vectors (which it reduces with pdist, as above).
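For example, with a small hypothetical dense users-by-items array X (a sketch, not the OP's data), both input forms produce the same linkage matrix:

import numpy as np
from scipy.cluster import hierarchy
from scipy.spatial import distance

X = np.random.rand(10, 5)                                  # 10 users, 5 items
Z1 = hierarchy.linkage(X, method='ward')                   # 2d observations
Z2 = hierarchy.linkage(distance.pdist(X), method='ward')   # 1d condensed distances
print(np.allclose(Z1, Z2))                                 # True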
Looking further in linkage, I deduce that ward uses nn_chain, which is in the compiled scipy/cluster/_hierarchy.cpython-35m-i386-linux-gnu.so file. That puts the working part of the method even further out of reach of the casual Python programmer.