I'm trying to use scikit-learn's DBSCAN implementation to cluster a bunch of documents. First I create a TF-IDF matrix using scikit-learn's TfidfVectorizer (it's a 163405x13029 sparse matrix of type numpy.float64). Then I try to cluster specific subsets of this matrix. Things work fine when the subset is small (say, up to a few thousand rows), but with large subsets (tens of thousands of rows) I get ValueError: could not convert integer scalar.
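Roughly, the relevant code looks like this (documents and idxs are placeholders for my corpus and the subset's row indices):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import DBSCAN

    documents = [...]   # my corpus of raw text strings (placeholder)
    tfidf = TfidfVectorizer().fit_transform(documents)
    # tfidf: 163405x13029 sparse matrix of numpy.float64

    idxs = [...]        # row indices of the subset to cluster (placeholder)
    ncm_clusterizer = DBSCAN()
    # works for a few thousand indices, fails for tens of thousands
    ncm_clusterizer.fit_predict(tfidf[idxs])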
Here's the full traceback (idxs is a list of indices):
ValueError Traceback (most recent call last)
<ipython-input-1-73ee366d8de5> in <module>()
193 # use descriptions to clusterize items
194 ncm_clusterizer = DBSCAN()
--> 195 ncm_clusterizer.fit_predict(tfidf[idxs])
196 idxs_clusters = list(zip(idxs, ncm_clusterizer.labels_))
197 for e in idxs_clusters:
/usr/local/lib/python3.4/site-packages/sklearn/cluster/dbscan_.py in fit_predict(self, X, y, sample_weight)
294 cluster labels
295 """
--> 296 self.fit(X, sample_weight=sample_weight)
297 return self.labels_
/usr/local/lib/python3.4/site-packages/sklearn/cluster/dbscan_.py in fit(self, X, y, sample_weight)
264 X = check_array(X, accept_sparse='csr')
265 clust = dbscan(X, sample_weight=sample_weight,
--> 266 **self.get_params())
267 self.core_sample_indices_, self.labels_ = clust
268 if len(self.core_sample_indices_):
/usr/local/lib/python3.4/site-packages/sklearn/cluster/dbscan_.py in dbscan(X, eps, min_samples, metric, algorithm, leaf_size, p, sample_weight, n_jobs)
136 # This has worst case O(n^2) memory complexity
137 neighborhoods = neighbors_model.radius_neighbors(X, eps,
--> 138 return_distance=False)
139
140 if sample_weight is None:
/usr/local/lib/python3.4/site-packages/sklearn/neighbors/base.py in radius_neighbors(self, X, radius, return_distance)
584 if self.effective_metric_ == 'euclidean':
585 dist = pairwise_distances(X, self._fit_X, 'euclidean',
--> 586 n_jobs=self.n_jobs, squared=True)
587 radius *= radius
588 else:
/usr/local/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
1238 func = partial(distance.cdist, metric=metric, **kwds)
1239
--> 1240 return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1241
1242
/usr/local/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
1081 if n_jobs == 1:
1082 # Special case to avoid picklability checks in delayed
-> 1083 return func(X, Y, **kwds)
1084
1085 # TODO: in some cases, backend='threading' may be appropriate
/usr/local/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared)
243 YY = row_norms(Y, squared=True)[np.newaxis, :]
244
--> 245 distances = safe_sparse_dot(X, Y.T, dense_output=True)
246 distances *= -2
247 distances += XX
/usr/local/lib/python3.4/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
184 ret = a * b
185 if dense_output and hasattr(ret, "toarray"):
--> 186 ret = ret.toarray()
187 return ret
188 else:
/usr/local/lib/python3.4/site-packages/scipy/sparse/compressed.py in toarray(self, order, out)
918 def toarray(self, order=None, out=None):
919 """See the docstring for `spmatrix.toarray`."""
--> 920 return self.tocoo(copy=False).toarray(order=order, out=out)
921
922 ##############################################################
/usr/local/lib/python3.4/site-packages/scipy/sparse/coo.py in toarray(self, order, out)
256 M,N = self.shape
257 coo_todense(M, N, self.nnz, self.row, self.col, self.data,
--> 258 B.ravel('A'), fortran)
259 return B
260
ValueError: could not convert integer scalar
I'm using Python 3.4.3 (on Red Hat), scipy 0.18.1, and scikit-learn 0.18.1.
I tried the monkey patch suggested here, but it didn't work.
Googling around, I found a bug fix that apparently solved the same problem for other types of sparse matrices (like CSR), but not for COO.
I've tried feeding DBSCAN a sparse radius-neighborhood graph (instead of a feature matrix), as suggested here and sketched below, but the same error happens.
I've also tried HDBSCAN, but the same error happens.
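The neighborhood-graph variant I tried looks roughly like this (the eps value is illustrative):

    from sklearn.neighbors import radius_neighbors_graph
    from sklearn.cluster import DBSCAN

    eps = 0.5  # illustrative value
    # Build a sparse graph holding pairwise distances within eps
    graph = radius_neighbors_graph(tfidf[idxs], radius=eps, mode='distance')
    clusterizer = DBSCAN(eps=eps, metric='precomputed')
    clusterizer.fit_predict(graph)  # raises the same ValueError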
How can I fix or work around this?
Even if the implementation allowed it, DBSCAN would probably yield bad results on such high-dimensional data (statistically speaking, because of the curse of dimensionality). Instead, I would advise you to use the TruncatedSVD class to reduce the dimensionality of your TF-IDF feature vectors down to 50 or 100 components, and then apply DBSCAN to the results.
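For instance, something along these lines (n_components, eps, and min_samples are illustrative values you will need to tune):

    from sklearn.decomposition import TruncatedSVD
    from sklearn.cluster import DBSCAN

    # Reduce the sparse TF-IDF matrix to a dense, low-dimensional representation
    svd = TruncatedSVD(n_components=100)
    reduced = svd.fit_transform(tfidf[idxs])

    # DBSCAN on the reduced vectors; eps must be re-tuned for this space
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(reduced)

Since TruncatedSVD does not normalize its output, you may also want to pass the reduced vectors through sklearn.preprocessing.Normalizer first, so that Euclidean distances behave like cosine similarities (the usual LSA setup).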