I'm trying to use scikit-learn's DBSCAN implementation to cluster a bunch of documents. First I create the TF-IDF matrix using scikit-learn's TfidfVectorizer (it's a 163405x13029 sparse matrix of type numpy.float64). Then I try to cluster specific subsets of this matrix. Things work fine when the subset is small (say, up to a few thousand rows), but with large subsets (tens of thousands of rows) I get ValueError: could not convert integer scalar.
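Here's roughly what I'm doing (documents stands in for my actual corpus, and idxs is a list of row indices selecting the subset to cluster):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

# documents is my corpus, a list of raw text strings
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)  # 163405x13029 sparse matrix

# cluster one subset of the rows
ncm_clusterizer = DBSCAN()
ncm_clusterizer.fit_predict(tfidf[idxs])  # fails when len(idxs) is in the tens of thousands
idxs_clusters = list(zip(idxs, ncm_clusterizer.labels_))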
Here's the full traceback:
ValueError                        Traceback (most recent call last)
<ipython-input-1-73ee366d8de5> in <module>()
    193     # use descriptions to clusterize items
    194     ncm_clusterizer = DBSCAN()
--> 195     ncm_clusterizer.fit_predict(tfidf[idxs])
    196     idxs_clusters = list(zip(idxs, ncm_clusterizer.labels_))
    197     for e in idxs_clusters:
/usr/local/lib/python3.4/site-packages/sklearn/cluster/dbscan_.py in fit_predict(self, X, y, sample_weight)
    294             cluster labels
    295         """
--> 296         self.fit(X, sample_weight=sample_weight)
    297         return self.labels_
/usr/local/lib/python3.4/site-packages/sklearn/cluster/dbscan_.py in fit(self, X, y, sample_weight)
    264         X = check_array(X, accept_sparse='csr')
    265         clust = dbscan(X, sample_weight=sample_weight,
--> 266                        **self.get_params())
    267         self.core_sample_indices_, self.labels_ = clust
    268         if len(self.core_sample_indices_):
/usr/local/lib/python3.4/site-packages/sklearn/cluster/dbscan_.py in dbscan(X, eps, min_samples, metric, algorithm, leaf_size, p, sample_weight, n_jobs)
    136         # This has worst case O(n^2) memory complexity
    137         neighborhoods = neighbors_model.radius_neighbors(X, eps,
--> 138                                                          return_distance=False)
    139 
    140     if sample_weight is None:
/usr/local/lib/python3.4/site-packages/sklearn/neighbors/base.py in radius_neighbors(self, X, radius, return_distance)
    584             if self.effective_metric_ == 'euclidean':
    585                 dist = pairwise_distances(X, self._fit_X, 'euclidean',
--> 586                                           n_jobs=self.n_jobs, squared=True)
    587                 radius *= radius
    588             else:
/usr/local/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
   1238         func = partial(distance.cdist, metric=metric, **kwds)
   1239 
-> 1240     return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1241 
   1242 
/usr/local/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
   1081     if n_jobs == 1:
   1082         # Special case to avoid picklability checks in delayed
-> 1083         return func(X, Y, **kwds)
   1084 
   1085     # TODO: in some cases, backend='threading' may be appropriate
/usr/local/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared)
    243         YY = row_norms(Y, squared=True)[np.newaxis, :]
    244 
--> 245     distances = safe_sparse_dot(X, Y.T, dense_output=True)
    246     distances *= -2
    247     distances += XX
/usr/local/lib/python3.4/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
    184         ret = a * b
    185         if dense_output and hasattr(ret, "toarray"):
--> 186             ret = ret.toarray()
    187         return ret
    188     else:
/usr/local/lib/python3.4/site-packages/scipy/sparse/compressed.py in toarray(self, order, out)
    918     def toarray(self, order=None, out=None):
    919         """See the docstring for `spmatrix.toarray`."""
--> 920         return self.tocoo(copy=False).toarray(order=order, out=out)
    921 
    922     ##############################################################
/usr/local/lib/python3.4/site-packages/scipy/sparse/coo.py in toarray(self, order, out)
    256         M,N = self.shape
    257         coo_todense(M, N, self.nnz, self.row, self.col, self.data,
--> 258                     B.ravel('A'), fortran)
    259         return B
    260 
ValueError: could not convert integer scalar
I'm using Python 3.4.3 (on Red Hat), scipy 0.18.1, and scikit-learn 0.18.1.
I tried the monkey patch suggested here, but that didn't work.
Googling around, I found a bugfix that apparently solved the same problem for other types of sparse matrices (like csr), but not for coo.
I've tried feeding DBSCAN a sparse radius neighborhood graph (instead of a feature matrix), as suggested here, but the same error happens.
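(For reference, that attempt looked roughly like this; eps = 0.5 below is just a placeholder for my actual value:)

from sklearn.neighbors import radius_neighbors_graph
from sklearn.cluster import DBSCAN

eps = 0.5  # placeholder, not my actual value
# sparse graph keeping only the pairwise distances within eps
graph = radius_neighbors_graph(tfidf[idxs], radius=eps, mode='distance')
ncm_clusterizer = DBSCAN(eps=eps, metric='precomputed')
ncm_clusterizer.fit_predict(graph)  # the same ValueError happens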
I've tried HDBSCAN, but the same error happens.
How can I fix this, or work around it?
Even if the implementation allowed it, DBSCAN would probably yield poor results on such high-dimensional data, statistically speaking, because of the curse of dimensionality.
Instead, I would advise using the TruncatedSVD class to reduce the dimensionality of your TF-IDF feature vectors to 50 or 100 components, and then applying DBSCAN to the result.
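A sketch of that pipeline (the eps and min_samples values below are placeholders you'd need to tune for your data):

from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.cluster import DBSCAN

# project the sparse TF-IDF matrix onto 100 latent components (LSA)
svd = TruncatedSVD(n_components=100)
reduced = svd.fit_transform(tfidf[idxs])  # dense array of shape (n_samples, 100)

# TruncatedSVD does not normalize its output; renormalizing the rows makes
# euclidean distances on the reduced vectors behave like cosine distances
# on the original TF-IDF vectors
reduced = Normalizer(copy=False).fit_transform(reduced)

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(reduced)

Since the reduced matrix is dense and only 100 columns wide, the huge sparse-to-dense conversion that fails inside safe_sparse_dot in your traceback never happens.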