scikit-learn: clustering text documents using DBSCAN

I'm trying to use scikit-learn to cluster text documents. On the whole, I find my way around, but I have problems with specific issues. Most of the examples I found illustrate clustering with scikit-learn using k-means as the clustering algorithm. Adapting these k-means examples to my setting works in principle. However, k-means is not suitable since I don't know the number of clusters. From what I have read so far (please correct me here if needed), DBSCAN or MeanShift seem to be more appropriate in my case. The scikit-learn website provides examples for each clustering algorithm. The problem is that with both DBSCAN and MeanShift I get errors I cannot comprehend, let alone solve.

My minimal code is as follows:

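from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

docs = []
for item in [database]:  # placeholder for iterating over the stored documents
    docs.append(item)

# Build the TF-IDF document-term matrix (a scipy sparse matrix)
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(docs)

X = X.todense() # <-- This line was needed to resolve the issue

db = DBSCAN(eps=0.3, min_samples=10).fit(X)
...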

(My documents are already preprocessed, i.e., stopwords have been removed and a Porter stemmer has been applied.)

When I run this code, I get the following error when instantiating DBSCAN and calling fit():

...
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.py", line 248, in fit
clust = dbscan(X, **self.get_params())
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/dbscan_.py", line 86, in dbscan
n = X.shape[0]
IndexError: tuple index out of range

Clicking on the line in dbscan_.py that throws the error, I noticed the following lines:

...
X = np.asarray(X)
n = X.shape[0]
...

When I use these two lines directly in my code for testing, I get the same error. I don't really know what np.asarray(X) is doing here, but after the call, X.shape is (). Hence X.shape[0] bombs; before the call, X.shape[0] correctly refers to the number of documents. Out of curiosity, I removed X = np.asarray(X) from dbscan_.py. When I do this, something computes heavily for a while, but after some seconds I get another error:

...
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 214, in extractor
(min_indx,max_indx) = check_bounds(indices,N)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 198, in check_bounds
max_indx = indices.max()
File "/usr/lib/python2.7/dist-packages/numpy/core/_methods.py", line 17, in _amax
out=out, keepdims=keepdims)
ValueError: zero-size array to reduction operation maximum which has no identity

In short, I have no clue how to get DBSCAN working, or what I might have missed, in general.

asked Aug 09 '14 09:08

Christian

2 Answers

It looks like sparse representations for DBSCAN are supported as of Jan. 2015.

I upgraded sklearn to 0.16.1 and it worked for me on text.
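
As a minimal sketch of what this looks like, assuming scikit-learn 0.16 or later: the sparse TF-IDF matrix is passed to DBSCAN directly, without calling todense(). The helper name cluster_documents, the toy documents, the cosine metric, and the eps and min_samples values are illustrative assumptions, not taken from the question.

```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_documents(docs, eps=0.5, min_samples=2):
    """Cluster raw text documents with DBSCAN on a sparse TF-IDF matrix."""
    # fit_transform returns a scipy sparse matrix; it is passed on as-is,
    # without densifying it first.
    X = TfidfVectorizer().fit_transform(docs)
    # Assumption: cosine distance, so eps means "1 - cosine similarity".
    model = DBSCAN(eps=eps, min_samples=min_samples, metric="cosine").fit(X)
    # labels_ holds one cluster id per document; -1 marks noise points.
    return model.labels_


# Hypothetical toy corpus: two pairs of identical documents and one outlier.
toy_docs = [
    "cat dog pet",
    "cat dog pet",
    "stock market price",
    "stock market price",
    "quantum physics lecture",
]
labels = cluster_documents(toy_docs)
print(labels)
```

On real data, eps and min_samples have to be tuned; the values above only suit the toy corpus.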

answered Oct 14 '22 08:10

rump roast

The implementation in sklearn seems to assume you are dealing with a finite vector space, and wants to find the dimensionality of your data set. Text data is commonly represented as sparse vectors, but not with the same dimensionality.

Your input data probably isn't a data matrix, but the sklearn implementation needs it to be one.
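
A small sketch of why the traceback in the question ends in an IndexError, assuming NumPy and SciPy are installed: np.asarray does not convert a scipy sparse matrix into a 2-D array, it wraps the whole object in a zero-dimensional array, so shape[0] does not exist. The 3x2 matrix below is a made-up stand-in for the TF-IDF matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Stand-in for the TF-IDF matrix: 3 "documents", 2 "terms".
X_sparse = csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]))

# np.asarray wraps the sparse matrix object instead of converting it.
wrapped = np.asarray(X_sparse)
print(wrapped.shape, wrapped.dtype)

# An explicit conversion yields a regular 2-D array.
dense = X_sparse.toarray()
print(dense.shape)
```

Densifying works for small corpora, but a dense document-term matrix can become very large when the vocabulary is big.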

You'll need to find a different implementation. Maybe try the implementation in ELKI, which is very fast, and should not have this limitation.

You'll need to spend some time understanding similarity first. For DBSCAN, you must choose epsilon in a way that makes sense for your data. There is no rule of thumb; this is domain specific. Therefore, you first need to figure out which similarity threshold means that two documents are similar.

Mean Shift may actually need your data to be in a vector space of fixed dimensionality.
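
One way to start exploring this, sketched under assumptions: compute pairwise cosine similarities on the TF-IDF vectors and look at the values for document pairs you already consider similar or dissimilar. The helper name similarity_matrix and the sample documents are hypothetical; if DBSCAN is then run with cosine distance, a similarity threshold s corresponds to eps = 1 - s.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def similarity_matrix(docs):
    """Return the dense matrix of pairwise cosine similarities of TF-IDF vectors."""
    X = TfidfVectorizer().fit_transform(docs)
    # cosine_similarity accepts sparse input and returns a dense ndarray.
    return cosine_similarity(X)


# Hypothetical documents whose relatedness is already known by inspection.
sample_docs = [
    "cat dog pet",
    "cat dog pet",
    "stock market price",
]
sim = similarity_matrix(sample_docs)

# Print each pair so the values can be compared against one's own judgment.
for i in range(len(sample_docs)):
    for j in range(i + 1, len(sample_docs)):
        print(i, j, round(float(sim[i, j]), 3))
```

This only shows where similar and dissimilar pairs fall for one corpus; the threshold itself remains a domain decision.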

answered Oct 14 '22 09:10

Has QUIT--Anony-Mousse

Tags: machine-learning, cluster-analysis, scikit-learn, data-mining, dbscan