Clustering text documents using scikit-learn kmeans in Python

I need to use scikit-learn's KMeans to cluster text documents. The example code works fine as is, but it takes the 20newsgroups data as input. I want to use the same code to cluster a list of documents, as shown below:

    documents = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]

What changes do I need to make to the KMeans example code to use this list as input? (Simply setting 'dataset = documents' doesn't work.)

Nabila Shahid asked Jan 11 '15 17:01





1 Answer

This is a simpler example:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans
    from sklearn.metrics import adjusted_rand_score

    documents = ["Human machine interface for lab abc computer applications",
                 "A survey of user opinion of computer system response time",
                 "The EPS user interface management system",
                 "System and human system engineering testing of EPS",
                 "Relation of user perceived response time to error measurement",
                 "The generation of random binary unordered trees",
                 "The intersection graph of paths in trees",
                 "Graph minors IV Widths of trees and well quasi ordering",
                 "Graph minors A survey"]

Vectorize the text, i.e. convert the strings to numeric features:

    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(documents)
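To see what the vectorizer actually produces, here is a minimal, self-contained sketch that inspects the TF-IDF matrix. The variable names `n_docs` and `n_terms` are only illustrative and are not part of the original answer.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

# X is a SciPy sparse matrix: one row per document, one column per term
n_docs, n_terms = X.shape
print("documents:", n_docs, "terms:", n_terms)

# vocabulary_ maps each term kept by the vectorizer to its column index in X
print(sorted(vectorizer.vocabulary_)[:5])
```

This is also why assigning the raw list to 'dataset' is not enough: KMeans needs a numeric matrix like `X`, not a list of strings.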

Cluster the documents:

    true_k = 2
    model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
    model.fit(X)
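Once the model is fitted, each document's cluster can be read from `model.labels_`. The sketch below is self-contained and uses the same parameters as above; `random_state=0` is an assumption added here only so that repeated runs give the same result.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

X = TfidfVectorizer(stop_words='english').fit_transform(documents)

true_k = 2
# random_state is an assumption for reproducibility, not in the original answer
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100,
               n_init=1, random_state=0)
model.fit(X)

# labels_[i] is the index of the cluster assigned to documents[i]
for label, doc in zip(model.labels_, documents):
    print(label, doc)
```

With only nine short documents the assignment can change from seed to seed, so treat any particular grouping as illustrative.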

Print the top terms per cluster:

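    print("Top terms per cluster:")
    # Sort each centroid's term weights in descending order
    order_centroids = model.cluster_centers_.argsort()[:, ::-1]
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    terms = vectorizer.get_feature_names_out()
    for i in range(true_k):
        print("Cluster %d:" % i, end="")
        for ind in order_centroids[i, :10]:
            print(" %s" % terms[ind], end="")
        print()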
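A fitted model can also assign unseen text to one of the clusters. This is a sketch that goes beyond the steps above: the two query strings are made up for illustration, and `random_state=0` is an assumption for reproducibility. The important detail is to call `transform` (not `fit_transform`) on the new text so that it is mapped onto the vocabulary learned from the original documents.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)

model = KMeans(n_clusters=2, init='k-means++', max_iter=100,
               n_init=1, random_state=0)
model.fit(X)

# Hypothetical new documents, used only for illustration
new_docs = ["user interface response time", "graph of binary trees"]

# transform() reuses the fitted vocabulary and IDF weights
Y = vectorizer.transform(new_docs)
prediction = model.predict(Y)

for doc, cluster in zip(new_docs, prediction):
    print(cluster, doc)
```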

For a more visual idea of how this looks, see this answer.

elyase answered Sep 22 '22 09:09
