
Using predict on new text with kmeans (sklearn)?

I have a very small list of short strings. I want to (1) cluster them and (2) use the resulting model to predict which cluster a new string belongs to.

The first part works fine; getting a prediction for a new string does not.

First Part

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# List of definitions to cluster
documents_lst = ['a small, narrow river',
                'a continuous flow of liquid, air, or gas',
                'a continuous flow of data or instructions, typically one having a constant or predictable rate.',
                'a group in which schoolchildren of the same age and ability are taught',
                '(of liquid, air, gas, etc.) run or flow in a continuous current in a specified direction',
                'transmit or receive (data, especially video and audio material) over the Internet as a steady, continuous flow.',
                'put (schoolchildren) in groups of the same age and ability to be taught together',
                'a natural body of running water flowing on or under the earth']


# 1. Vectorize the text
tfidf_vectorizer  = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents_lst)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)

# 2. Get the number of clusters to make .. (find a better way than random)
num_clusters = 3

# 3. Cluster the definitions
km = KMeans(n_clusters=num_clusters, init='k-means++').fit(tfidf_matrix)

clusters = km.labels_.tolist()

print(clusters)

Which returns:

tfidf_matrix.shape:  (8, 39)
[0, 1, 0, 2, 1, 0, 2, 0]
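
For step 2 above, one possible alternative to hard-coding `num_clusters` is to fit KMeans for several values of k and compare silhouette scores. This is only a sketch and not part of the original code: the names `scores` and `best_k` and the fixed `random_state=0` are assumptions made for illustration, and with only eight short documents the scores are likely to be noisy.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

documents_lst = ['a small, narrow river',
                 'a continuous flow of liquid, air, or gas',
                 'a continuous flow of data or instructions, typically one having a constant or predictable rate.',
                 'a group in which schoolchildren of the same age and ability are taught',
                 '(of liquid, air, gas, etc.) run or flow in a continuous current in a specified direction',
                 'transmit or receive (data, especially video and audio material) over the Internet as a steady, continuous flow.',
                 'put (schoolchildren) in groups of the same age and ability to be taught together',
                 'a natural body of running water flowing on or under the earth']

tfidf_matrix = TfidfVectorizer(stop_words='english').fit_transform(documents_lst)

# The silhouette score is only defined for 2 <= k <= n_samples - 1.
scores = {}
for k in range(2, len(documents_lst)):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(tfidf_matrix)
    scores[k] = silhouette_score(tfidf_matrix, km.labels_)

# Pick the k with the highest average silhouette (hypothetical selection rule).
best_k = max(scores, key=scores.get)
print(scores)
print('best_k:', best_k)
```

A higher silhouette score suggests tighter, better separated clusters, but on a corpus this small it is worth reading the resulting clusters by eye as well.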

Second Part

The failing part:

predict_doc = ['A stream is a body of water with a current, confined within a bed and banks.']

tfidf_vectorizer  = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(predict_doc)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)

km.predict(tfidf_matrix)

The error:

ValueError: Incorrect number of features. Got 7 features, expected 39

FWIW: I somewhat understand that the training data and the prediction data end up with a different number of features after vectorizing ...
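
The mismatch can be made visible by comparing the vocabularies that the two vectorizers learn. The sketch below is diagnostic only; the names `train_vectorizer`, `new_vectorizer`, and `shared` are introduced here for illustration. The question reports 39 features for the training documents and 7 for the new one, though exact counts may depend on the scikit-learn version.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

documents_lst = ['a small, narrow river',
                 'a continuous flow of liquid, air, or gas',
                 'a continuous flow of data or instructions, typically one having a constant or predictable rate.',
                 'a group in which schoolchildren of the same age and ability are taught',
                 '(of liquid, air, gas, etc.) run or flow in a continuous current in a specified direction',
                 'transmit or receive (data, especially video and audio material) over the Internet as a steady, continuous flow.',
                 'put (schoolchildren) in groups of the same age and ability to be taught together',
                 'a natural body of running water flowing on or under the earth']
predict_doc = ['A stream is a body of water with a current, confined within a bed and banks.']

# Each fit() learns its own vocabulary, and therefore its own column layout.
train_vectorizer = TfidfVectorizer(stop_words='english').fit(documents_lst)
new_vectorizer = TfidfVectorizer(stop_words='english').fit(predict_doc)

n_train = len(train_vectorizer.vocabulary_)
n_new = len(new_vectorizer.vocabulary_)
print('features learned from training documents:', n_train)
print('features learned from the new document:  ', n_new)

# Only the terms present in both vocabularies carry over between the two spaces.
shared = set(train_vectorizer.vocabulary_) & set(new_vectorizer.vocabulary_)
print('shared terms:', sorted(shared))
```

Because KMeans was fitted on the columns of the first vocabulary, a matrix built from the second vocabulary has the wrong width, and its columns would refer to different words even if the widths happened to match.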

I am open to any solution including changing from kmeans to an algorithm more suitable for short text clustering.

Thanks in advance

Itay Livni asked Mar 16 '17 05:03


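1 Answer

For completeness, I will answer my own question with an answer from here, which does not answer that question but does answer mine. The key is to fit the vectorizer once on the training data and then reuse it with transform() on the new text, instead of fitting a second vectorizer: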

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

list1 = ["My name is xyz", "My name is pqr", "I work in abc"]
list2 = ["My name is xyz", "I work in abc"]

# decode_error replaces the removed charset_error argument;
# min_df=1 (the default) keeps the same terms as the original min_df=0
vectorizer = TfidfVectorizer(min_df=1, max_df=0.5, stop_words="english", decode_error="ignore", ngram_range=(1, 3))
vec = vectorizer.fit(list1)   # train vec using list1
vectorized = vec.transform(list1)   # transform list1 using vec

# precompute_distances and n_jobs were removed from KMeans in later scikit-learn releases
km = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=1000, tol=0.0001, verbose=0, random_state=None)

km.fit(vectorized)
list2Vec = vec.transform(list2)  # transform list2 using vec
km.predict(list2Vec)

The credit goes to @IrshadBhat
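
Applied to the documents from the question, the same idea might look like the sketch below. It assumes `random_state=0` and `n_init=10` purely for reproducibility, and the names `predict_matrix`, `predicted`, `pipe`, and `pipe_predicted` are introduced here for illustration. The second half shows an optional variant using `make_pipeline`, which keeps the vectorizer and the clusterer together so the new text cannot be vectorized with the wrong vocabulary.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

documents_lst = ['a small, narrow river',
                 'a continuous flow of liquid, air, or gas',
                 'a continuous flow of data or instructions, typically one having a constant or predictable rate.',
                 'a group in which schoolchildren of the same age and ability are taught',
                 '(of liquid, air, gas, etc.) run or flow in a continuous current in a specified direction',
                 'transmit or receive (data, especially video and audio material) over the Internet as a steady, continuous flow.',
                 'put (schoolchildren) in groups of the same age and ability to be taught together',
                 'a natural body of running water flowing on or under the earth']
predict_doc = ['A stream is a body of water with a current, confined within a bed and banks.']

# Fit the vectorizer once, on the training documents only.
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents_lst)

km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(tfidf_matrix)

# Reuse the fitted vectorizer: transform(), not fit_transform().
predict_matrix = tfidf_vectorizer.transform(predict_doc)
predicted = km.predict(predict_matrix)
print('predict_matrix.shape: ', predict_matrix.shape)
print('predicted cluster: ', predicted)

# Optional variant: a pipeline accepts raw strings for both fit and predict.
pipe = make_pipeline(TfidfVectorizer(stop_words='english'),
                     KMeans(n_clusters=3, n_init=10, random_state=0))
pipe.fit(documents_lst)
pipe_predicted = pipe.predict(predict_doc)
print('pipeline prediction: ', pipe_predicted)
```

Words in the new string that were not seen during fitting are simply ignored by `transform()`, so the prediction rests only on the terms the new text shares with the training vocabulary.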

Itay Livni answered Oct 13 '22 23:10