Using predict on new text with kmeans (sklearn)?

First Part

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# List of definitions to cluster
documents_lst = ['a small, narrow river',
                'a continuous flow of liquid, air, or gas',
                'a continuous flow of data or instructions, typically one having a constant or predictable rate.',
                'a group in which schoolchildren of the same age and ability are taught',
                '(of liquid, air, gas, etc.) run or flow in a continuous current in a specified direction',
                'transmit or receive (data, especially video and audio material) over the Internet as a steady, continuous flow.',
                'put (schoolchildren) in groups of the same age and ability to be taught together',
                'a natural body of running water flowing on or under the earth']


# 1. Vectorize the text
tfidf_vectorizer  = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents_lst)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)

# 2. Set the number of clusters (TODO: find a better way than picking it arbitrarily)
num_clusters = 3

# 3. Cluster the definitions
km = KMeans(n_clusters=num_clusters, init='k-means++').fit(tfidf_matrix)

clusters = km.labels_.tolist()

print(clusters)

Which returns:

tfidf_matrix.shape:  (8, 39)
[0, 1, 0, 2, 1, 0, 2, 0]
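The cluster count above is chosen by hand. One common heuristic, sketched here as an option rather than a definitive method, is to fit KMeans for several candidate values and keep the one with the highest silhouette score. The candidate range `range(2, 6)`, `n_init=10`, and `random_state=0` are assumptions made for this illustration, and with only eight short documents the scores are likely to be noisy.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

documents_lst = ['a small, narrow river',
                 'a continuous flow of liquid, air, or gas',
                 'a continuous flow of data or instructions, typically one having a constant or predictable rate.',
                 'a group in which schoolchildren of the same age and ability are taught',
                 '(of liquid, air, gas, etc.) run or flow in a continuous current in a specified direction',
                 'transmit or receive (data, especially video and audio material) over the Internet as a steady, continuous flow.',
                 'put (schoolchildren) in groups of the same age and ability to be taught together',
                 'a natural body of running water flowing on or under the earth']

tfidf_matrix = TfidfVectorizer(stop_words='english').fit_transform(documents_lst)

# Score each candidate cluster count (the range is an assumption for this sketch)
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, init='k-means++', n_init=10,
                    random_state=0).fit_predict(tfidf_matrix)
    scores[k] = silhouette_score(tfidf_matrix, labels)

# Keep the candidate with the highest silhouette score
best_k = max(scores, key=scores.get)
print(scores)
print('best_k:', best_k)
```

A higher silhouette score suggests tighter, better separated clusters, but it is only a guide; inspecting the resulting groups by eye is still worthwhile on data this small.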

Second Part

The failing part:

predict_doc = ['A stream is a body of water with a current, confined within a bed and banks.']

tfidf_vectorizer  = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(predict_doc)
print('tfidf_matrix.shape: ', tfidf_matrix.shape)

km.predict(tfidf_matrix)

The error:

ValueError: Incorrect number of features. Got 7 features, expected 39

FWIW: I somewhat understand that the training and prediction inputs end up with a different number of features after vectorizing ...

I am open to any solution including changing from kmeans to an algorithm more suitable for short text clustering.

Thanks in advance

asked Mar 16 '17 05:03

Itay Livni

1 Answer

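For completeness, I will answer my own question with an answer from here. It does not answer the question it was posted under, but it does answer mine: fit the vectorizer once on the training text, then reuse that same fitted vectorizer with transform() on the new text.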

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

list1 = ["My name is xyz", "My name is pqr", "I work in abc"]
list2 = ["My name is xyz", "I work in abc"]

# decode_error replaces the old charset_error argument; min_df=1 is the smallest valid integer value
vectorizer = TfidfVectorizer(min_df=1, max_df=0.5, stop_words="english", decode_error="ignore", ngram_range=(1, 3))
vec = vectorizer.fit(list1)   # train vec using list1
vectorized = vec.transform(list1)   # transform list1 using vec

# precompute_distances and n_jobs are omitted: newer scikit-learn releases no longer accept them
km = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=1000, tol=0.0001, verbose=0, random_state=None)

km.fit(vectorized)
list2Vec = vec.transform(list2)  # transform list2 using the same fitted vec, so the feature count matches
km.predict(list2Vec)

The credit goes to @IrshadBhat
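
Applied to the definitions from the question, the same pattern could look roughly like the sketch below. The only change from the failing code is that the new text goes through `transform()` on the vectorizer that was already fit on the training documents, instead of a fresh `fit_transform()`. `n_init=10` and `random_state=0` are assumptions added here for repeatable runs; they are not part of the original code.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents_lst = ['a small, narrow river',
                 'a continuous flow of liquid, air, or gas',
                 'a continuous flow of data or instructions, typically one having a constant or predictable rate.',
                 'a group in which schoolchildren of the same age and ability are taught',
                 '(of liquid, air, gas, etc.) run or flow in a continuous current in a specified direction',
                 'transmit or receive (data, especially video and audio material) over the Internet as a steady, continuous flow.',
                 'put (schoolchildren) in groups of the same age and ability to be taught together',
                 'a natural body of running water flowing on or under the earth']

predict_doc = ['A stream is a body of water with a current, confined within a bed and banks.']

# Fit the vectorizer once, on the training documents only
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(documents_lst)

# n_init and random_state are assumptions added for reproducibility
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(tfidf_matrix)

# Reuse the fitted vectorizer: transform(), not fit_transform()
new_matrix = tfidf_vectorizer.transform(predict_doc)
print('new_matrix.shape: ', new_matrix.shape)

predicted = km.predict(new_matrix)
print(predicted)
```

Words in the new text that the vectorizer never saw during fitting (such as "stream" here) are simply ignored, so the prediction rests only on the overlapping vocabulary.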

answered Oct 13 '22 23:10

Itay Livni

Tags:

python-3.x

nlp

k-means

scikit-learn
