Clustering results change on every run in Python scikit-learn

Tags: python, cluster-analysis, k-means, scikit-learn, spectral

I have a bunch of sentences and I want to cluster them using scikit-learn's spectral clustering. The code runs without errors, but every time I run it I get different results. I know the problem is related to initialization, but I don't know how to fix it. This is the part of my code that runs on the sentences:

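from sklearn import cluster
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.neighbors import kneighbors_graph

# tokenize is a user-defined tokenizer (not shown); charset_error was renamed decode_error in later releases
vectorizer = TfidfVectorizer(norm='l2', sublinear_tf=True, tokenizer=tokenize, stop_words='english', decode_error="ignore", ngram_range=(1, 5), min_df=1)
X = vectorizer.fit_transform(data)
# connectivity matrix for structured Ward (not used by SpectralClustering below)
connectivity = kneighbors_graph(X, n_neighbors=5)
# make connectivity symmetric
connectivity = 0.5 * (connectivity + connectivity.T)
distances = euclidean_distances(X)
spectral = cluster.SpectralClustering(n_clusters=number_of_k, eigen_solver='arpack', affinity="nearest_neighbors", assign_labels="discretize")
spectral.fit(X)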

Here data is a list of sentences. Every time the code runs, my clustering results differ. How can I get consistent results with spectral clustering? I have the same problem with k-means. This is my k-means code:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words='english', decode_error="ignore")
X_data = vectorizer.fit_transform(data)
km = KMeans(n_clusters=number_of_k, init='k-means++', max_iter=100, n_init=1, verbose=0)
km.fit(X_data)

I appreciate your help.

asked Sep 18 '14 20:09

user3430235

2 Answers

When using k-means, you want to set the random_state parameter in KMeans (see the documentation). Set this to either an int or a RandomState instance.

km = KMeans(n_clusters=number_of_k, init='k-means++', 
            max_iter=100, n_init=1, verbose=0, random_state=3425)
km.fit(X_data)

This is important because k-means is not a deterministic algorithm. It usually starts with some randomized initialization procedure, and this randomness means that different runs will start at different points. Seeding the pseudo-random number generator ensures that this randomness will always be the same for identical seeds.
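To make the role of the seed concrete, here is a minimal pure-Python sketch of k-means on one-dimensional data. It is not scikit-learn's implementation; the function name kmeans, the toy points, and the use of random.Random are assumptions made purely for illustration. The only source of randomness is the choice of initial centers, so two runs with the same seed should end at the same centers.

```python
import random


def kmeans(points, k, seed=None, n_iter=20):
    """Tiny 1-D k-means with randomly chosen initial centers (illustration only)."""
    rng = random.Random(seed)        # seed=None lets the generator seed itself
    centers = rng.sample(points, k)  # randomized initialization
    for _ in range(n_iter):
        # Assignment step: put each point in the group of its nearest center.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group
        # (an empty group keeps its previous center).
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return sorted(centers)


points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9, 9.0, 9.1, 8.7]
run_a = kmeans(points, k=3, seed=3425)
run_b = kmeans(points, k=3, seed=3425)
print(run_a == run_b)  # same seed, same starting centers, same result
```

Passing seed=None instead lets the generator seed itself from the operating system, so the starting centers can differ between runs, which mirrors leaving random_state unset in KMeans.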

I'm not sure about the spectral clustering example, though. From the documentation on the random_state parameter: "A pseudo random number generator used for the initialization of the lobpcg eigen vectors decomposition when eigen_solver == 'amg' and by the K-Means initialization." The OP's code (eigen_solver='arpack', assign_labels="discretize") doesn't seem to fall under either of those cases, though setting the parameter might be worth a shot.
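A quick empirical check is one way to find out whether the seed helps with the settings from the question. The sketch below fits SpectralClustering twice with the same random_state and compares the labels. The synthetic make_blobs data stands in for the TF-IDF matrix, and the helper name fit_labels, the sample size, and the seed value are assumptions for illustration. Whether the seed covers every stochastic step may depend on the installed scikit-learn version, so treat the printed result as evidence for your setup rather than a guarantee.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in for the TF-IDF matrix X from the question (assumption).
X, _ = make_blobs(n_samples=90, centers=3, cluster_std=2.0, random_state=0)


def fit_labels(seed):
    """Fit spectral clustering with the question's settings and a fixed seed."""
    model = SpectralClustering(
        n_clusters=3,
        eigen_solver="arpack",
        affinity="nearest_neighbors",
        assign_labels="discretize",
        random_state=seed,
    )
    return model.fit_predict(X)


labels_a = fit_labels(42)
labels_b = fit_labels(42)
same = np.array_equal(labels_a, labels_b)
print("identical labels with the same seed:", same)
```

If the two label arrays match on your data as well, passing random_state to SpectralClustering is enough to make the runs repeatable.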

answered Sep 28 '22 03:09

Roger Fan

As the others already noted, k-means is usually implemented with randomized initialization. It is intentional that you can get different results.

The algorithm is only a heuristic. It may yield suboptimal results. Running it multiple times gives you a better chance of finding a good result.
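A minimal sketch of the restart idea, assuming synthetic make_blobs data and an arbitrary choice of five seeds: each run starts from a different random initialization, and the run with the lowest inertia (within-cluster sum of squared distances) is kept. KMeans does the same thing internally when n_init is greater than 1; the explicit loop here only makes the individual runs visible.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data for illustration (assumption, not the sentences from the question).
X, _ = make_blobs(n_samples=300, centers=5, random_state=1)

inertias = []
for seed in range(5):
    # One run per seed, each with a single random initialization.
    km = KMeans(n_clusters=5, init="random", n_init=1, max_iter=100,
                random_state=seed)
    km.fit(X)
    inertias.append(km.inertia_)

best = min(inertias)  # lower inertia means tighter clusters
print("inertia per run:", inertias)
print("best inertia:", best)
```

Fixing the seeds keeps the whole procedure repeatable while still exploring several starting points.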

In my opinion, when the results vary highly from run to run, this indicates that the data just does not cluster well with k-means at all. Your results are not much better than random in such a case. If the data is really suited for k-means clustering, the results will be rather stable! If they vary, the clusters may not have the same size, or may not be well separated, and other algorithms may yield better results.
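One way to put a number on that stability is to cluster the same data with several seeds and measure how much the labelings agree. The sketch below uses the adjusted Rand index for that; the choice of metric, the helper name stability, and both synthetic datasets are my assumptions, not something prescribed above. A score near 1 means the runs agree almost perfectly, while lower scores mean the partition depends on the initialization.

```python
from itertools import combinations

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score


def stability(X, k, seeds=range(5)):
    """Mean pairwise adjusted Rand index between runs with different seeds."""
    runs = [KMeans(n_clusters=k, n_init=1, random_state=s).fit_predict(X)
            for s in seeds]
    scores = [adjusted_rand_score(a, b) for a, b in combinations(runs, 2)]
    return float(np.mean(scores))


# Three tight, far-apart blobs: data that suits k-means.
centers = [(-10, -10), (0, 10), (10, -10)]
blobs, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                      random_state=0)
# Uniform noise: no cluster structure to recover.
noise = np.random.RandomState(0).uniform(-10, 10, size=(300, 2))

blob_score = stability(blobs, k=3)
noise_score = stability(noise, k=3)
print("well-separated blobs:", blob_score)
print("uniform noise:", noise_score)
```

The adjusted Rand index ignores how the cluster labels are numbered, so two runs that find the same groups under different label ids still count as agreeing.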

answered Sep 28 '22 02:09

Has QUIT--Anony-Mousse
