I am running K-means clustering on about 400K observations with 12 variables. As soon as I run the cell with the KMeans code, a message pops up after about 2 minutes saying the kernel was interrupted and will restart, and then the notebook hangs as if the kernel has died and the code never runs.
So I tried with 125K observations and the same number of variables, but I still get the same message.
What does that mean? Does it mean the IPython notebook is not able to run k-means on 125K observations and kills the kernel?
How can I solve this? It is pretty important for me to get this done today. :(
Please advise.
Code I used:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Initialize the clusterer with n_clusters value and a random generator
# seed of 10 for reproducibility.
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=100)
kmeans.fit(Data_sampled.iloc[:, 1:])
cluster_labels = kmeans.labels_
# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed
# clusters
silhouette_avg = silhouette_score(Data_sampled.iloc[:, 1:], cluster_labels)
A kernel error basically occurs when Jupyter fails to connect to a specific version of Python, for example when a Python 3 notebook is opened from the wrong directory or environment. Keep in mind that Jupyter and Python are two entirely different pieces of software: the notebook front end talks to a separate Python kernel process, and a kernel error means that connection has failed.
You can also go to "File" > "Close and Halt" within Jupyter if you still have the notebook open on your screen. Once you've closed the other notebooks, you can restart your dead kernel by going to "Kernel" > "Restart" within Jupyter.
You can restart your Jupyter kernel by simply clicking Kernel > Restart from the Jupyter menu. Note: this will reset your notebook and remove all variables or methods you've defined! Sometimes you'll notice that your notebook is still hanging even after you've restarted the kernel.
From some investigation, this likely has nothing to do with IPython Notebook / Jupyter. It seems this is an issue with sklearn, which traces back to an issue with numpy. See the related GitHub issues for sklearn here and here, and the underlying numpy issue here.
Ultimately, calculating the silhouette score requires computing a very large pairwise distance matrix, and for large numbers of rows that matrix takes up too much memory on your system. For instance, look at the memory pressure on my system (OSX, 8 GB RAM) during two runs of a similar calculation: the first spike is a silhouette score calculation with 10k records, and the second, longer plateau is with 40k records.
Per the related SO answer here, your kernel process is probably getting killed by the OS because it is taking too much memory.
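To get a rough sense of the scale (my own back-of-the-envelope numbers, not from the answer above): a dense float64 pairwise distance matrix for n samples needs roughly n² × 8 bytes, which is far beyond typical RAM at the sizes in the question.

# Illustrative estimate only: assumes a dense float64 distance matrix
# of shape (n, n); actual peak memory depends on the implementation.
def distance_matrix_gb(n_samples, bytes_per_value=8):
    return n_samples ** 2 * bytes_per_value / 1e9

print(distance_matrix_gb(125_000))  # ~125 GB for the 125K-row sample
print(distance_matrix_gb(400_000))  # ~1280 GB for the full 400K rows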
Ultimately, this is going to require some fixes in the underlying codebase for sklearn and/or numpy. Some options that you can try in the interim (one of them is sketched in code below):
Or, if you're smarter than me and have some free time, consider contributing a fix to sklearn and/or numpy. :)
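One interim workaround worth trying (not spelled out in the answer above, but supported by sklearn's silhouette_score API) is to compute the score on a random subsample of the data via the sample_size argument, which caps the size of the distance matrix that has to be built:

# Sketch only: sample_size and random_state are standard silhouette_score
# parameters; 10_000 is an example value, choose one that fits in memory.
from sklearn.metrics import silhouette_score

silhouette_avg = silhouette_score(
    Data_sampled.iloc[:, 1:],
    cluster_labels,
    sample_size=10_000,
    random_state=10,
)

The score then becomes an estimate based on the subsample rather than the full dataset, which is usually an acceptable trade-off when the alternative is a dead kernel.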