
IPython Notebook Kernel getting dead while running Kmeans

I am running K-means clustering on about 400K observations with 12 variables. As soon as I run the cell with the KMeans code, after about 2 minutes a message pops up saying the kernel was interrupted and will restart. After that it hangs for ages, as if the kernel has died, and the code won't run anymore.

So I tried with 125K observations and the same number of variables, but I still got the same message.

What does that mean? Does it mean IPython Notebook is not able to run KMeans on 125K observations and kills the kernel?

How can I solve this? It's pretty important for me to get this done today. :(

Please advise.

Code I used:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=100,
                    random_state=10)
    kmeans.fit(Data_sampled.ix[:, 1:])
    cluster_labels = kmeans.labels_

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters.
    silhouette_avg = silhouette_score(Data_sampled.ix[:, 1:], cluster_labels)
asked Sep 14 '15 by Baktaawar



1 Answer

From some investigation, this likely has nothing to do with IPython Notebook / Jupyter. It seems this is an issue with sklearn, which traces back to an issue with numpy. See the related sklearn GitHub issues here and here, and the underlying numpy issue here.

Ultimately, calculating the silhouette score requires calculating a very large pairwise distance matrix, and that distance matrix is taking up too much memory on your system for large numbers of rows. For instance, look at the memory pressure on my system (OS X, 8 GB RAM) during two runs of a similar calculation: the first spike is a silhouette score calculation with 10k records; the second ... plateau ... is with 40k records:

[image: memory pressure during the two silhouette score runs]
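
To put rough numbers on that: the silhouette computation builds (roughly) a dense n x n pairwise distance matrix of float64 values, so memory grows with the square of the sample count. A back-of-envelope sketch (the 8 bytes/value figure assumes float64; actual sklearn memory use differs in the details):

    def distance_matrix_gib(n_samples, bytes_per_value=8):
        # Approximate size of a dense n x n float64 distance matrix, in GiB.
        return n_samples ** 2 * bytes_per_value / float(2 ** 30)

    print(distance_matrix_gib(10000))    # ~0.7 GiB  -- the 10k run above
    print(distance_matrix_gib(125000))   # ~116 GiB  -- already far beyond 8 GB of RAM
    print(distance_matrix_gib(400000))   # ~1192 GiB -- over a terabyte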

Per the related SO answer here, your kernel process is probably getting killed by the OS because it is taking too much memory.
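
One way to confirm this on your own machine is to watch the kernel process's resident memory around the expensive call. Here is a minimal sketch using the third-party psutil package (not part of sklearn; you would need to pip install psutil), reusing the names from the question:

    import os
    import psutil  # third-party: pip install psutil

    def rss_gib():
        # Resident memory of the current process (the notebook kernel), in GiB.
        return psutil.Process(os.getpid()).memory_info().rss / float(2 ** 30)

    print("before silhouette_score: %.2f GiB" % rss_gib())
    silhouette_avg = silhouette_score(Data_sampled.ix[:, 1:], cluster_labels)
    print("after silhouette_score:  %.2f GiB" % rss_gib())

If the kernel is killed mid-call, the second print never runs, which is itself a strong hint that the OS stepped in because memory ran out.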

Ultimately, this is going to require some fixes in the underlying codebase for sklearn and/or numpy. Some options that you can try in the interim:

  • close every extraneous program running on your computer (Spotify, Slack, etc.), hope that frees up enough memory, and monitor memory closely while your script is running
  • run the calculation on a temporary remote server with more RAM than your machine has and see if that helps (although since memory use grows roughly quadratically with the number of samples, this may not get you all the way to 400K rows)
  • fit the clusterer on your full data set, but calculate the silhouette score on a random subset of your data; most people seem to be able to get this working with 20-30k observations (see the sketch at the end of this answer)

Or, if you're smarter than me and have some free time, consider contributing a fix to sklearn and/or numpy :)
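
For the third option above, note that silhouette_score accepts a sample_size argument that scores a random subset for you. A minimal sketch, reusing the names from the question (the 20,000 sample size and the random_state value are just illustrative):

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Fit on the full data set, exactly as before.
    kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=100,
                    random_state=10)
    kmeans.fit(Data_sampled.ix[:, 1:])   # use .iloc on newer pandas
    cluster_labels = kmeans.labels_

    # Score the silhouette on a random 20k subset, so the pairwise distance
    # matrix stays around 20,000 x 20,000 instead of 400,000 x 400,000.
    silhouette_avg = silhouette_score(Data_sampled.ix[:, 1:], cluster_labels,
                                      sample_size=20000, random_state=10)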

answered Nov 04 '22 by Owen