I am running K-means clustering on about 400K observations with 12 variables. As soon as I run the cell with the KMeans code, a message pops up after about 2 minutes saying the kernel was interrupted and will restart, and then the notebook hangs as if the kernel has died and the code never runs.
So I tried with 125K observations and the same number of variables, but I still get the same message.
What does that mean? Does it mean the IPython notebook is not able to run k-means on 125K observations and kills the kernel?
How can I solve this? It is pretty important for me to get this done today. :(
Please advise.
Code I used:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Initialize the clusterer with n_clusters value and a random generator
# seed of 10 for reproducibility.
kmeans = KMeans(n_clusters=2, init='k-means++', n_init=10, max_iter=100)
kmeans.fit(Data_sampled.iloc[:, 1:])
cluster_labels = kmeans.labels_
# The silhouette_score gives the average value for all the samples.
# This gives a perspective into the density and separation of the formed
# clusters
silhouette_avg = silhouette_score(Data_sampled.iloc[:, 1:], cluster_labels)
A kernel error basically occurs when Jupyter fails to connect to a specific version of Python, for example when a Python 3 notebook is opened from the wrong directory or environment. Keep in mind that Jupyter and Python are two entirely different pieces of software: the notebook front end talks to a separate Python kernel process, and a kernel error means that connection has failed.
You can also go to "File" > "Close and Halt" within Jupyter if you still have the notebook open on your screen. Once you've closed the other notebooks, you can restart your dead kernel by going to "Kernel" > "Restart" within Jupyter.
You can restart your Jupyter kernel by simply clicking Kernel > Restart from the Jupyter menu. Note: this will reset your notebook and remove all variables or methods you've defined! Sometimes you'll notice that your notebook is still hanging even after you've restarted the kernel.
From some investigation, this likely has nothing to do with IPython Notebook / Jupyter. It seems this is an issue with sklearn, which traces back to an issue with numpy. See the related GitHub issues for sklearn here and here, and the underlying numpy issue here.
Ultimately, calculating the silhouette score requires computing a very large pairwise distance matrix, and for large numbers of rows that matrix takes up too much memory on your system. For instance, look at the memory pressure on my system (OSX, 8 GB RAM) during two runs of a similar calculation: the first spike is a silhouette score calculation with 10k records, and the second, longer plateau is with 40k records.
Per the related SO answer here, your kernel process is probably getting killed by the OS because it is taking too much memory.
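To get a rough sense of the scale (my own back-of-the-envelope numbers, not from the answer above): a dense float64 pairwise distance matrix for n samples needs roughly n² × 8 bytes, which is far beyond typical RAM at the sizes in the question.

# Illustrative estimate only: assumes a dense float64 distance matrix
# of shape (n, n); actual peak memory depends on the implementation.
def distance_matrix_gb(n_samples, bytes_per_value=8):
    return n_samples ** 2 * bytes_per_value / 1e9

print(distance_matrix_gb(125_000))  # ~125 GB for the 125K-row sample
print(distance_matrix_gb(400_000))  # ~1280 GB for the full 400K rows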
Ultimately, this is going to require some fixes in the underlying codebase for sklearn and/or numpy. Some options that you can try in the interim (one of them is sketched in code below):
Or, if you're smarter than me and have some free time, consider contributing a fix to sklearn and/or numpy. :)
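One interim workaround worth trying (not spelled out in the answer above, but supported by sklearn's silhouette_score API) is to compute the score on a random subsample of the data via the sample_size argument, which caps the size of the distance matrix that has to be built:

# Sketch only: sample_size and random_state are standard silhouette_score
# parameters; 10_000 is an example value, choose one that fits in memory.
from sklearn.metrics import silhouette_score

silhouette_avg = silhouette_score(
    Data_sampled.iloc[:, 1:],
    cluster_labels,
    sample_size=10_000,
    random_state=10,
)

The score then becomes an estimate based on the subsample rather than the full dataset, which is usually an acceptable trade-off when the alternative is a dead kernel.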