 

Can I use autoencoder for clustering?

In the code below, they use an autoencoder for supervised clustering or classification, because they have data labels: http://amunategui.github.io/anomaly-detection-h2o/ But can I use an autoencoder to cluster data if I do not have labels? Regards

asked Nov 24 '16 by forever


2 Answers

The deep-learning autoencoder is always unsupervised learning. The "supervised" part of the article you link to is to evaluate how well it did.

The following example (taken from ch. 7 of my book, Practical Machine Learning with H2O, where I try all the H2O unsupervised algorithms on the same data set - please excuse the plug) takes 563 features and tries to encode them into just two hidden nodes.

# Train an autoencoder: compress the 563 feature columns into 2 hidden nodes
m <- h2o.deeplearning(
  x = 2:564, training_frame = tfidf,
  hidden = c(2), autoencoder = TRUE, activation = "Tanh"
)
# Extract the hidden-layer activations, one row per input row
f <- h2o.deepfeatures(m, tfidf, layer = 1)

The second command there extracts the hidden node activations. f is a data frame with two numeric columns, and one row for every row in the tfidf source data. I chose just two hidden nodes so that I could plot the clusters:

(Plot: simplest autoencoder output)

Results will change on each run. You can (maybe) get better results with stacked autoencoders (see the sketch below), or by using more hidden nodes (but then you cannot plot them). Here I felt the results were limited by the data.
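For instance, a stacked layout just adds symmetric hidden layers around the bottleneck. A minimal sketch, reusing the same tfidf frame (the 32-node layer sizes are illustrative, not tuned):

# Stacked autoencoder: 563 -> 32 -> 2 -> 32 -> 563
m2 <- h2o.deeplearning(
  x = 2:564, training_frame = tfidf,
  hidden = c(32, 2, 32), autoencoder = TRUE, activation = "Tanh"
)
# layer = 2 selects the 2-node bottleneck in the middle
f2 <- h2o.deepfeatures(m2, tfidf, layer = 2)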

BTW, I made the above plot with this code:

d <- as.matrix(f[1:30, ])  # Just the first 30 rows, to avoid over-cluttering
labels <- as.vector(tfidf[1:30, 1])
plot(d, pch = 17)          # pch = 17 draws triangles
text(d, labels, pos = 3)   # pos = 3 places each label above its point

(P.S. The original data came from Brandon Rose's excellent article on using NLTK.)

answered Oct 23 '22 by Darren Cook


In some respects, encoding data and clustering data share overlapping theory. As a result, you can use autoencoders to cluster (encode) data.

A simple example to visualize is a set of training data that you suspect contains two primary classes, such as voter-history data for Republicans and Democrats. If you use an autoencoder to encode the data into two dimensions and then plot it on a scatter plot, the clustering becomes much clearer. Below is a sample result from one of my models; you can see a noticeable split between the two classes, as well as a bit of expected overlap.

(Plot: clustered voter data)

The code can be found here.
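To sketch the same idea with H2O (to match the other answer's tooling), something like the following would do; votes is a hypothetical H2OFrame whose first column is the party label and whose remaining columns are the voting history:

# Hypothetical frame: column 1 = party label ("R"/"D"), columns 2+ = votes
ae <- h2o.deeplearning(
  x = 2:ncol(votes), training_frame = votes,
  hidden = c(2), autoencoder = TRUE, activation = "Tanh"
)
# Pull the 2-D encodings down into R and colour each point by party
enc <- as.data.frame(h2o.deepfeatures(ae, votes, layer = 1))
party <- as.vector(votes[, 1])
plot(enc, col = ifelse(party == "R", "red", "blue"), pch = 16)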

This method is not restricted to two classes; you can train on as many classes as you wish. Two polarized classes are just easier to visualize.

Nor is this method limited to two output dimensions; that was just for plotting convenience. In fact, you may find it difficult to map certain high-dimensional spaces meaningfully to such a small space.

In cases where the encoded (clustered) layer has a larger dimension, it is not as easy to "visualize" feature clusters. This is where it gets a bit more difficult, as you'll have to use some form of supervised learning to map the encoded (clustered) features to your training labels.

A couple of ways to determine which class the features belong to are to pump the data into a k-NN clustering algorithm, or, what I prefer to do, to take the encoded vectors and pass them to a standard backpropagation neural network (see the sketch below). Note that, depending on your data, you may find that just pumping the data straight into your backpropagation neural network is sufficient.
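In H2O terms, that second approach might look like the following sketch (reusing the hypothetical ae model and votes frame from above; the supervised layer sizes are arbitrary):

# Encode the data, then attach the known label and train a supervised net
enc <- h2o.cbind(h2o.deepfeatures(ae, votes, layer = 1), votes[, 1])
enc[, 3] <- as.factor(enc[, 3])  # make sure the label is categorical
clf <- h2o.deeplearning(
  x = 1:2, y = 3, training_frame = enc,
  hidden = c(10, 10)  # arbitrary small supervised network
)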

answered Oct 23 '22 by Kenny Cason