How can we use unsupervised learning techniques on a data-set, and then label the clusters?

Tags: machine-learning, neural-network, matlab, unsupervised-learning

First up, this is most certainly homework (so no full code samples please). That said...

I need to test an unsupervised algorithm next to a supervised algorithm, using the Neural Network toolbox in Matlab. The data set is the UCI Artificial Characters Database. The problem is, I've had a good tutorial on supervised algorithms, and been left to sink on unsupervised.

So I know how to create a self organising map using selforgmap, and then I train it using train(net, trainingSet). I don't understand what to do next. I know that it's clustered the data I gave it into (hopefully) 10 clusters (one for each letter).

Two questions then:

- How can I then label the clusters (given that I have a comparison pattern)?
  - Am I trying to turn this into a supervised learning problem when I do this?
- How can I create a confusion matrix on (another) testing set to compare to the supervised algorithm?

I think I'm missing something conceptual or jargon-based here - all my searches come up with supervised learning techniques. A point in the right direction would be much appreciated. My existing code is below:

P = load('-ascii', 'pattern');
T = load('-ascii', 'target');

% data needs to be translated
P = P';
T = T';

T = T(find(sum(T')), :);

mynet = selforgmap([10 10]);
mynet.trainparam.epochs = 5000;
mynet = train(mynet, P);


P = load('-ascii', 'testpattern');
T = load('-ascii', 'testtarget');

P = P';
T = T';
T = T(find(sum(T')), :);

Y = sim(mynet,P);
Z = compet(Y);

% this gives me a confusion matrix for supervised techniques:
C = T*Z'


asked Oct 09 '12 03:10

Hotchips

2 Answers

Since you don't use any of the labelled data during training, you are applying an unsupervised method by definition.

"How can I then label the clusters (given that I have a comparison pattern)?"

You can try different permutations of the label set and keep the one that minimizes the average error (or, equivalently, maximizes the accuracy) on the comparison pattern. With clustering, you can label your clusters in any way you like. Think of it as trying different label assignments until you optimize a specified performance metric.

"Am I trying to turn this into a supervised learning problem when I do this?"

It depends. If you explicitly use (known) data points in the process of clustering, then this is semi-supervised. If not, you merely use the labelling information to evaluate the clustering and "compare" it with supervised approaches. That is a form of supervision, but it is based not on a training set but on the best-case expected performance (i.e. an "agent" assigns the correct labels to the clusters).

"How can I create a confusion matrix on (another) testing set to compare to the supervised algorithm?"

You need a way to turn clusters into labelled classes. For a small number of clusters (e.g. C <= 5), you could simply enumerate all C! possible cluster-to-label assignments, build the confusion matrix for each, and keep the one that minimizes your average classification error. In your case, however, with C = 10, this means 10! = 3,628,800 candidate matrices, which is an impractical overhead.
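To make the enumeration idea concrete, here is a minimal sketch for a small case. The toy vectors trueClass and clusterId are made up for illustration, and the variable names are assumptions rather than anything from the question's code. It uses only base MATLAB functions (accumarray, perms, sub2ind).

```matlab
% Toy data (hypothetical): true class and cluster index for 8 samples, C = 3.
trueClass = [1 1 1 2 2 3 3 3];
clusterId = [2 2 3 1 1 3 3 3];   % cluster numbering is arbitrary
C = 3;

% Contingency table: N(i,j) = number of samples of class i in cluster j.
N = accumarray([trueClass(:) clusterId(:)], 1, [C C]);

% Try every one-to-one assignment of clusters to classes (C! candidates).
candidates = perms(1:C);         % each row p pairs class i with cluster p(i)
bestCorrect = -1;
bestPerm = [];
for k = 1:size(candidates, 1)
    p = candidates(k, :);
    % Number of samples counted as correct under this assignment.
    correct = sum(N(sub2ind([C C], 1:C, p)));
    if correct > bestCorrect
        bestCorrect = correct;
        bestPerm = p;
    end
end

% Reorder columns so that column i holds the cluster labelled as class i.
confusion = N(:, bestPerm);
accuracy = bestCorrect / numel(trueClass);
```

The loop visits C! rows, so this only scales to a handful of clusters. For larger C the same problem can be posed as a linear assignment problem, for which the Hungarian algorithm is the classic solver.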

As alternatives, you can label the clusters (and thus obtain confusion matrices) using:

- Semi-supervised approaches, where the clusters are labelled a priori, or where clustering is guided through a seeding process by data points belonging to known clusters/classes.
- Ranking or measuring distances between the estimated cluster centroids and the ground-truth labels. This assigns the closest-ranked or most similar label to each cluster.
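One simple way to realise the "most similar label" idea is majority voting: each cluster takes the class that most of its labelled training samples belong to. This is a swapped-in illustration, not necessarily what the answer had in mind. The sketch below uses made-up toy vectors, and it also covers the case where there are more clusters than classes. In the question's setting, the cluster index of each sample could presumably be obtained with something like vec2ind(sim(mynet, P)), but that is an assumption and is not used in the code.

```matlab
% Toy data (hypothetical): class and cluster index per sample.
trainClass   = [1 1 1 2 2 2 3 3];
trainCluster = [4 4 1 2 2 4 3 3];   % 4 clusters for 3 classes
testClass    = [1 2 3 3 1];
testCluster  = [4 2 3 1 1];
numClasses  = 3;
numClusters = 4;

% votes(i,j) = number of training samples of class i that fell in cluster j.
votes = accumarray([trainClass(:) trainCluster(:)], 1, [numClasses numClusters]);

% Label each cluster with its most frequent training class.
% Note: a cluster with no training samples would default to class 1 here
% and should be handled separately in real use.
[~, clusterLabel] = max(votes, [], 1);

% Predict test classes by looking up the label of each sample's cluster.
predicted = clusterLabel(testCluster);

% Confusion matrix: rows are true classes, columns are predicted classes.
confusion = accumarray([testClass(:) predicted(:)], 1, [numClasses numClasses]);
accuracy = sum(diag(confusion)) / numel(testClass);
```

Because the labels are only used after clustering to name the clusters, the clustering itself stays unsupervised. The resulting confusion matrix has the same layout as one from a supervised classifier, so the two can be compared directly.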


answered Nov 10 '22 06:11

gevang

Could this video be of any help? It doesn't answer your question, but it shows that human interaction may be required even to select the number of clusters. Automatically labelling clusters is even harder.

If you think about it, there's no guarantee that the clustering will be based on the depicted number. The network might group digits based on the width of the line, the smoothing of the font, etc.

answered Nov 10 '22 07:11

Ivan Koblik
