How to discover new classes in a classification machine learning algorithm?

Tags: machine-learning, classification, svm

I'm using a multiclass classifier (a Support Vector Machine, via One-Vs-All) to classify data samples. Let's say I currently have n distinct classes.

However, in the scenario I'm facing, it is possible that a new data sample may belong to a new class n+1 that hasn't been seen before.

So I guess you can say that I need a form of Online Learning, as there is no distinct training set in the beginning that suits all data appearing later. Instead I need the SVM to adapt dynamically to new classes that may appear in the future.

So I'm wondering if and how I can...

1. identify that a new data sample does not quite fit into the existing classes but instead should result in creating a new class.
2. integrate that new class into the existing classifier.

I can vaguely think of a few ideas that might be approaches to solve this problem:

1. If none of the binary SVM classifiers (as I have one for each class in the OVA case) predicts a fairly high probability (e.g. > 0.5) for the new data sample, I could assume that this new data sample may represent a new class.
2. I could train a new binary classifier for that new class and add it to the multiclass SVM.

However, these are just my naive thoughts. I'm wondering if there is some "proper" approach for this instead, e.g. using a clustering algorithm to find all classes.

Or maybe my approach of trying to use an SVM for this is not even appropriate for this kind of problem?

Help on this is greatly appreciated.

asked Dec 13 '15 17:12

Oliver

1 Answer

As in any other machine learning problem, if you do not have a quality criterion, you suck.

When people say "classification", they have supervised learning in mind: there is some ground truth against which you can train and check your algorithms. If new classes can appear, this ground truth is ambiguous. Imagine one class is "horse", and you see many horses: black horses, brown horses, even white ones. And suddenly you see a zebra. Whoa! Is it a new class or just an unusual horse? The answer will depend on how you are going to use your class labels. The SVM itself cannot decide, because the SVM does not use these labels; it only produces them. The decision is up to a human (or to some decision-making algorithm which knows what is "good" and "bad", that is, has its own "loss function" or "utility function").

So you need a supervisor. But how can you assist this supervisor? Two options come to mind:

1. Anomaly detection. This can help you with early occurrences of new classes. After the very first zebra your algorithm sees, it can raise an alarm: "There is something unusual!". For example, sklearn offers several algorithms for detecting unusual observations, from isolation forests (a tree-ensemble method related to random forests) to one-class SVM. Then your supervisor can look at the flagged observations and decide whether they deserve to form an entirely new class.
2. Clustering. It can help you decide whether to split your classes. For example, after the first zebra, you decided it was not worth making a new class. But over time, your algorithm has accumulated dozens of zebra images. So if you run a clustering algorithm on all the observations labeled as "horses", you might end up with two well-separated clusters. And it will again be up to the supervisor to decide whether the striped horses should be detached from the plain ones into a new class.
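
To make the anomaly-detection option a bit more concrete, here is a minimal sketch with sklearn. It fits `IsolationForest` (sklearn's forest-based outlier detector) and `OneClassSVM` on the known class only, then flags new samples that look unusual. The 2-D "horse" features, the "zebra" point and the `nu` value are made up for illustration; in practice you would use your own feature vectors and tune the detectors.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)

# Hypothetical 2-D features for the known class ("horses").
horses = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# Hypothetical new observations: one typical horse, one far-away "zebra".
new_samples = np.array([[0.1, -0.2], [8.0, 8.0]])

# Fit each detector on the known data only (no labels needed).
detectors = {
    "isolation_forest": IsolationForest(random_state=0).fit(horses),
    "one_class_svm": OneClassSVM(nu=0.05, gamma="scale").fit(horses),
}

# predict() returns +1 for inliers and -1 for outliers.
flags = {name: det.predict(new_samples) for name, det in detectors.items()}

for name, pred in flags.items():
    for sample, label in zip(new_samples, pred):
        if label == -1:
            print(f"{name}: {sample} looks unusual, ask the supervisor")
```

The detectors only raise the alarm. Whether a flagged sample becomes a new class is still the supervisor's call.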

If you want this decision to be purely automatic, you can split classes if the ratio of within-cluster mean distance to between-cluster distance is low enough. But it will work well only if you have a good distance metric in the first place. And what is "good" is again defined by how you use your algorithms and what your ultimate goal is.
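
One possible way to sketch that automatic rule, assuming Euclidean distance, 2-cluster `KMeans`, and "within-cluster mean distance" taken as the mean distance of each point to its own cluster center. The helper `split_ratio`, the synthetic data and the `THRESHOLD` value are all hypothetical; a sensible threshold depends on your features and your goal.

```python
import numpy as np
from sklearn.cluster import KMeans


def split_ratio(X, random_state=0):
    """Fit 2-cluster KMeans and return (within / between ratio, labels)."""
    km = KMeans(n_clusters=2, n_init=10, random_state=random_state).fit(X)
    labels = km.labels_
    centers = km.cluster_centers_
    # Mean Euclidean distance from each point to its own cluster center.
    within = np.mean(np.linalg.norm(X - centers[labels], axis=1))
    # Euclidean distance between the two cluster centers.
    between = np.linalg.norm(centers[0] - centers[1])
    return within / between, labels


rng = np.random.RandomState(0)
plain = rng.normal(0.0, 1.0, size=(100, 2))     # hypothetical "plain horses"
striped = rng.normal(10.0, 1.0, size=(30, 2))   # hypothetical "striped horses"
mixed = np.vstack([plain, striped])

THRESHOLD = 0.3  # hypothetical cut-off, tune it for your own data

ratio_mixed, _ = split_ratio(mixed)   # two real groups: low ratio expected
ratio_plain, _ = split_ratio(plain)   # one real group: higher ratio expected

for name, ratio in [("mixed", ratio_mixed), ("plain", ratio_plain)]:
    decision = "split into two classes" if ratio < THRESHOLD else "keep one class"
    print(f"{name}: ratio={ratio:.2f} -> {decision}")
```

A low ratio means the two clusters are tight compared to how far apart they are, which is the case where a split is easiest to justify.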

answered Sep 21 '22 23:09

David Dale
