Can someone please explain (with example maybe) what is the difference between OneVsRestClassifier and MultiOutputClassifier in scikit-learn?
I've read the documentation, and I've understood that we use:

OneVsRestClassifier - when we want to do multiclass or multilabel classification; its strategy consists of fitting one classifier per class. For each classifier, the class is fitted against all the other classes.

The difference between multi-class classification and multi-label classification is that in multi-class problems the classes are mutually exclusive, whereas in multi-label problems each label represents a different classification task, but the tasks are somehow related.

I've already used OneVsRestClassifier for multilabel classification and I can understand how it works, but then I found MultiOutputClassifier and can't understand how it works differently from OneVsRestClassifier.
One-vs.-all provides a way to leverage binary classification. Given a classification problem with N possible solutions, a one-vs.-all solution consists of N separate binary classifiers: one binary classifier for each possible outcome.
One-Vs-Rest for Multi-Class Classification. One-vs-rest (OvR for short, also referred to as One-vs-All or OvA) is a heuristic method for using binary classification algorithms for multi-class classification. It involves splitting the multi-class dataset into multiple binary classification problems.
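To make the one-vs-rest idea concrete, here is a minimal sketch using scikit-learn's OneVsRestClassifier with a LogisticRegression base estimator; the tiny feature matrix and labels below are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Made-up toy data: six samples, two features, three mutually exclusive classes.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [0.0, 0.0], [2.0, 2.0], [2.0, 0.0]])
y = np.array([0, 1, 2, 0, 2, 1])

# One binary LogisticRegression is fitted per class (that class vs. the rest).
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)

print(len(clf.estimators_))  # -> 3, one underlying binary classifier per class
print(clf.predict(X).shape)  # -> (6,), one predicted class per sample
```

At prediction time, the classifier with the highest decision score among the N binary classifiers determines the predicted class.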
To better illustrate the differences, let us assume that your goal is to classify SO questions into n_classes different, mutually exclusive classes. For the sake of simplicity, in this example we will only consider four classes, namely 'Python', 'Java', 'C++' and 'Other language'. Let us assume that you have a dataset formed by just six SO questions, and the class labels of those questions are stored in an array y as follows:
import numpy as np

y = np.asarray(['Java', 'C++', 'Other language', 'Python', 'C++', 'Python'])
The situation described above is usually referred to as multiclass classification (also known as multinomial classification). In order to fit the classifier and validate the model through the scikit-learn library, you need to transform the text class labels into numerical labels. To accomplish that you could use LabelEncoder:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_numeric = le.fit_transform(y)
This is how the labels of your dataset are encoded:
In [220]: y_numeric
Out[220]: array([1, 0, 2, 3, 0, 3], dtype=int64)
where those numbers denote indices of the following array:
In [221]: le.classes_
Out[221]: array(['C++', 'Java', 'Other language', 'Python'], dtype='|S14')
An important particular case is when there are just two classes, i.e. n_classes = 2. This is usually called binary classification.
Let us now suppose that you wish to perform such multiclass classification using a pool of n_classes binary classifiers, where n_classes is the number of different classes. Each of these binary classifiers makes a decision on whether an item is of a specific class or not. In this case you cannot encode class labels as integer numbers from 0 to n_classes - 1; you need to create a 2-dimensional indicator matrix instead. Consider that sample n is of class k. Then the [n, k] entry of the indicator matrix is 1 and the rest of the elements in row n are 0. It is important to note that if the classes are not mutually exclusive there can be multiple 1's in a row. This approach is named multilabel classification and can be easily implemented through MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
y_indicator = mlb.fit_transform(y[:, None])
The indicator looks like this:
In [225]: y_indicator
Out[225]: 
array([[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1],
       [1, 0, 0, 0],
       [0, 0, 0, 1]])
and the column numbers where the 1's appear are indices into this array:

In [226]: mlb.classes_
Out[226]: array(['C++', 'Java', 'Other language', 'Python'], dtype=object)
What if you want to classify a particular SO question according to two different criteria simultaneously, for instance language and application? In this case you intend to do multioutput classification. For the sake of simplicity I will consider only three application classes, namely 'Computer Vision', 'Speech Recognition' and 'Other Application'. The label array of your dataset should be 2-dimensional:
y2 = np.asarray([['Java', 'Computer Vision'],
                 ['C++', 'Speech Recognition'],
                 ['Other language', 'Computer Vision'],
                 ['Python', 'Other Application'],
                 ['C++', 'Speech Recognition'],
                 ['Python', 'Computer Vision']])
Again, we need to transform the text class labels into numeric labels. As far as I know this functionality is not implemented in scikit-learn yet, so you will need to write your own code. This thread describes some clever ways to do that, but for the purposes of this post the following one-liner should suffice:
y_multi = np.vstack([le.fit_transform(y2[:, i]) for i in range(y2.shape[1])]).T
The encoded labels look like this:
In [229]: y_multi
Out[229]: 
array([[1, 0],
       [0, 2],
       [2, 0],
       [3, 1],
       [0, 2],
       [3, 0]], dtype=int64)
And the meaning of the values in each column can be inferred from the following arrays:
In [230]: le.fit(y2[:, 0]).classes_
Out[230]: array(['C++', 'Java', 'Other language', 'Python'], dtype='|S18')

In [231]: le.fit(y2[:, 1]).classes_
Out[231]: array(['Computer Vision', 'Other Application', 'Speech Recognition'], dtype='|S18')
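A 2-D integer target like y_multi is what MultiOutputClassifier is designed for: it fits one multiclass classifier per output column. A runnable sketch (the feature matrix X is again invented):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

y2 = np.asarray([['Java', 'Computer Vision'],
                 ['C++', 'Speech Recognition'],
                 ['Other language', 'Computer Vision'],
                 ['Python', 'Other Application'],
                 ['C++', 'Speech Recognition'],
                 ['Python', 'Computer Vision']])

# Encode each column of text labels independently.
le = LabelEncoder()
y_multi = np.vstack([le.fit_transform(y2[:, i]) for i in range(y2.shape[1])]).T

# Made-up features, one row per question.
X = np.arange(12, dtype=float).reshape(6, 2)

# One multiclass LogisticRegression is fitted per output column
# (one for language, one for application).
clf = MultiOutputClassifier(LogisticRegression()).fit(X, y_multi)

print(len(clf.estimators_))  # -> 2: one underlying classifier per output
print(clf.predict(X).shape)  # -> (6, 2): a language and an application per sample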
This is an extension to @tonechas' answer. Read that answer before reading this. OVR supports multilabel only when each label is a binary label/class (also called binary multi-label), i.e., the sample either belongs to that label or it doesn't. It will not work when the target is multioutput (also called multi-class multi-label), i.e. when each sample can belong to any one of several classes within a label. For the latter case, you need to use the scikit-learn MultiOutputClassifier.
In other words, sklearn OVR does not work when your target variable looks like this:
y_true = np.array([[2, 1, 0], [0, 2, 1], [1, 2, 4]])
where label1 has 4 classes [0, 1, 2, 3]; label2 has 3 classes [0, 1, 2]; label3 has 5 classes [0, 1, 2, 3, 4]. For example, the first sample belongs to class 2 in label1, class 1 in label2, and class 0 in label3. Think of it as the labels NOT being mutually exclusive, while the classes within each label are mutually exclusive.
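A minimal sketch, assuming some made-up features, showing that MultiOutputClassifier handles exactly this kind of multi-class multi-label target:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Each column is a separate task whose classes are mutually
# exclusive within that column (multi-class multi-label).
y_true = np.array([[2, 1, 0],
                   [0, 2, 1],
                   [1, 2, 4]])

# Invented features, one row per sample.
X = np.arange(6, dtype=float).reshape(3, 2)

# One multiclass classifier is fitted per label column.
clf = MultiOutputClassifier(LogisticRegression()).fit(X, y_true)

print(clf.predict(X).shape)  # -> (3, 3): one predicted class per label column
```

Feeding the same y_true to OneVsRestClassifier would not work, since it expects either a 1-D multiclass target or a binary indicator matrix.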
Sklearn OVR will work when,
y_true = np.array([[0, 1, 1], [0, 0, 1], [1, 1, 0]])
where label1, label2 and label3 have only 2 classes each. So a sample either belongs to that label or it doesn't. For example, the first sample belongs to label2 and label3.
I am sorry I couldn't find a real-world example for this kind of use case.