Can someone please explain (with example maybe) what is the difference between OneVsRestClassifier and MultiOutputClassifier in scikit-learn?
I've read the documentation, and I've understood that we use:

OneVsRestClassifier - when we want to do multiclass or multilabel classification; its strategy consists of fitting one classifier per class. For each classifier, the class is fitted against all the other classes.

The difference between multi-class classification and multi-label classification is that in multi-class problems the classes are mutually exclusive, whereas in multi-label problems each label represents a different classification task, but the tasks are somehow related.

I've already used OneVsRestClassifier for multilabel classification and I can understand how it works, but then I found MultiOutputClassifier and can't understand how it works differently from OneVsRestClassifier.
One-vs.-all provides a way to leverage binary classification. Given a classification problem with N possible solutions, a one-vs.-all solution consists of N separate binary classifiers: one binary classifier for each possible outcome.
One-Vs-Rest for Multi-Class Classification. One-vs-rest (OvR for short, also referred to as One-vs-All or OvA) is a heuristic method for using binary classification algorithms for multi-class classification. It involves splitting the multi-class dataset into multiple binary classification problems.
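To make the one-vs-rest idea concrete, here is a minimal sketch using scikit-learn's OneVsRestClassifier with a LogisticRegression base estimator; the tiny feature matrix and labels below are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Made-up toy data: six samples, two features, three mutually exclusive classes.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0],
              [0.0, 0.0], [2.0, 2.0], [2.0, 0.0]])
y = np.array([0, 1, 2, 0, 2, 1])

# One binary LogisticRegression is fitted per class (that class vs. the rest).
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)

print(len(clf.estimators_))  # -> 3, one underlying binary classifier per class
print(clf.predict(X).shape)  # -> (6,), one predicted class per sample
```

At prediction time, the classifier with the highest decision score among the N binary classifiers determines the predicted class.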
To better illustrate the differences, let us assume that your goal is to classify SO questions into n_classes different, mutually exclusive classes. For the sake of simplicity, in this example we will only consider four classes, namely 'Python', 'Java', 'C++' and 'Other language'. Let us assume that you have a dataset formed by just six SO questions, and the class labels of those questions are stored in an array y as follows:
import numpy as np

y = np.asarray(['Java', 'C++', 'Other language', 'Python', 'C++', 'Python'])
The situation described above is usually referred to as multiclass classification (also known as multinomial classification). In order to fit the classifier and validate the model through the scikit-learn library, you need to transform the text class labels into numerical labels. To accomplish that you could use LabelEncoder:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_numeric = le.fit_transform(y)
This is how the labels of your dataset are encoded:
In [220]: y_numeric
Out[220]: array([1, 0, 2, 3, 0, 3], dtype=int64)
where those numbers denote indices of the following array:
In [221]: le.classes_
Out[221]: array(['C++', 'Java', 'Other language', 'Python'], dtype='|S14')
An important particular case is when there are just two classes, i.e. n_classes = 2. This is usually called binary classification.
Let us now suppose that you wish to perform such multiclass classification using a pool of n_classes binary classifiers, where n_classes is the number of different classes. Each of these binary classifiers makes a decision on whether an item is of a specific class or not. In this case you cannot encode class labels as integer numbers from 0 to n_classes - 1; you need to create a 2-dimensional indicator matrix instead. Consider that sample n is of class k. Then the [n, k] entry of the indicator matrix is 1 and the rest of the elements in row n are 0. It is important to note that if the classes are not mutually exclusive there can be multiple 1's in a row. This approach is named multilabel classification and can be easily implemented through MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
y_indicator = mlb.fit_transform(y[:, None])
The indicator looks like this:
In [225]: y_indicator
Out[225]: 
array([[0, 1, 0, 0],
       [1, 0, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1],
       [1, 0, 0, 0],
       [0, 0, 0, 1]])
and the column numbers where the 1's appear are indices into this array:

In [226]: mlb.classes_
Out[226]: array(['C++', 'Java', 'Other language', 'Python'], dtype=object)
What if you want to classify a particular SO question according to two different criteria simultaneously, for instance language and application? In this case you intend to do multioutput classification. For the sake of simplicity I will consider only three application classes, namely 'Computer Vision', 'Speech Recognition' and 'Other Application'. The label array of your dataset should be 2-dimensional:
y2 = np.asarray([['Java', 'Computer Vision'],
                 ['C++', 'Speech Recognition'],
                 ['Other language', 'Computer Vision'],
                 ['Python', 'Other Application'],
                 ['C++', 'Speech Recognition'],
                 ['Python', 'Computer Vision']])
Again, we need to transform the text class labels into numeric labels. As far as I know this functionality is not implemented in scikit-learn yet, so you will need to write your own code. This thread describes some clever ways to do that, but for the purposes of this post the following one-liner should suffice:
y_multi = np.vstack([le.fit_transform(y2[:, i]) for i in range(y2.shape[1])]).T
The encoded labels look like this:
In [229]: y_multi
Out[229]: 
array([[1, 0],
       [0, 2],
       [2, 0],
       [3, 1],
       [0, 2],
       [3, 0]], dtype=int64)
And the meaning of the values in each column can be inferred from the following arrays:
In [230]: le.fit(y2[:, 0]).classes_
Out[230]: array(['C++', 'Java', 'Other language', 'Python'], dtype='|S18')

In [231]: le.fit(y2[:, 1]).classes_
Out[231]: array(['Computer Vision', 'Other Application', 'Speech Recognition'], dtype='|S18')
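A 2-D integer target like y_multi is what MultiOutputClassifier is designed for: it fits one multiclass classifier per output column. A runnable sketch (the feature matrix X is again invented):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

y2 = np.asarray([['Java', 'Computer Vision'],
                 ['C++', 'Speech Recognition'],
                 ['Other language', 'Computer Vision'],
                 ['Python', 'Other Application'],
                 ['C++', 'Speech Recognition'],
                 ['Python', 'Computer Vision']])

# Encode each column of text labels independently.
le = LabelEncoder()
y_multi = np.vstack([le.fit_transform(y2[:, i]) for i in range(y2.shape[1])]).T

# Made-up features, one row per question.
X = np.arange(12, dtype=float).reshape(6, 2)

# One multiclass LogisticRegression is fitted per output column
# (one for language, one for application).
clf = MultiOutputClassifier(LogisticRegression()).fit(X, y_multi)

print(len(clf.estimators_))  # -> 2: one underlying classifier per output
print(clf.predict(X).shape)  # -> (6, 2): a language and an application per sample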
This is an extension to @tonechas' answer. Read that answer before reading this. OVR supports multilabel only when each label is a binary label/class (also called binary multi-label), i.e., the sample either belongs to that label or it doesn't. It will not work when the target is multioutput (also called multi-class multi-label), i.e. when each sample can belong to any one of several classes within a label. For the latter case, you need to use the scikit-learn MultiOutputClassifier.
In other words, sklearn OVR does not work when your target variable looks like this:
y_true = np.array([[2, 1, 0], [0, 2, 1], [1, 2, 4]])
where label1 has 4 classes [0, 1, 2, 3]; label2 has 3 classes [0, 1, 2]; label3 has 5 classes [0, 1, 2, 3, 4]. For example, the first sample belongs to class 2 in label1, class 1 in label2, and class 0 in label3. Think of it as the labels NOT being mutually exclusive, while the classes within each label are mutually exclusive.
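A minimal sketch, assuming some made-up features, showing that MultiOutputClassifier handles exactly this kind of multi-class multi-label target:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Each column is a separate task whose classes are mutually
# exclusive within that column (multi-class multi-label).
y_true = np.array([[2, 1, 0],
                   [0, 2, 1],
                   [1, 2, 4]])

# Invented features, one row per sample.
X = np.arange(6, dtype=float).reshape(3, 2)

# One multiclass classifier is fitted per label column.
clf = MultiOutputClassifier(LogisticRegression()).fit(X, y_true)

print(clf.predict(X).shape)  # -> (3, 3): one predicted class per label column
```

Feeding the same y_true to OneVsRestClassifier would not work, since it expects either a 1-D multiclass target or a binary indicator matrix.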
Sklearn OVR will work when,
y_true = np.array([[0, 1, 1], [0, 0, 1], [1, 1, 0]])
where label1, label2 and label3 have only 2 classes each. So a sample either belongs to that label or it doesn't. For example, the first sample belongs to label2 and label3.
I am sorry I couldn't find a real-world example for this kind of use case.