Something wrong when implementing SVM one-vs-all in Python

I was trying to verify that I had correctly understood how SVM OVA (One-vs-All) works by comparing scikit-learn's OneVsRestClassifier with my own implementation.

In the following code, I train num_classes binary classifiers in the training phase, then run all of them on the test set and, for each sample, select the class whose classifier returns the highest probability.

import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score,classification_report
from sklearn.preprocessing import scale

# Read dataset 
df = pd.read_csv('In/winequality-white.csv',  delimiter=';')
X = df.loc[:, df.columns != 'quality']
Y = df.loc[:, df.columns == 'quality']
my_classes = np.unique(Y)
num_classes = len(my_classes)

# Train-test split
np.random.seed(42)
msk = np.random.rand(len(df)) <= 0.8
train = df[msk]
test = df[~msk]

# From dataset to features and labels
X_train = train.loc[:, train.columns != 'quality']
Y_train = train.loc[:, train.columns == 'quality']
X_test = test.loc[:, test.columns != 'quality']
Y_test = test.loc[:, test.columns == 'quality']

# Models
clf =  [None] * num_classes
for k in np.arange(0,num_classes):
    my_model = SVC(gamma='auto', C=1000, kernel='rbf', class_weight='balanced', probability=True)
    clf[k] = my_model.fit(X_train, Y_train==my_classes[k])

# Prediction
prob_table = np.zeros((len(Y_test), num_classes))
for k in np.arange(0,num_classes):
    p = clf[k].predict_proba(X_test)
    prob_table[:,k] = p[:,list(clf[k].classes_).index(True)]
Y_pred = prob_table.argmax(axis=1)

print("Test accuracy = ", accuracy_score( Y_test, Y_pred) * 100,"\n\n") 

The test accuracy is 0.21, whereas OneVsRestClassifier returns 0.59. For completeness, here is the other code (the pre-processing steps are the same as before):

....
clf = OneVsRestClassifier(SVC(gamma='auto', C=1000, kernel='rbf', class_weight='balanced'))
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
print("Test accuracy = ", accuracy_score( Y_test, Y_pred) * 100,"\n\n")

Is there something wrong in my own implementation of SVM - OVA?

Asked Dec 15 '20 by Alessandro



3 Answers

Is there something wrong in my own implementation of SVM - OVA?

Your unique classes are array([3, 4, 5, 6, 7, 8, 9]), but the line Y_pred = prob_table.argmax(axis=1) returns 0-based column indices rather than those class labels.
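
A minimal fix on top of your original code is to map the argmax column index back to the actual class label:

Y_pred = my_classes[prob_table.argmax(axis=1)]  # column index -> class label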

Try refactoring your code so it is less prone to errors from assumptions like that:

import pandas as pd
import numpy as np

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split

# Read dataset
df = pd.read_csv('winequality-white.csv', delimiter=';')
y = df["quality"]
my_classes = np.unique(y)
X = df.drop("quality", axis=1)

X_train, X_test, Y_train, Y_test = train_test_split(X,y, random_state=42)

# Models
clfs =  []

for k in my_classes:
    my_model = SVC(gamma='auto', C=1000, kernel='rbf', class_weight='balanced'
                   , probability=True, random_state=42)
    clfs.append(my_model.fit(X_train, Y_train==k))

# Prediction
prob_table = np.zeros((len(X_test),len(my_classes)))

for i,clf in enumerate(clfs):
    probs = clf.predict_proba(X_test)[:,1]
    prob_table[:,i] = probs
    
Y_pred = my_classes[prob_table.argmax(1)]
print("Test accuracy = ", accuracy_score(Y_test, Y_pred) * 100,)

from sklearn.multiclass import OneVsRestClassifier
clf = OneVsRestClassifier(SVC(gamma='auto', C=1000, kernel='rbf'
                              ,class_weight='balanced', random_state=42))
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
print("Test accuracy = ", accuracy_score(Y_test, Y_pred) * 100,)

Test accuracy =  61.795918367346935
Test accuracy =  58.93877551020408

Note the difference between OVR based on probabilities, which is more fine-grained and here yields better results, and OVR based on labels.

For further experiments you may wish to wrap the classifier into a reusable class:

from sklearn.base import BaseEstimator, ClassifierMixin


class OVRBinomial(BaseEstimator, ClassifierMixin):

    def __init__(self, cls):
        self.cls = cls

    def fit(self, X, y, **kwargs):
        self.classes_ = np.unique(y)
        self.clfs_ = []
        for c in self.classes_:
            clf = self.cls(**kwargs)
            clf.fit(X, y == c)
            self.clfs_.append(clf)
        return self

    def predict(self, X, **kwargs):
        probs = np.zeros((len(X), len(self.classes_)))
        for i, c in enumerate(self.classes_):
            prob = self.clfs_[i].predict_proba(X, **kwargs)[:, 1]
            probs[:, i] = prob
        idx_max = np.argmax(probs, 1)
        return self.classes_[idx_max]
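
Usage might then look like this (a sketch assuming the X_train, Y_train, X_test and Y_test variables from the refactored code above; keyword arguments passed to fit are forwarded to the SVC constructor):

ovr = OVRBinomial(SVC)
ovr.fit(X_train, Y_train, gamma='auto', C=1000, kernel='rbf',
        class_weight='balanced', probability=True, random_state=42)
Y_pred = ovr.predict(X_test)
print("Test accuracy = ", accuracy_score(Y_test, Y_pred) * 100)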
Answered Oct 21 '22 by Sergey Bushmanov


There is a mistake in the prediction part of your code. With the command Y_pred = prob_table.argmax(axis=1), you get the index of the column with the maximum probability. But you want the class with the maximum probability, not the column index:

Y_pred = my_classes[prob_table.argmax(axis=1)]
Answered Oct 21 '22 by Pierre-Loic


The basic idea of one-vs-rest is to predict the probability of the "one" class with each estimator (disregarding the probability of the "rest" class) and then take the estimator with the highest probability. pandas can do this with .idxmax, which returns, for each row, the name of the column holding the highest probability.
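
As a toy illustration of idxmax (hypothetical probabilities, with column names standing in for class labels):

import pandas

probabilities = pandas.DataFrame({0: [0.2, 0.7], 1: [0.8, 0.1], 2: [0.5, 0.3]})
print(probabilities.idxmax(axis=1))  # row 0 -> class 1, row 1 -> class 0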

This should work:

import pandas

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Read/load dataset
dataset = load_wine()
X = dataset["data"]
y = dataset["target"]
classes = {
    key: value
    for key, value in zip(range(len(dataset["target_names"])), dataset["target_names"])
}

# Create a train/test split (training set is 80% of the data, make sure the different classes are balanced across train and test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=43, shuffle=True, stratify=y
)

# Create a set of models
estimators = {}
for class_number, class_name in classes.items():

    # Create a model
    estimator = SVC(
        gamma="auto", C=1000, kernel="rbf", class_weight="balanced", probability=True
    )

    # Fit the model, make sure y is 1 if the class is the target for this estimator, otherwise (rest) 0
    estimator = estimator.fit(
        X_train, [1 if element == class_number else 0 for element in y_train]
    )

    # Store the trained model
    estimators[class_number] = estimator

# Make predictions
prediction_probabilities = {}
for class_number, estimator in estimators.items():

    # Every estimator predicts the probability for their target class
    prediction_probabilities[class_number] = estimator.predict_proba(X_test)[:, 1]

# Combine the probabilities into a dataframe
prediction_probabilities_df = pandas.DataFrame(prediction_probabilities)

# The prediction for each row is the column with the highest probability
y_pred = prediction_probabilities_df.idxmax(axis=1)

# Calculate the test accuracy
accuracy = accuracy_score(y_test, y_pred) * 100
print(f"Test accuracy (custom OneVsRest): {accuracy}")


# Create the model
clf = OneVsRestClassifier(
    SVC(gamma="auto", C=1000, kernel="rbf", class_weight="balanced")
)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate the test accuracy
accuracy = accuracy_score(y_test, y_pred) * 100
print(f"Test accuracy (Scikit-Learn OneVsRest): {accuracy}")

Output:

Test accuracy (custom OneVsRest): 47.22222222222222
Test accuracy (Scikit-Learn OneVsRest): 41.66666666666667
Answered Oct 21 '22 by Gijs Wobben