I was trying to verify that I had correctly understood how SVM - OVA (One-versus-All) works by comparing the function OneVsRestClassifier with my own implementation.
In the following code, I trained num_classes binary classifiers in the training phase, then applied all of them to the test set and selected, for each sample, the class whose classifier returned the highest probability value.
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score,classification_report
from sklearn.preprocessing import scale
# Read dataset
df = pd.read_csv('In/winequality-white.csv', delimiter=';')
X = df.loc[:, df.columns != 'quality']
Y = df.loc[:, df.columns == 'quality']
my_classes = np.unique(Y)
num_classes = len(my_classes)
# Train-test split
np.random.seed(42)
msk = np.random.rand(len(df)) <= 0.8
train = df[msk]
test = df[~msk]
# From dataset to features and labels
X_train = train.loc[:, train.columns != 'quality']
Y_train = train.loc[:, train.columns == 'quality']
X_test = test.loc[:, test.columns != 'quality']
Y_test = test.loc[:, test.columns == 'quality']
# Models
clf = [None] * num_classes
for k in np.arange(0, num_classes):
    my_model = SVC(gamma='auto', C=1000, kernel='rbf', class_weight='balanced', probability=True)
    clf[k] = my_model.fit(X_train, Y_train == my_classes[k])
# Prediction
prob_table = np.zeros((len(Y_test), num_classes))
for k in np.arange(0, num_classes):
    p = clf[k].predict_proba(X_test)
    prob_table[:, k] = p[:, list(clf[k].classes_).index(True)]
Y_pred = prob_table.argmax(axis=1)
print("Test accuracy = ", accuracy_score( Y_test, Y_pred) * 100,"\n\n")
The test accuracy is equal to 0.21, while when using OneVsRestClassifier it is 0.59. For completeness, I also report the other code (the pre-processing steps are the same as before):
....
clf = OneVsRestClassifier(SVC(gamma='auto', C=1000, kernel='rbf', class_weight='balanced'))
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
print("Test accuracy = ", accuracy_score( Y_test, Y_pred) * 100,"\n\n")
Is there something wrong in my own implementation of SVM - OVA?
In its most basic form, SVM doesn't support multiclass classification. For multiclass classification, the same principle is applied after breaking the problem down into smaller subproblems, each of which is a binary classification problem. Heuristic methods are used to split a multiclass classification problem into multiple binary classification datasets and to train one binary classification model per dataset. Two examples of these heuristic methods are One-vs-Rest (OvR) and One-vs-One (OvO).
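As a rough illustration of the OvR decomposition (a minimal sketch with made-up labels, not the questioner's data), each class gets its own binary label vector, and one binary classifier is trained per vector:

import numpy as np

y = np.array([3, 5, 5, 7, 3, 9])        # multiclass labels (made-up)
for c in np.unique(y):
    y_binary = (y == c).astype(int)     # 1 for "class c", 0 for "the rest"
    print(c, y_binary)
    # one binary classifier would then be trained on (X, y_binary)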
Is there something wrong in my own implementation of SVM - OVA?
You have the unique classes array([3, 4, 5, 6, 7, 8, 9]); however, the line Y_pred = prob_table.argmax(axis=1) assumes they are 0-indexed.
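To see why this wrecks the accuracy (a toy sketch with made-up probabilities): argmax returns column positions 0..6, which coincide with the quality labels 3..9 only by accident, which is why the score is very low rather than exactly zero:

import numpy as np

my_classes = np.array([3, 4, 5, 6, 7, 8, 9])
prob_row = np.array([0.10, 0.20, 0.60, 0.05, 0.02, 0.02, 0.01])
print(prob_row.argmax())               # 2 -- a column index, not a quality label
print(my_classes[prob_row.argmax()])   # 5 -- the class you actually want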
Try refactoring your code to be less prone to assumptions like that:
import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv('winequality-white.csv', delimiter=';')
y = df["quality"]
my_classes = np.unique(y)
X = df.drop("quality", axis=1)
X_train, X_test, Y_train, Y_test = train_test_split(X, y, random_state=42)

# Models: one binary classifier per class
clfs = []
for k in my_classes:
    my_model = SVC(gamma='auto', C=1000, kernel='rbf', class_weight='balanced',
                   probability=True, random_state=42)
    clfs.append(my_model.fit(X_train, Y_train == k))

# Prediction: column i holds the probability of class my_classes[i]
prob_table = np.zeros((len(X_test), len(my_classes)))
for i, clf in enumerate(clfs):
    prob_table[:, i] = clf.predict_proba(X_test)[:, 1]

# Map the argmax column index back to the actual class label
Y_pred = my_classes[prob_table.argmax(1)]
print("Test accuracy = ", accuracy_score(Y_test, Y_pred) * 100)
from sklearn.multiclass import OneVsRestClassifier

clf = OneVsRestClassifier(SVC(gamma='auto', C=1000, kernel='rbf',
                              class_weight='balanced', random_state=42))
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
print("Test accuracy = ", accuracy_score(Y_test, Y_pred) * 100)
Test accuracy = 61.795918367346935
Test accuracy = 58.93877551020408
Note the difference between OVR based on probabilities, which is more fine-grained and here yields better results, and OVR based on hard labels.
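As a further experiment (a sketch reusing the clfs fitted above; SVC.decision_function returns the signed distance to the separating hyperplane), you could also rank the classes by the raw SVM decision values instead of the calibrated probabilities:

score_table = np.zeros((len(X_test), len(my_classes)))
for i, clf in enumerate(clfs):
    # larger decision values mean stronger evidence for the "one" class
    score_table[:, i] = clf.decision_function(X_test)
Y_pred_scores = my_classes[score_table.argmax(1)]
print("Test accuracy (decision_function) = ", accuracy_score(Y_test, Y_pred_scores) * 100)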
For further experiments you may wish to wrap the classifier into a reusable class:
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin

class OVRBinomial(BaseEstimator, ClassifierMixin):

    def __init__(self, cls):
        # cls is the estimator class to instantiate per binary problem, e.g. SVC
        self.cls = cls

    def fit(self, X, y, **kwargs):
        self.classes_ = np.unique(y)
        self.clfs_ = []
        for c in self.classes_:
            clf = self.cls(**kwargs)
            clf.fit(X, y == c)
            self.clfs_.append(clf)
        return self

    def predict(self, X):
        probs = np.zeros((len(X), len(self.classes_)))
        for i, c in enumerate(self.classes_):
            probs[:, i] = self.clfs_[i].predict_proba(X)[:, 1]
        idx_max = np.argmax(probs, 1)
        return self.classes_[idx_max]
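For example, it could be used like this (a usage sketch; note that fit forwards its keyword arguments to the estimator's constructor, and probability=True is required for predict_proba):

ovr = OVRBinomial(SVC)
ovr.fit(X_train, Y_train, gamma='auto', C=1000, kernel='rbf',
        class_weight='balanced', probability=True, random_state=42)
Y_pred = ovr.predict(X_test)
print("Test accuracy = ", accuracy_score(Y_test, Y_pred) * 100)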
There is a mistake in the prediction part of your code. With the command Y_pred = prob_table.argmax(axis=1) you get the index of the column with the maximum probability. But you want the class with the maximum probability, not the column index:
Y_pred = my_classes[prob_table.argmax(axis=1)]
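A toy sketch of the fix (made-up probabilities, not the wine data), showing how indexing the class array maps each row's column index back to its label:

import numpy as np

my_classes = np.array([3, 4, 5, 6, 7, 8, 9])
prob_table = np.array([[0.1, 0.7, 0.2, 0.0, 0.0, 0.0, 0.0],
                       [0.0, 0.1, 0.1, 0.8, 0.0, 0.0, 0.0]])
print(prob_table.argmax(axis=1))               # [1 3] -- column indices
print(my_classes[prob_table.argmax(axis=1)])   # [4 6] -- actual class labels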
The basic idea of one-vs-rest is to predict the probability of the "one" class with each estimator (disregarding the probability of the "rest" class) and then take the estimator with the highest probability. pandas can do this with .idxmax, which returns the column name of the highest probability in each row.
This should work:
import pandas
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
# Read/load dataset
dataset = load_wine()
X = dataset["data"]
y = dataset["target"]
classes = {
    key: value
    for key, value in zip(range(len(dataset["target_names"])), dataset["target_names"])
}
# Create a train/test split (training set is 80% of the data; make sure the
# different classes are balanced across train and test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=43, shuffle=True, stratify=y
)
# Create a set of models
estimators = {}
for class_number, class_name in classes.items():
    # Create a model
    estimator = SVC(
        gamma="auto", C=1000, kernel="rbf", class_weight="balanced", probability=True
    )
    # Fit the model; make sure y is 1 if the class is the target for this
    # estimator, otherwise (rest) 0
    estimator = estimator.fit(
        X_train, [1 if element == class_number else 0 for element in y_train]
    )
    # Store the trained model
    estimators[class_number] = estimator
# Make predictions
prediction_probabilities = {}
for class_number, estimator in estimators.items():
    # Every estimator predicts the probability for its target class
    prediction_probabilities[class_number] = estimator.predict_proba(X_test)[:, 1]
# Combine the probabilities into a dataframe
prediction_probabilities_df = pandas.DataFrame(prediction_probabilities)
# The prediction for each row is the column with the highest probability
y_pred = prediction_probabilities_df.idxmax(axis=1)
# Calculate the test accuracy
accuracy = accuracy_score(y_test, y_pred) * 100
print(f"Test accuracy (custom OneVsRest): {accuracy}")
# Create the model
clf = OneVsRestClassifier(
    SVC(gamma="auto", C=1000, kernel="rbf", class_weight="balanced")
)
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
# Calculate the test accuracy
accuracy = accuracy_score(y_test, y_pred) * 100
print(f"Test accuracy (Scikit-Learn OneVsRest): {accuracy}")
Output:
Test accuracy (custom OneVsRest): 47.22222222222222
Test accuracy (Scikit-Learn OneVsRest): 41.66666666666667