Something wrong when implementing SVM one-vs-all in Python

I was trying to verify that I had correctly understood how SVM OVA (One-vs-All) works by comparing scikit-learn's OneVsRestClassifier with my own implementation.

In the following code, I train num_classes binary classifiers in the training phase, then run all of them on the test set and, for each sample, select the class whose classifier returns the highest probability.

import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score,classification_report
from sklearn.preprocessing import scale

# Read dataset 
df = pd.read_csv('In/winequality-white.csv',  delimiter=';')
X = df.loc[:, df.columns != 'quality']
Y = df.loc[:, df.columns == 'quality']
my_classes = np.unique(Y)
num_classes = len(my_classes)

# Train-test split
np.random.seed(42)
msk = np.random.rand(len(df)) <= 0.8
train = df[msk]
test = df[~msk]

# From dataset to features and labels
X_train = train.loc[:, train.columns != 'quality']
Y_train = train.loc[:, train.columns == 'quality']
X_test = test.loc[:, test.columns != 'quality']
Y_test = test.loc[:, test.columns == 'quality']

# Models
clf =  [None] * num_classes
for k in np.arange(0,num_classes):
    my_model = SVC(gamma='auto', C=1000, kernel='rbf', class_weight='balanced', probability=True)
    clf[k] = my_model.fit(X_train, Y_train==my_classes[k])

# Prediction
prob_table = np.zeros((len(Y_test), num_classes))
for k in np.arange(0,num_classes):
    p = clf[k].predict_proba(X_test)
    prob_table[:,k] = p[:,list(clf[k].classes_).index(True)]
Y_pred = prob_table.argmax(axis=1)

print("Test accuracy = ", accuracy_score( Y_test, Y_pred) * 100,"\n\n") 

The test accuracy is 0.21, whereas OneVsRestClassifier returns 0.59. For completeness, here is the other code (the pre-processing steps are the same as before):

....
clf = OneVsRestClassifier(SVC(gamma='auto', C=1000, kernel='rbf', class_weight='balanced'))
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
print("Test accuracy = ", accuracy_score( Y_test, Y_pred) * 100,"\n\n")

Is there something wrong in my own implementation of SVM - OVA?

Asked Dec 15 '20 by Alessandro



3 Answers

Is there something wrong in my own implementation of SVM - OVA?

Your unique classes are array([3, 4, 5, 6, 7, 8, 9]), but the line Y_pred = prob_table.argmax(axis=1) returns 0-based column indices rather than those class labels.
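
A minimal fix on top of your original code is to map the argmax column index back to the actual class label:

Y_pred = my_classes[prob_table.argmax(axis=1)]  # column index -> class label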

Try refactoring your code so it is less prone to errors from assumptions like that:

import pandas as pd
import numpy as np

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report
from sklearn.preprocessing import scale
from sklearn.model_selection import train_test_split

# Read dataset
df = pd.read_csv('winequality-white.csv', delimiter=';')
y = df["quality"]
my_classes = np.unique(y)
X = df.drop("quality", axis=1)

X_train, X_test, Y_train, Y_test = train_test_split(X,y, random_state=42)

# Models
clfs =  []

for k in my_classes:
    my_model = SVC(gamma='auto', C=1000, kernel='rbf', class_weight='balanced'
                   , probability=True, random_state=42)
    clfs.append(my_model.fit(X_train, Y_train==k))

# Prediction
prob_table = np.zeros((len(X_test),len(my_classes)))

for i,clf in enumerate(clfs):
    probs = clf.predict_proba(X_test)[:,1]
    prob_table[:,i] = probs
    
Y_pred = my_classes[prob_table.argmax(1)]
print("Test accuracy = ", accuracy_score(Y_test, Y_pred) * 100,)

from sklearn.multiclass import OneVsRestClassifier
clf = OneVsRestClassifier(SVC(gamma='auto', C=1000, kernel='rbf'
                              ,class_weight='balanced', random_state=42))
clf.fit(X_train, Y_train)
Y_pred = clf.predict(X_test)
print("Test accuracy = ", accuracy_score(Y_test, Y_pred) * 100,)

Test accuracy =  61.795918367346935
Test accuracy =  58.93877551020408

Note the difference between OVR based on probabilities, which is more fine-grained and here yields better results, and OVR based on labels.

For further experiments you may wish to wrap the classifier into a reusable class:

from sklearn.base import BaseEstimator, ClassifierMixin


class OVRBinomial(BaseEstimator, ClassifierMixin):

    def __init__(self, cls):
        self.cls = cls

    def fit(self, X, y, **kwargs):
        self.classes_ = np.unique(y)
        self.clfs_ = []
        for c in self.classes_:
            clf = self.cls(**kwargs)
            clf.fit(X, y == c)
            self.clfs_.append(clf)
        return self

    def predict(self, X, **kwargs):
        probs = np.zeros((len(X), len(self.classes_)))
        for i, c in enumerate(self.classes_):
            prob = self.clfs_[i].predict_proba(X, **kwargs)[:, 1]
            probs[:, i] = prob
        idx_max = np.argmax(probs, 1)
        return self.classes_[idx_max]
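
Usage might then look like this (a sketch assuming the X_train, Y_train, X_test and Y_test variables from the refactored code above; keyword arguments passed to fit are forwarded to the SVC constructor):

ovr = OVRBinomial(SVC)
ovr.fit(X_train, Y_train, gamma='auto', C=1000, kernel='rbf',
        class_weight='balanced', probability=True, random_state=42)
Y_pred = ovr.predict(X_test)
print("Test accuracy = ", accuracy_score(Y_test, Y_pred) * 100)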
Answered Oct 21 '22 by Sergey Bushmanov


There is a mistake in the prediction part of your code. With the command Y_pred = prob_table.argmax(axis=1), you get the index of the column with the maximum probability. But you want the class with the maximum probability, not the column index:

Y_pred = my_classes[prob_table.argmax(axis=1)]
Answered Oct 21 '22 by Pierre-Loic


The basic idea of one-vs-rest is to predict the probability of the "one" class with each estimator (disregarding the probability of the "rest" class) and then take the estimator with the highest probability. pandas can do this with .idxmax, which returns, for each row, the name of the column holding the highest probability.
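
As a toy illustration of idxmax (hypothetical probabilities, with column names standing in for class labels):

import pandas

probabilities = pandas.DataFrame({0: [0.2, 0.7], 1: [0.8, 0.1], 2: [0.5, 0.3]})
print(probabilities.idxmax(axis=1))  # row 0 -> class 1, row 1 -> class 0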

This should work:

import pandas

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

# Read/load dataset
dataset = load_wine()
X = dataset["data"]
y = dataset["target"]
classes = {
    key: value
    for key, value in zip(range(len(dataset["target_names"])), dataset["target_names"])
}

# Create a train/test split (training set is 80% of the data, make sure the different classes are balanced across train and test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=43, shuffle=True, stratify=y
)

# Create a set of models
estimators = {}
for class_number, class_name in classes.items():

    # Create a model
    estimator = SVC(
        gamma="auto", C=1000, kernel="rbf", class_weight="balanced", probability=True
    )

    # Fit the model, make sure y is 1 if the class is the target for this estimator, otherwise (rest) 0
    estimator = estimator.fit(
        X_train, [1 if element == class_number else 0 for element in y_train]
    )

    # Store the trained model
    estimators[class_number] = estimator

# Make predictions
prediction_probabilities = {}
for class_number, estimator in estimators.items():

    # Every estimator predicts the probability for their target class
    prediction_probabilities[class_number] = estimator.predict_proba(X_test)[:, 1]

# Combine the probabilities into a dataframe
prediction_probabilities_df = pandas.DataFrame(prediction_probabilities)

# The prediction for each row is the column with the highest probability
y_pred = prediction_probabilities_df.idxmax(axis=1)

# Calculate the test accuracy
accuracy = accuracy_score(y_test, y_pred) * 100
print(f"Test accuracy (custom OneVsRest): {accuracy}")


# Create the model
clf = OneVsRestClassifier(
    SVC(gamma="auto", C=1000, kernel="rbf", class_weight="balanced")
)
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate the test accuracy
accuracy = accuracy_score(y_test, y_pred) * 100
print(f"Test accuracy (Scikit-Learn OneVsRest): {accuracy}")

Output:

Test accuracy (custom OneVsRest): 47.22222222222222
Test accuracy (Scikit-Learn OneVsRest): 41.66666666666667
Answered Oct 21 '22 by Gijs Wobben