 

Using ROC AUC score with Logistic Regression and Iris Dataset

What I need is to:

  • Apply a logistic regression classifier.
  • Report the per-class ROC using the AUC.
  • Use the estimated probabilities from the logistic regression to guide the construction of the ROC.
  • Use 5-fold cross-validation to train the model.

For this, my approach was to follow this really nice tutorial:

Starting from that idea and method, I changed only how I obtain the raw data, which I load like this:

df = pd.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', 
    header=None, 
    sep=',')

df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True) # drops the empty line at file-end

df.tail()

# split data table into data X and class labels y
X = df.iloc[:,0:4].values
Y = df.iloc[:,4].values

Then I simply run the code. If I try to run it with metrics like accuracy or
balanced_accuracy, everything works fine (as with many other metrics). My problem is that when I try to run it with the roc_auc metric, I get the error:

"ValueError: Only one class present in y_true. ROC AUC score is not defined in that case."

This error has been discussed here, here, here, and here. However, I was not able to solve my problem with any of the solutions/workarounds suggested there.

My whole code is:

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
import pandas as pd
import numpy as np
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode
from sklearn.preprocessing import StandardScaler
from IPython import get_ipython
get_ipython().run_line_magic('matplotlib', 'qt')
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split


df = pd.read_csv(
    filepath_or_buffer='https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data', 
    header=None, 
    sep=',')

df.columns=['sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'class']
df.dropna(how="all", inplace=True) # drops the empty line at file-end

df.tail()

# split data table into data X and class labels y
X = df.iloc[:,0:4].values
Y = df.iloc[:,4].values

#print(X)
#print(Y)


seed = 7

# prepare models
models = []
models.append(('LR', LogisticRegression()))

# evaluate each model in turn
results = []
names = []
scoring = 'roc_auc'
for name, model in models:
    kfold = model_selection.KFold(n_splits=5, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)



# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()



2 Answers

The iris dataset is usually ordered with respect to classes. Hence, when you split without shuffling, the test dataset might get only one class.

One simple solution would be to use the shuffle parameter:

kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)

Even then, roc_auc does not support the multi-class format directly (the iris dataset has three classes).

Go through this link for more information on how to use roc_auc in a multi-class setting; one possible approach is sketched below.
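As a point of reference, newer scikit-learn releases (0.22 and later) added a multi_class parameter to roc_auc_score. A minimal sketch of a one-vs-rest evaluation under that assumption, reusing the X and Y arrays from the question (the other names here are illustrative, not from the original post):

from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

model = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

# out-of-fold class probabilities, one column per class
probas = cross_val_predict(model, X, Y, cv=cv, method='predict_proba')

# one-vs-rest AUC, macro-averaged over the three iris classes
print(roc_auc_score(Y, probas, multi_class='ovr'))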



Ideally, for classification tasks, a stratified k-fold iterator is used, which preserves the balance of classes in the train and test folds; a small check of that property is sketched below.
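For instance, assuming the X and Y arrays from the question (the fold counts follow from iris having 50 samples per class):

import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)
for train_idx, test_idx in skf.split(X, Y):
    # every test fold contains all three classes, 10 samples of each
    print(np.unique(Y[test_idx], return_counts=True))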

In scikit-learn's cross_val_score, the default cross-validation behaviour depends on the task. The documentation says:

cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy. Possible inputs for cv are:
  • None, to use the default 3-fold cross validation,
  • integer, to specify the number of folds in a (Stratified)KFold,
  • CV splitter,
  • An iterable yielding (train, test) splits as arrays of indices.

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.

Now, the iris dataset is a set of 150 samples ordered by class (Iris setosa, then Iris versicolor, then Iris virginica, 50 samples each). So a simple 5-fold K-fold iterator uses contiguous blocks of 30 samples as test sets; in the last split, the first 120 samples form the training set and the last 30 the test set, and those last 30 samples all belong to the single Iris virginica class.
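You can see this directly by checking which classes land in each unshuffled test fold (again assuming the X and Y arrays from the question):

import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)  # no shuffling: test folds are contiguous blocks of 30
for train_idx, test_idx in kf.split(X):
    print(np.unique(Y[test_idx]))
# three of the five test folds contain only a single class, which is
# exactly the situation in which roc_auc is undefined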

So if you do not have any specific reason to use KFold, then you can do this:

cv_results = model_selection.cross_val_score(model, X, Y, cv=5, scoring=scoring)

But now comes the issue of scoring. You are using 'roc_auc', which is only defined for binary classification tasks. So either choose a different metric in place of roc_auc, or specify which class you want to treat as positive and the other classes as negative, as sketched below.
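For example, to treat Iris virginica as the positive class and the other two as negative, you could binarize the labels yourself. A minimal sketch, assuming the X and Y arrays from the question (the class string matches the labels in the UCI file):

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# 1 for the chosen positive class, 0 for everything else
y_binary = (Y == 'Iris-virginica').astype(int)

model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y_binary, cv=5, scoring='roc_auc')
print(scores.mean(), scores.std())

Repeating this once per class gives the per-class AUC that the question asks for.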
