
How to nest LabelKFold?

I have a dataset with ~300 points and 32 distinct labels and I want to evaluate a LinearSVR model by plotting its learning curve using grid search and LabelKFold validation.

The code I have looks like this:

import numpy as np
from sklearn import preprocessing
from sklearn.svm import LinearSVR
from sklearn.pipeline import Pipeline
from sklearn.cross_validation import LabelKFold
from sklearn.grid_search import GridSearchCV
from sklearn.learning_curve import learning_curve
    ...
#get data (x, y, labels)
    ...
C_space = np.logspace(-3, 3, 10)
epsilon_space = np.logspace(-3, 3, 10)  

svr_estimator = Pipeline([
    ("scale", preprocessing.StandardScaler()),
    ("svr", LinearSVR()),  # a Pipeline step needs an estimator instance, not the class
])

search_params = dict(
    svr__C = C_space,
    svr__epsilon = epsilon_space
)

kfold = LabelKFold(labels, 5)

svr_search = GridSearchCV(svr_estimator, param_grid = search_params, cv = ???)

train_space = np.linspace(.5, 1, 10)
train_sizes, train_scores, valid_scores = learning_curve(svr_search, x, y, train_sizes = train_space, cv = ???, n_jobs = 4)
    ...
#plot learning curve

My question is how to set up the cv argument for the grid search and the learning curve so that my original set is split into training and test sets that share no labels when computing the learning curve, and each of those training sets is then split again, also without shared labels, for the grid search.

Essentially, how do I run a nested LabelKFold?


I, the user who created the bounty for this question, wrote the following reproducible example using data available from sklearn.

import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import cross_val_score, LabelKFold

digits = load_digits()
X = digits['data']
Y = digits['target']
Z = np.zeros_like(Y) ## this is just to make a 2-class problem, purely for the sake of an example
Z[np.where(Y>4)]=1

strata = [x % 13 for x in xrange(Y.size)] # define the strata for use in

## define stuff for nested cv...
mtry = [5, 10]
tuned_par = {'max_features': mtry}
toy_rf = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=10,
                                class_weight="balanced")
roc_auc_scorer = make_scorer(roc_auc_score, needs_threshold=True)

## define outer k-fold label-aware cv
outer_cv = LabelKFold(labels=strata, n_folds=5)

#############################################################################
##  this works: using regular (label-unaware) 5-fold CV in the inner folds
#############################################################################
vanilla_clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par, scoring=roc_auc_scorer,
                        cv=5, n_jobs=1)
vanilla_results = cross_val_score(vanilla_clf, X=X, y=Z, cv=outer_cv, n_jobs=1)

##########################################################################
##  this does not work: attempting to use label-aware CV in the inner loop
##########################################################################
inner_cv = LabelKFold(labels=strata, n_folds=5)
nested_kfold_clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par, scoring=roc_auc_scorer,
                                cv=inner_cv, n_jobs=1)
nested_kfold_results = cross_val_score(nested_kfold_clf, X=X, y=Z, cv=outer_cv, n_jobs=1)
Alex asked Jun 25 '16 00:06



1 Answer

From your question, you want a LabelKFold score on your data while grid-searching the parameters of your pipeline inside each iteration of that outer LabelKFold, again using a LabelKFold. I was not able to achieve that out of the box, but it takes only one loop:

outer_cv = LabelKFold(labels=strata, n_folds=3)
strata = np.array(strata)
scores = []
for outer_train, outer_test in outer_cv:
    print "Outer set. Train:", set(strata[outer_train]), "\tTest:", set(strata[outer_test])
    inner_cv = LabelKFold(labels=strata[outer_train], n_folds=3)
    print "\tInner:"
    for inner_train, inner_test in inner_cv:
        print "\t\tTrain:", set(strata[outer_train][inner_train]), "\tTest:", set(strata[outer_train][inner_test])
    clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par, scoring=roc_auc_scorer, cv= inner_cv, n_jobs=1)
    clf.fit(X[outer_train],Z[outer_train])
    scores.append(clf.score(X[outer_test], Z[outer_test]))

Running the code, the first iteration yields:

Outer set. Train: set([0, 1, 4, 5, 7, 8, 10, 11])   Test: set([9, 2, 3, 12, 6])
Inner:
    Train: set([0, 10, 11, 5, 7])   Test: set([8, 1, 4])
    Train: set([1, 4, 5, 8, 10, 11])    Test: set([0, 7])
    Train: set([0, 1, 4, 8, 7])     Test: set([10, 11, 5])
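
The listing above can also be checked programmatically. Below is a minimal sketch that does not use scikit-learn: group_folds is a hypothetical stand-in for LabelKFold that deals the distinct labels round-robin into folds (an assumption for illustration, not the balancing scikit-learn actually performs). It asserts that no label is shared between train and test at either level.

```python
def group_folds(labels, n_folds):
    """Yield (train_idx, test_idx) position lists.

    Hypothetical stand-in for LabelKFold: each distinct label is dealt
    round-robin into one fold, so a label is never on both sides of a split.
    """
    distinct = sorted(set(labels))
    fold_of = {lab: i % n_folds for i, lab in enumerate(distinct)}
    for k in range(n_folds):
        test = [i for i, lab in enumerate(labels) if fold_of[lab] == k]
        train = [i for i, lab in enumerate(labels) if fold_of[lab] != k]
        yield train, test


strata = [i % 13 for i in range(200)]  # same labelling scheme as the question

checked = 0
for outer_train, outer_test in group_folds(strata, 3):
    outer_train_labels = [strata[i] for i in outer_train]
    outer_test_labels = set(strata[i] for i in outer_test)
    assert not set(outer_train_labels) & outer_test_labels

    # The inner splitter only sees the labels of the outer training subset,
    # so its indices are positions within outer_train, not within strata.
    for inner_train, inner_test in group_folds(outer_train_labels, 3):
        inner_train_labels = set(outer_train_labels[i] for i in inner_train)
        inner_test_labels = set(outer_train_labels[i] for i in inner_test)
        assert not inner_train_labels & inner_test_labels
        assert not inner_test_labels & outer_test_labels
        checked += 1

print("inner splits checked:", checked)
```

The point to notice is that the inner folds are built from the labels of the outer training subset only, which is exactly what slicing strata[outer_train] does in the loop above.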

Hence, it is easy to verify that it executes as intended: no label appears on both sides of any split. Your cross-validation scores are in the list scores, and you can easily process them. I have reused the variables (e.g., strata) that you defined in your last piece of code.
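
For readers on scikit-learn 0.18 or later, where sklearn.model_selection provides GroupKFold in place of LabelKFold and the labels are passed as groups when splitting or fitting, the same loop might look roughly like the sketch below. It assumes that newer API rather than the one used in the question. The variable names mirror the question's example, and the built-in "roc_auc" scoring string is swapped in for the custom scorer.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold

digits = load_digits()
X = digits["data"]
Z = (digits["target"] > 4).astype(int)   # 2-class target, as in the question
strata = np.arange(Z.size) % 13          # group label of each sample

toy_rf = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=10,
                                class_weight="balanced")
tuned_par = {"max_features": [5, 10]}

outer_cv = GroupKFold(n_splits=3)
scores = []
for outer_train, outer_test in outer_cv.split(X, Z, groups=strata):
    clf = GridSearchCV(estimator=toy_rf, param_grid=tuned_par, scoring="roc_auc",
                       cv=GroupKFold(n_splits=3), n_jobs=1)
    # groups is sliced to the outer training rows, so the inner folds are
    # built only from labels present in that subset.
    clf.fit(X[outer_train], Z[outer_train], groups=strata[outer_train])
    # The held-out labels never appear in the data used for tuning.
    assert not set(strata[outer_train]) & set(strata[outer_test])
    scores.append(clf.score(X[outer_test], Z[outer_test]))

print(scores)
```

Because the splitter no longer stores the labels itself, one GroupKFold object can be reused at both levels; only the groups array has to be sliced to match the rows being fitted.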

geompalik answered Oct 17 '22 14:10
