Manual split versus Scikit Grid Search

I am perplexed by getting seemingly very different results when I rely on a "manual" split of the data into training and test sets versus using the scikit-learn grid search function. I am using an evaluation function sourced from a Kaggle competition for both runs, and the grid search is over a single parameter value (the same value as in the manual run). The resulting gini values are so different that there has to be an error somewhere, but I don't see it. Is there an oversight I am making in the comparison?

The first code block, when run, gives me a gini of just "Validation Sample Score: 0.0033997889 (normalized gini)."

The second block (using scikit-learn) results in much higher values:

Fitting 2 folds for each of 1 candidates, totalling 2 fits
0.334467621189
0.339421569449
[Parallel(n_jobs=-1)]: Done   3 out of   2 | elapsed:  9.9min remaining:  -198.0s
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:  9.9min finished
{'n_estimators': 1000}
0.336944643888
[mean: 0.33694, std: 0.00248, params: {'n_estimators': 1000}]

Eval function:

def gini(solution, submission):
    df = zip(solution, submission)
    df = sorted(df, key=lambda x: (x[1],x[0]), reverse=True)
    rand = [float(i+1)/float(len(df)) for i in range(len(df))]
    totalPos = float(sum([x[0] for x in df]))
    cumPosFound = [df[0][0]]
    for i in range(1,len(df)):
        cumPosFound.append(cumPosFound[len(cumPosFound)-1] + df[i][0])
    Lorentz = [float(x)/totalPos for x in cumPosFound]
    Gini = [Lorentz[i]-rand[i] for i in range(len(df))]
    return sum(Gini)

def normalized_gini(solution, submission):
    normalized_gini = gini(solution, submission)/gini(solution, solution)
    print normalized_gini
    return normalized_gini


gini_scorer = metrics.make_scorer(normalized_gini, greater_is_better = True)
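
For what it's worth, a quick sanity check of the scorer on toy vectors (the numbers below are made up, not from the competition data). By construction, normalized_gini of a perfect prediction is exactly 1.0, since gini(solution, solution) is divided by itself:

y_true = [1.0, 0.0, 3.0, 2.0, 5.0]   # hypothetical targets
y_pred = [0.8, 0.1, 2.2, 2.5, 4.9]   # hypothetical predictions (ranking not quite perfect)

normalized_gini(y_true, y_true)   # prints exactly 1.0
normalized_gini(y_true, y_pred)   # prints a value below 1.0; closer to 1 means a better ranking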

Block 1:

if __name__ == '__main__':

    dat=pd.read_table('train.csv',sep=",")

    y=dat[['Hazard']].values.ravel()
    dat=dat.drop(['Hazard','Id'],axis=1)

    #sample out 30% for validation
    folds=train_test_split(range(len(y)),test_size=0.3) #30% test
    train_X=dat.iloc[folds[0],:]
    train_y=y[folds[0]]
    test_X=dat.iloc[folds[1],:]
    test_y=y[folds[1]]


    #assume no leakage by OH whole data
    dat_dict=train_X.T.to_dict().values()
    vectorizer = DV( sparse = False )
    vectorizer.fit( dat_dict )
    train_X = vectorizer.transform( dat_dict )

    del dat_dict

    dat_dict=test_X.T.to_dict().values()
    test_X = vectorizer.transform( dat_dict )

    del dat_dict



    rf=RandomForestRegressor(n_estimators=1000, n_jobs=-1)
    rf.fit(train_X,train_y)
    y_submission=rf.predict(test_X)
    print "Validation Sample Score: %.10f (normalized gini)." % normalized_gini(test_y,y_submission)

Block 2:

dat_dict=dat.T.to_dict().values()
vectorizer = DV( sparse = False )
vectorizer.fit( dat_dict )
X = vectorizer.transform( dat_dict )

parameters= {'n_estimators': [1000]}
grid_search = GridSearchCV(RandomForestRegressor(), param_grid=parameters,cv=2, verbose=1, scoring=gini_scorer,n_jobs=-1)
grid_search.fit(X,y)

print grid_search.best_params_
print grid_search.best_score_
print grid_search.grid_scores_

EDIT

Here is a self-contained example where I am getting the same sort of difference.

from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit,train_test_split
from sklearn.ensemble import RandomForestRegressor , ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV,RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from scipy.stats import randint, uniform
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston




if __name__ == '__main__':

    b=load_boston()
    X = pd.DataFrame(b.data)
    y = b.target

    #sample out 30% for validation
    folds=train_test_split(range(len(y)),test_size=0.5) #50% test
    train_X=X.iloc[folds[0],:]
    train_y=y[folds[0]]
    test_X=X.iloc[folds[1],:]
    test_y=y[folds[1]]


    rf=RandomForestRegressor(n_estimators=1000, n_jobs=-1)
    rf.fit(train_X,train_y)
    y_submission=rf.predict(test_X)

    print "Validation Sample Score: %.10f (mean squared)." % mean_squared_error(test_y,y_submission)


    parameters= {'n_estimators': [1000]}
    grid_search = GridSearchCV(RandomForestRegressor(), param_grid=parameters,cv=2, verbose=1, scoring='mean_squared_error',n_jobs=-1)
    grid_search.fit(X,y)

    print grid_search.best_params_
    print grid_search.best_score_
    print grid_search.grid_scores_


2 Answers

Not sure I can provide you with a complete solution, but here are some pointers:

  1. Use the random_state parameter of scikit-learn objects when debugging this kind of issue, as it makes your results fully reproducible. The following will always return exactly the same number:

    rf=RandomForestRegressor(n_estimators=1000, n_jobs=-1, random_state=0)
    rf.fit(train_X,train_y)
    y_submission=rf.predict(test_X)
    mean_squared_error(test_y,y_submission)
    

It resets the random number generator so that you always get "the same randomness". You should be using it on train_test_split and GridSearchCV too.
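
For instance, a minimal sketch (reusing the variables from your self-contained Boston example; the seed value is arbitrary):

    SEED = 0  # hypothetical seed; any fixed integer works
    folds = train_test_split(range(len(y)), test_size=0.5, random_state=SEED)
    rf = RandomForestRegressor(n_estimators=1000, n_jobs=-1, random_state=SEED)
    grid_search = GridSearchCV(RandomForestRegressor(random_state=SEED),
                               param_grid={'n_estimators': [1000]},
                               cv=2, scoring='mean_squared_error', n_jobs=-1)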

  2. The results you get on the self-contained example are normal. Typically I got:

    Validation Sample Score: 9.8136434847 (mean squared).
    [mean: -22.38918, std: 11.56372, params: {'n_estimators': 1000}]
    

First, note that the mean squared error returned by GridSearchCV is a negated mean squared error. I think this is by design, to keep the spirit of a score function (for a score, greater is better).
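
As a small illustration (assuming the grid_search, test_y and y_submission from your self-contained example are in scope), the reported value can simply be negated back for a direct comparison:

# GridSearchCV reports a *score*, so the MSE comes back with its sign flipped;
# negating it gives a value comparable to the manual-split MSE.
manual_mse = mean_squared_error(test_y, y_submission)
cv_mse = -grid_search.best_score_        # e.g. -(-22.38918) == 22.38918
print("manual MSE: {:.5f}, CV MSE: {:.5f}".format(manual_mse, cv_mse))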

Now this is still 9.81 against 22.38. However, the standard deviation here is HUGE, which can explain why the scores look so different. If you want to check that GridSearchCV is not doing something dubious, you can force it to use a single split, the same one as your manual split:

from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit,train_test_split, PredefinedSplit
from sklearn.ensemble import RandomForestRegressor , ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV,RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from scipy.stats import randint, uniform
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston

if __name__ == '__main__':
    b=load_boston()
    X = pd.DataFrame(b.data)
    y = b.target
    folds=train_test_split(range(len(y)),test_size=0.5, random_state=15) #50% test
    folds_split = np.ones_like(y)
    folds_split[folds[0]] = -1
    ps = PredefinedSplit(folds_split)

    for tr, te in ps:
        train_X=X.iloc[tr,:]
        train_y=y[tr]
        test_X=X.iloc[te,:]
        test_y=y[te]
        rf=RandomForestRegressor(n_estimators=1000, n_jobs=1, random_state=15)
        rf.fit(train_X,train_y)
        y_submission=rf.predict(test_X)
        print("Validation Sample Score: {:.10f} (mean squared).".format(mean_squared_error(test_y, y_submission)))

    parameters= {'n_estimators': [1000], 'n_jobs': [1], 'random_state': [15]}
    grid_search = GridSearchCV(RandomForestRegressor(), param_grid=parameters,cv=ps, verbose=2, scoring='mean_squared_error', n_jobs=1)
    grid_search.fit(X,y)

    print("best_params: ", grid_search.best_params_)
    print("best_score", grid_search.best_score_)
    print("grid_scores", grid_search.grid_scores_)

Hope this helps a bit.

Sorry, I can't figure out what's going on with your Gini scorer. I'd say 0.0033xxx seems like a very low value for a normalized gini score, though (almost no model at all?).


Following your minimal example and the responses from user3914041 and Andreus, this works as intended. Indeed, I got:

Validation Sample Score: 10.176958 (mean squared).
Fitting 1 folds for each of 1 candidates, totalling 1 fits
mean: 10.19074, std: 0.00000, params: {'n_estimators': 1000}

In this case we get the same result with both methodologies (up to some rounding). Here is the code to reproduce the scores:

from sklearn.cross_validation import train_test_split, PredefinedSplit
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from sklearn import metrics
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.datasets import load_boston

b=load_boston()
X = b.data
y = b.target

folds=train_test_split(range(len(y)),test_size=0.5, random_state=10)
train_X=X[folds[0],:]
train_y=y[folds[0]]
test_X=X[folds[1],:]
test_y=y[folds[1]]

folds_split = np.zeros_like(y)
folds_split[folds[0]] = -1
ps = PredefinedSplit(folds_split)

rf=RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(train_X,train_y)
y_submission=rf.predict(test_X)

print "Validation Sample Score: %f (mean squared)." % mean_squared_error(test_y,y_submission)

mse_scorer = make_scorer(mean_squared_error)
parameters= {'n_estimators': [1000]}
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), cv=ps,
                           param_grid=parameters, verbose=1, scoring=mse_scorer)
grid_search.fit(X,y)

print grid_search.grid_scores_[0]

In your first example, try to remove greater_is_better=True. Indeed, the Gini coefficient is supposed to be minimized, not maximized.

Try to see if this solves the problem. You can also add a random seed to ensure your splits are done in exactly the same fashion.
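
For reference, a minimal sketch of the two scorer variants (assuming the normalized_gini function from the question is in scope). Note that greater_is_better defaults to True in make_scorer, so simply omitting the flag keeps the current behaviour; flipping the direction requires setting it to False explicitly:

from sklearn.metrics import make_scorer

# greater_is_better=True (the default): the raw normalized gini is used as the score
gini_scorer_max = make_scorer(normalized_gini, greater_is_better=True)

# greater_is_better=False: GridSearchCV negates the value internally so that
# "higher score is better" still holds (same mechanism as the negated MSE)
gini_scorer_min = make_scorer(normalized_gini, greater_is_better=False)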
