Manual split versus Scikit Grid Search

I am perplexed by getting seemingly very different results when I rely on a "manual" split of the data into training and test sets versus using the scikit-learn grid search function. I am using an evaluation function sourced from a Kaggle competition for both runs, and the grid search is over a single parameter value (the same value as in the manual run). The resulting gini values are so different that there has to be an error somewhere, but I don't see it. Is there an oversight I am making in the comparison?

The first code block, when run, gives me a gini of just "Validation Sample Score: 0.0033997889 (normalized gini)."

The second block (using scikit-learn) results in much higher values:

Fitting 2 folds for each of 1 candidates, totalling 2 fits
0.334467621189
0.339421569449
[Parallel(n_jobs=-1)]: Done   3 out of   2 | elapsed:  9.9min remaining:  -198.0s
[Parallel(n_jobs=-1)]: Done   2 out of   2 | elapsed:  9.9min finished
{'n_estimators': 1000}
0.336944643888
[mean: 0.33694, std: 0.00248, params: {'n_estimators': 1000}]

Eval function:

def gini(solution, submission):
    df = zip(solution, submission)
    df = sorted(df, key=lambda x: (x[1],x[0]), reverse=True)
    rand = [float(i+1)/float(len(df)) for i in range(len(df))]
    totalPos = float(sum([x[0] for x in df]))
    cumPosFound = [df[0][0]]
    for i in range(1,len(df)):
        cumPosFound.append(cumPosFound[len(cumPosFound)-1] + df[i][0])
    Lorentz = [float(x)/totalPos for x in cumPosFound]
    Gini = [Lorentz[i]-rand[i] for i in range(len(df))]
    return sum(Gini)

def normalized_gini(solution, submission):
    normalized_gini = gini(solution, submission)/gini(solution, solution)
    print normalized_gini
    return normalized_gini


gini_scorer = metrics.make_scorer(normalized_gini, greater_is_better = True)
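
For what it's worth, a quick sanity check of the scorer on toy vectors (the numbers below are made up, not from the competition data). By construction, normalized_gini of a perfect prediction is exactly 1.0, since gini(solution, solution) is divided by itself:

y_true = [1.0, 0.0, 3.0, 2.0, 5.0]   # hypothetical targets
y_pred = [0.8, 0.1, 2.2, 2.5, 4.9]   # hypothetical predictions (ranking not quite perfect)

normalized_gini(y_true, y_true)   # prints exactly 1.0
normalized_gini(y_true, y_pred)   # prints a value below 1.0; closer to 1 means a better ranking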

Block 1:

if __name__ == '__main__':

    dat=pd.read_table('train.csv',sep=",")

    y=dat[['Hazard']].values.ravel()
    dat=dat.drop(['Hazard','Id'],axis=1)

    #sample out 30% for validation
    folds=train_test_split(range(len(y)),test_size=0.3) #30% test
    train_X=dat.iloc[folds[0],:]
    train_y=y[folds[0]]
    test_X=dat.iloc[folds[1],:]
    test_y=y[folds[1]]


    #assume no leakage by OH whole data
    dat_dict=train_X.T.to_dict().values()
    vectorizer = DV( sparse = False )
    vectorizer.fit( dat_dict )
    train_X = vectorizer.transform( dat_dict )

    del dat_dict

    dat_dict=test_X.T.to_dict().values()
    test_X = vectorizer.transform( dat_dict )

    del dat_dict



    rf=RandomForestRegressor(n_estimators=1000, n_jobs=-1)
    rf.fit(train_X,train_y)
    y_submission=rf.predict(test_X)
    print "Validation Sample Score: %.10f (normalized gini)." % normalized_gini(test_y,y_submission)

Block 2:

dat_dict=dat.T.to_dict().values()
vectorizer = DV( sparse = False )
vectorizer.fit( dat_dict )
X = vectorizer.transform( dat_dict )

parameters= {'n_estimators': [1000]}
grid_search = GridSearchCV(RandomForestRegressor(), param_grid=parameters,cv=2, verbose=1, scoring=gini_scorer,n_jobs=-1)
grid_search.fit(X,y)

print grid_search.best_params_
print grid_search.best_score_
print grid_search.grid_scores_

EDIT

Here is a self-contained example where I am getting the same sort of difference.

from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit,train_test_split
from sklearn.ensemble import RandomForestRegressor , ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV,RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from scipy.stats import randint, uniform
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston




if __name__ == '__main__':

    b=load_boston()
    X = pd.DataFrame(b.data)
    y = b.target

    #sample out 30% for validation
    folds=train_test_split(range(len(y)),test_size=0.5) #50% test
    train_X=X.iloc[folds[0],:]
    train_y=y[folds[0]]
    test_X=X.iloc[folds[1],:]
    test_y=y[folds[1]]


    rf=RandomForestRegressor(n_estimators=1000, n_jobs=-1)
    rf.fit(train_X,train_y)
    y_submission=rf.predict(test_X)

    print "Validation Sample Score: %.10f (mean squared)." % mean_squared_error(test_y,y_submission)


    parameters= {'n_estimators': [1000]}
    grid_search = GridSearchCV(RandomForestRegressor(), param_grid=parameters,cv=2, verbose=1, scoring='mean_squared_error',n_jobs=-1)
    grid_search.fit(X,y)

    print grid_search.best_params_
    print grid_search.best_score_
    print grid_search.grid_scores_


2 Answers

Not sure I can provide you with a complete solution, but here are some pointers:

  1. Use the random_state parameter of scikit-learn objects when debugging this kind of issue, as it makes your results fully reproducible. The following will always return exactly the same number:

    rf=RandomForestRegressor(n_estimators=1000, n_jobs=-1, random_state=0)
    rf.fit(train_X,train_y)
    y_submission=rf.predict(test_X)
    mean_squared_error(test_y,y_submission)
    

It resets the random number generator so that you always get "the same randomness". You should be using it on train_test_split and GridSearchCV too.
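
For instance, a minimal sketch (reusing the variables from your self-contained Boston example; the seed value is arbitrary):

    SEED = 0  # hypothetical seed; any fixed integer works
    folds = train_test_split(range(len(y)), test_size=0.5, random_state=SEED)
    rf = RandomForestRegressor(n_estimators=1000, n_jobs=-1, random_state=SEED)
    grid_search = GridSearchCV(RandomForestRegressor(random_state=SEED),
                               param_grid={'n_estimators': [1000]},
                               cv=2, scoring='mean_squared_error', n_jobs=-1)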

  2. The results you get on the self-contained example are normal. Typically I got:

    Validation Sample Score: 9.8136434847 (mean squared).
    [mean: -22.38918, std: 11.56372, params: {'n_estimators': 1000}]
    

First, note that the mean squared error returned by GridSearchCV is a negated mean squared error. I think this is by design, to keep the spirit of a score function (for a score, greater is better).
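
As a small illustration (assuming the grid_search, test_y and y_submission from your self-contained example are in scope), the reported value can simply be negated back for a direct comparison:

# GridSearchCV reports a *score*, so the MSE comes back with its sign flipped;
# negating it gives a value comparable to the manual-split MSE.
manual_mse = mean_squared_error(test_y, y_submission)
cv_mse = -grid_search.best_score_        # e.g. -(-22.38918) == 22.38918
print("manual MSE: {:.5f}, CV MSE: {:.5f}".format(manual_mse, cv_mse))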

Now this is still 9.81 against 22.38. However, the standard deviation here is HUGE, which can explain why the scores look so different. If you want to check that GridSearchCV is not doing something dubious, you can force it to use a single split, the same one as your manual split:

from sklearn.cross_validation import StratifiedKFold, KFold, ShuffleSplit,train_test_split, PredefinedSplit
from sklearn.ensemble import RandomForestRegressor , ExtraTreesRegressor, GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd
from sklearn.feature_extraction import DictVectorizer as DV
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.grid_search import GridSearchCV,RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
from scipy.stats import randint, uniform
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_boston

if __name__ == '__main__':
    b=load_boston()
    X = pd.DataFrame(b.data)
    y = b.target
    folds=train_test_split(range(len(y)),test_size=0.5, random_state=15) #50% test
    folds_split = np.ones_like(y)
    folds_split[folds[0]] = -1
    ps = PredefinedSplit(folds_split)

    for tr, te in ps:
        train_X=X.iloc[tr,:]
        train_y=y[tr]
        test_X=X.iloc[te,:]
        test_y=y[te]
        rf=RandomForestRegressor(n_estimators=1000, n_jobs=1, random_state=15)
        rf.fit(train_X,train_y)
        y_submission=rf.predict(test_X)
        print("Validation Sample Score: {:.10f} (mean squared).".format(mean_squared_error(test_y, y_submission)))

    parameters= {'n_estimators': [1000], 'n_jobs': [1], 'random_state': [15]}
    grid_search = GridSearchCV(RandomForestRegressor(), param_grid=parameters,cv=ps, verbose=2, scoring='mean_squared_error', n_jobs=1)
    grid_search.fit(X,y)

    print("best_params: ", grid_search.best_params_)
    print("best_score", grid_search.best_score_)
    print("grid_scores", grid_search.grid_scores_)

Hope this helps a bit.

Sorry, I can't figure out what's going on with your Gini scorer. I'd say 0.0033xxx seems like a very low value for a normalized gini score, though (almost no model at all?).


Following your minimal example and the responses from user3914041 and Andreus, this works as intended. Indeed, I got:

Validation Sample Score: 10.176958 (mean squared).
Fitting 1 folds for each of 1 candidates, totalling 1 fits
mean: 10.19074, std: 0.00000, params: {'n_estimators': 1000}

In this case we get the same result with both methodologies (up to some rounding). Here is the code to reproduce the scores:

from sklearn.cross_validation import train_test_split, PredefinedSplit
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from sklearn import metrics
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.datasets import load_boston

b=load_boston()
X = b.data
y = b.target

folds=train_test_split(range(len(y)),test_size=0.5, random_state=10)
train_X=X[folds[0],:]
train_y=y[folds[0]]
test_X=X[folds[1],:]
test_y=y[folds[1]]

folds_split = np.zeros_like(y)
folds_split[folds[0]] = -1
ps = PredefinedSplit(folds_split)

rf=RandomForestRegressor(n_estimators=1000, random_state=42)
rf.fit(train_X,train_y)
y_submission=rf.predict(test_X)

print "Validation Sample Score: %f (mean squared)." % mean_squared_error(test_y,y_submission)

mse_scorer = make_scorer(mean_squared_error)
parameters= {'n_estimators': [1000]}
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), cv=ps,
                           param_grid=parameters, verbose=1, scoring=mse_scorer)
grid_search.fit(X,y)

print grid_search.grid_scores_[0]

In your first example, try to remove greater_is_better=True. Indeed, the Gini coefficient is supposed to be minimized, not maximized.

Try to see if this solves the problem. You can also add a random seed to ensure your splits are done in exactly the same fashion.
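
For reference, a minimal sketch of the two scorer variants (assuming the normalized_gini function from the question is in scope). Note that greater_is_better defaults to True in make_scorer, so simply omitting the flag keeps the current behaviour; flipping the direction requires setting it to False explicitly:

from sklearn.metrics import make_scorer

# greater_is_better=True (the default): the raw normalized gini is used as the score
gini_scorer_max = make_scorer(normalized_gini, greater_is_better=True)

# greater_is_better=False: GridSearchCV negates the value internally so that
# "higher score is better" still holds (same mechanism as the negated MSE)
gini_scorer_min = make_scorer(normalized_gini, greater_is_better=False)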
