Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

RandomizedSearchCV gives different results using the same random_state

I am using a pipeline to perform feature selection and hyperparameter optimization using RandomizedSearchCV. Here is a summary of the code:

from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.grid_search import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from scipy.stats import randint as sp_randint

rng = 44

X_train, X_test, y_train, y_test = 
   train_test_split(data[features], data['target'], random_state=rng)


clf = RandomForestClassifier(random_state=rng)
kbest = SelectKBest()
pipe = make_pipeline(kbest,clf)

upLim = X_train.shape[1]
param_dist = {'selectkbest__k':sp_randint(upLim/2,upLim+1),
  'randomforestclassifier__n_estimators': sp_randint(5,150),
  'randomforestclassifier__max_depth': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, None],
  'randomforestclassifier__criterion': ["gini", "entropy"],
  'randomforestclassifier__max_features': ['auto', 'sqrt', 'log2']}
clf_opt = RandomizedSearchCV(pipe, param_distributions= param_dist, 
                             scoring='roc_auc', n_jobs=1, cv=3, random_state=rng)
clf_opt.fit(X_train,y_train)
y_pred = clf_opt.predict(X_test)

I am using a constant random_state for the train_test_split, RandomForestClassifer, and RandomizedSearchCV. However, the result of the above code is slightly different if I run it several times. More specifically, I have several test units in my code and these slightly different results leads to failure of the test units. Should not I obtain the same results because of using the same random_state? Am I missing anything in my code that creates randomness in a part of the code?

like image 835
MhFarahani Avatar asked Jan 06 '17 23:01

MhFarahani


People also ask

What does random state mean in train test split?

The random state hyperparameter in the train_test_split() function controls the shuffling process. With random_state=None , we get different train and test sets across different executions and the shuffling process is out of control. With random_state=0 , we get the same train and test sets across different executions.

Which random_state value is preferred while using train_test_split function?

train_test_split), is recommended to used the parameter ( random_state=42) to produce the same results across a different run.

What is random state in random forest classifier?

A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

What does random_state 42 mean?

42 is the Answer to the Ultimate Question of Life, the Universe, and Everything. On a serious note, random_state simply sets a seed to the random generator, so that your train-test splits are always deterministic.


1 Answers

I usually answer my own questions! I will leave it here for others with similar question:

To make sure that I am avoiding any randomness, I defined a random seed. The code is as follows:

from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.grid_search import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from scipy.stats import randint as sp_randint

seed = np.random.seed(22)

X_train, X_test, y_train, y_test = 
   train_test_split(data[features], data['target'])


clf = RandomForestClassifier()
kbest = SelectKBest()
pipe = make_pipeline(kbest,clf)

upLim = X_train.shape[1]
param_dist = {'selectkbest__k':sp_randint(upLim/2,upLim+1),
  'randomforestclassifier__n_estimators': sp_randint(5,150),
  'randomforestclassifier__max_depth': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, None],
  'randomforestclassifier__criterion': ["gini", "entropy"],
  'randomforestclassifier__max_features': ['auto', 'sqrt', 'log2']}
clf_opt = RandomizedSearchCV(pipe, param_distributions= param_dist, 
                             scoring='roc_auc', n_jobs=1, cv=3)
clf_opt.fit(X_train,y_train)
y_pred = clf_opt.predict(X_test)

I hope it can help others!

like image 75
MhFarahani Avatar answered Oct 02 '22 23:10

MhFarahani