I am using a pipeline to perform feature selection and hyperparameter optimization with RandomizedSearchCV. Here is a summary of the code:
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline
from scipy.stats import randint as sp_randint

rng = 44
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data['target'], random_state=rng)

clf = RandomForestClassifier(random_state=rng)
kbest = SelectKBest()
pipe = make_pipeline(kbest, clf)

upLim = X_train.shape[1]
param_dist = {'selectkbest__k': sp_randint(upLim // 2, upLim + 1),
              'randomforestclassifier__n_estimators': sp_randint(5, 150),
              'randomforestclassifier__max_depth': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, None],
              'randomforestclassifier__criterion': ["gini", "entropy"],
              'randomforestclassifier__max_features': ['sqrt', 'log2']}  # 'auto' (alias of 'sqrt') was removed in newer scikit-learn

clf_opt = RandomizedSearchCV(pipe, param_distributions=param_dist,
                             scoring='roc_auc', n_jobs=1, cv=3, random_state=rng)
clf_opt.fit(X_train, y_train)
y_pred = clf_opt.predict(X_test)
I am using a constant random_state for train_test_split, RandomForestClassifier, and RandomizedSearchCV. However, the result of the above code is slightly different each time I run it. More specifically, I have several unit tests in my code, and these slightly different results make those tests fail. Shouldn't I obtain the same results since I use the same random_state everywhere? Am I missing something in my code that introduces randomness somewhere?
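For example, the kind of check my unit tests effectively perform looks like this (a minimal sketch using the objects defined above; if every source of randomness were seeded, the two searches should agree):

# Sketch: fit the same search twice and compare what it picked.
search_a = RandomizedSearchCV(pipe, param_distributions=param_dist,
                              scoring='roc_auc', n_jobs=1, cv=3, random_state=rng)
search_b = RandomizedSearchCV(pipe, param_distributions=param_dist,
                              scoring='roc_auc', n_jobs=1, cv=3, random_state=rng)
search_a.fit(X_train, y_train)
search_b.fit(X_train, y_train)
print(search_a.best_params_ == search_b.best_params_)   # True only if all randomness is under the seed
print(search_a.best_score_, search_b.best_score_)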
The random_state parameter of train_test_split() controls the shuffling applied to the data before the split. With random_state=None you get a different train/test split on every execution, and the shuffling is not reproducible. With a fixed integer such as random_state=0 you get the same split across executions.
For train_test_split it is therefore recommended to set the parameter (e.g. random_state=42) so that the same results are produced across different runs.
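A quick way to see this (a small sketch on toy arrays):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Same fixed seed -> identical splits on every call.
a_train, a_test, _, _ = train_test_split(X, y, random_state=0)
b_train, b_test, _, _ = train_test_split(X, y, random_state=0)
print(np.array_equal(a_train, b_train))   # True

# random_state=None -> a fresh split (usually different) each call.
c_train, c_test, _, _ = train_test_split(X, y, random_state=None)
print(np.array_equal(a_train, c_train))   # typically False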
A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
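Because both the bootstrap samples and the per-split feature subsets are drawn at random, the forest itself is only reproducible when its own random_state is fixed; for illustration (a small sketch on synthetic data):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Two forests with the same seed are trained on the same bootstraps and splits.
f1 = RandomForestClassifier(n_estimators=50, random_state=44).fit(X, y)
f2 = RandomForestClassifier(n_estimators=50, random_state=44).fit(X, y)
print((f1.predict(X) == f2.predict(X)).all())   # True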
42 is the Answer to the Ultimate Question of Life, the Universe, and Everything. On a serious note, random_state simply sets a seed for the random number generator, so that your train-test splits are always deterministic.
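Concretely, an integer seed always produces the same generator and therefore the same sequence of draws (a small sketch using check_random_state, the helper scikit-learn applies to random_state internally):

import numpy as np
from sklearn.utils import check_random_state

rs1 = check_random_state(42)   # integer seed -> reproducible RandomState
rs2 = check_random_state(42)
print(np.array_equal(rs1.randint(0, 100, size=5),
                     rs2.randint(0, 100, size=5)))   # True

rs3 = check_random_state(None)  # None -> fresh entropy, different draws each run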
I usually answer my own questions! I will leave this here for others with a similar question:
To make sure no source of randomness is left uncontrolled, I set a global NumPy random seed. The code is as follows:
import numpy as np
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import make_pipeline
from scipy.stats import randint as sp_randint

np.random.seed(22)   # seed the global NumPy RNG once, up front

X_train, X_test, y_train, y_test = train_test_split(data[features], data['target'])

clf = RandomForestClassifier()
kbest = SelectKBest()
pipe = make_pipeline(kbest, clf)

upLim = X_train.shape[1]
param_dist = {'selectkbest__k': sp_randint(upLim // 2, upLim + 1),
              'randomforestclassifier__n_estimators': sp_randint(5, 150),
              'randomforestclassifier__max_depth': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, None],
              'randomforestclassifier__criterion': ["gini", "entropy"],
              'randomforestclassifier__max_features': ['sqrt', 'log2']}  # 'auto' was removed in newer scikit-learn

clf_opt = RandomizedSearchCV(pipe, param_distributions=param_dist,
                             scoring='roc_auc', n_jobs=1, cv=3)
clf_opt.fit(X_train, y_train)
y_pred = clf_opt.predict(X_test)
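A note on why the global seed helps, plus an alternative sketch (my assumptions, not a guaranteed fix for every scikit-learn version): on some older versions the scipy distributions in param_dist are sampled from the global NumPy state rather than from RandomizedSearchCV's random_state, so drift can creep in there. Using plain lists for the integer-valued parameters and fixing random_state on every component keeps all sampling under explicit seeds without touching the global state:

# Sketch: every source of randomness gets an explicit seed; no global np.random.seed needed.
# Remember to also pass random_state to train_test_split when not seeding globally.
param_dist = {'selectkbest__k': list(range(upLim // 2, upLim + 1)),
              'randomforestclassifier__n_estimators': list(range(5, 150)),
              'randomforestclassifier__max_depth': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, None],
              'randomforestclassifier__criterion': ["gini", "entropy"],
              'randomforestclassifier__max_features': ['sqrt', 'log2']}

pipe = make_pipeline(SelectKBest(), RandomForestClassifier(random_state=22))
clf_opt = RandomizedSearchCV(pipe, param_distributions=param_dist,
                             scoring='roc_auc', n_jobs=1, cv=3, random_state=22)
clf_opt.fit(X_train, y_train)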
I hope it can help others!