Why does cross validation RF classification perform worse than without cross validation?

I am puzzled why a Random Forest classification model without cross-validation yields a mean accuracy score of .966, but with 5-fold cross-validation the model's mean accuracy score is .687.

There are 275,956 samples: class 0 = 217,891, class 1 = 6,073, class 2 = 51,992.

I am trying to predict the "TARGET" column, which has three classes [0, 1, 2]:

data.head()
bottom_temperature  bottom_humidity  top_temperature  top_humidity  external_temperature  external_humidity  weight  TARGET
             26.35            42.94            27.15         40.43                 27.19                0.0     0.0       1
             36.39            82.40            33.39         49.08                 29.06                0.0     0.0       1
             36.32            73.74            33.84         42.41                 21.25                0.0     0.0       1

Following the docs, the data is split into training and test sets:

# link to docs http://scikit-learn.org/stable/modules/cross_validation.html
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Create a list of the feature column's names
features = data.columns[:7]

# View features
features
Out[]: Index([u'bottom_temperature', u'bottom_humidity', u'top_temperature',
       u'top_humidity', u'external_temperature', u'external_humidity',
       u'weight'],
      dtype='object')


#split data
X_train, X_test, y_train, y_test = train_test_split(data[features], data.TARGET, test_size=0.4, random_state=0)

#build model
clf = RandomForestClassifier(n_jobs=2, random_state=0)
clf.fit(X_train, y_train)

#predict
preds = clf.predict(X_test)

#accuracy of predictions
accuracy = accuracy_score(y_test, preds)
print('Mean accuracy score:', accuracy)

('Mean accuracy score:', 0.96607267423425713)

#verify - it's the same
clf.score(X_test, y_test)
0.96607267423425713

On to the cross-validation:

from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, data[features], data.TARGET, cv=5)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy: 0.69 (+/- 0.07)

It is much lower!
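For what it's worth, printing the individual fold scores (the scores array returned by cross_val_score above) shows whether the drop is uniform or concentrated in a few bad folds:

#inspect the per-fold scores; a large spread between folds hints
#that the folds are not comparable samples of the data
for i, s in enumerate(scores):
    print('Fold %d accuracy: %.3f' % (i, s))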

And to verify a second way:

#predict with CV
# http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html#sklearn.model_selection.cross_val_predict
from sklearn.model_selection import cross_val_predict
from sklearn import metrics

predicted = cross_val_predict(clf, data[features], data.TARGET, cv=5)
metrics.accuracy_score(data.TARGET, predicted)

Out[]: 0.68741031178883594
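Because the classes are heavily imbalanced (class 1 is roughly 2% of the rows), a confusion matrix on the cross-validated predictions is more informative than accuracy alone; reusing predicted and metrics from above:

#rows are true classes, columns are predicted classes
print(metrics.confusion_matrix(data.TARGET, predicted))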

From my understanding, cross-validation should not decrease the accuracy of predictions by this amount; if anything, it should give a more reliable estimate, because the model is evaluated against a "better" representation of all the data.

asked by Evan

2 Answers

Normally I would agree with Vivek and tell you to trust your cross-validation.

However, a degree of validation is already built into a random forest, because each tree is grown from a bootstrap sample and can be checked against the rows it never saw (the out-of-bag estimate), so you shouldn't expect such a large drop in accuracy when running cross-validation. I suspect your problem is due to some sort of time- or location-dependence in how your data are sorted.
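As a minimal sketch of that built-in estimate, reusing X_train and y_train from the question (oob_score=True is the only change to the classifier):

from sklearn.ensemble import RandomForestClassifier

#each tree is grown on a bootstrap sample; the rows a tree did not
#see give a free validation estimate, reported as oob_score_
clf_oob = RandomForestClassifier(n_jobs=2, random_state=0, oob_score=True)
clf_oob.fit(X_train, y_train)
print(clf_oob.oob_score_)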

When you use train_test_split, rows are drawn randomly from the dataset, so all 80 of your environments are likely to be represented in both the train and test sets. However, when you split using the default options for CV, I believe each fold is drawn in order, so (assuming your data is sorted by environment) not every environment is present in every fold; a quick check is sketched below. This leads to lower accuracy, because you are predicting one environment using data from another.
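You can check this directly by looking at the class mix of each unshuffled fold, reusing data and features from the question:

from sklearn.model_selection import KFold

#with shuffle=False (the default) folds are taken in row order, so
#the class proportions can differ sharply from fold to fold
for train_idx, test_idx in KFold(n_splits=5).split(data[features]):
    print(data.TARGET.iloc[test_idx].value_counts(normalize=True))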

The simple solution is to pass a shuffled, stratified splitter: cv=StratifiedKFold(n_splits=5, shuffle=True), imported from sklearn.model_selection.
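A minimal sketch of that fix, reusing clf, data and features from the question (random_state=0 is an arbitrary choice, added only for reproducibility):

from sklearn.model_selection import StratifiedKFold, cross_val_score

#shuffle before splitting so every environment can appear in every
#fold; stratify so each fold keeps the overall 0/1/2 class ratio
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(clf, data[features], data.TARGET, cv=cv)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))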

I have run into this problem several times before when using concatenated datasets, and there must be many others who have hit it without realising what the issue is. From what I have seen in GitHub discussions, the rationale for the default behaviour is to preserve order in time-series data.

answered by Stev


In the train_test_split you are using 60% of the data for training (test_size=0.4), a single time. But in cross_val_score the data is split into 80% train (cv=5), five times: each time four folds become the training set and the remaining fold becomes the test set.

You might now think that 80% training data is more than 60%, so accuracy should not decrease. But there is one more thing to notice here.

train_test_split does not stratify the splits by default, but cross_val_score does (it uses StratifiedKFold for classifiers). Stratification keeps the ratio of classes (targets) the same in each fold. So most probably the ratio of targets is not maintained in the train_test_split, which leads to over-fitting of the classifier and hence the high score.

I would suggest taking the cross_val_score result as the final score. If you want the single hold-out split to be a fairer comparison, stratify it as well, as in the sketch below.
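A minimal sketch of such a stratified hold-out split, reusing the names from the question; the printed proportions should both come out near 0.79 / 0.02 / 0.19 for classes 0/1/2, matching the counts given at the top:

from sklearn.model_selection import train_test_split

#stratify=data.TARGET keeps the class ratio identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    data[features], data.TARGET,
    test_size=0.4, random_state=0, stratify=data.TARGET)

#verify the class proportions match
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))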

answered by Vivek Kumar