I'm working on the mushroom classification data set (found here: https://www.kaggle.com/uciml/mushroom-classification).
I'm trying to split my data into training and testing sets for my models. However, if I use the train_test_split method, my models always achieve 100% accuracy. This is not the case when I split my data manually.
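import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

x = data.copy()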
y = x['class']
del x['class']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
model = xgb.XGBClassifier()
model.fit(x_train, y_train)
predictions = model.predict(x_test)
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))
This produces:
[[1299    0]
 [   0 1382]]
1.0
If I split the data manually I get a more reasonable result.
x = data.copy()
y = x['class']
del x['class']
x_train = x[0:5443]
x_test = x[5444:]
y_train = y[0:5443]
y_test = y[5444:]
model = xgb.XGBClassifier()
model.fit(x_train, y_train)
predictions = model.predict(x_test)
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))
Result:
[[2007    0]
 [ 336  337]]
0.8746268656716418
What could be causing this behaviour?
Edit: As requested, I'm including the shapes of the slices.
train_test_split:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
Result:
(5443, 64)
(5443,)
(2681, 64)
(2681,)
Manual split:
x_train = x[0:5443]
x_test = x[5444:]
y_train = y[0:5443]
y_test = y[5444:]
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
Result:
(5443, 64)
(5443,)
(2680, 64)
(2680,)
I've also tried defining my own split function, and that split also gives 100% classifier accuracy.
Here's the code for the split:
def split_data(dataFrame, testRatio):
    dataCopy = dataFrame.copy()
    testCount = int(len(dataFrame) * testRatio)
    dataCopy = dataCopy.sample(frac=1)
    y = dataCopy['class']
    del dataCopy['class']
    return dataCopy[testCount:], dataCopy[0:testCount], y[testCount:], y[0:testCount]
You get 100% accuracy on both the train and test sets. But when you deploy your model, it may not perform as well, because the data it sees may not be as clean as the training and test data. 100% test accuracy isn't bad, but it is not a final performance metric.
Setting the stratify argument to the "y" array tells train_test_split() to keep, in both the train and test sets, the same proportion of examples in each class as in the provided "y" array. For example, for a classification dataset with 94 examples in one class and six in the other, a stratified split keeps that 94:6 ratio in both sets.
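Here is a minimal sketch of that 94/6 case. The arrays are made up for illustration (they are not the mushroom data), and test_size=0.5 and random_state=1 are arbitrary choices:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced dataset: 94 examples of class 0, 6 of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 94 + [1] * 6)

# stratify=y asks for the 94:6 class ratio to be preserved in both halves.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=1
)

train_counts = np.bincount(y_train)
test_counts = np.bincount(y_test)
print(train_counts)  # [47  3]
print(test_counts)   # [47  3]
```

Without stratify, the six minority examples can land unevenly, so the two sets may not share the same class ratio.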
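test_size: this parameter specifies the size of the test set, either as a fraction between 0.0 and 1.0 or as an absolute number of samples. It defaults to None, in which case it is set to the complement of train_size; if train_size is also left at its default, test_size is set to 0.25.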
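A quick way to check the sizes is to split a dummy array with the same number of rows as in the question (5443 + 2681 = 8124). The array contents here are placeholders; only the lengths matter:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with as many rows as the question's data set (8124).
X = np.zeros((8124, 1))
y = np.zeros(8124, dtype=int)

# Neither test_size nor train_size given: the test set gets 25% of the rows.
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(X, y, random_state=0)
print(len(X_train_d), len(X_test_d))  # 6093 2031

# test_size=0.33 as in the question: the test count is rounded up.
X_train_q, X_test_q, y_train_q, y_test_q = train_test_split(
    X, y, test_size=0.33, random_state=0
)
print(len(X_train_q), len(X_test_q))  # 5443 2681
```

The second split reproduces the (5443, 2681) shapes reported in the question.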
You got lucky there with train_test_split. Your manual split is not shuffled, so the test set may contain mostly data unlike anything in the training set; that makes it a harsher validation than train_test_split, which shuffles the data internally before splitting.
For better validation, use K-fold cross-validation, which lets you check the model's accuracy using each part of your data in turn as the test set and the rest as the training set.
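A rough sketch of K-fold cross-validation with scikit-learn follows. The data comes from make_classification and the model is LogisticRegression, both stand-ins chosen so the snippet runs on its own; swap in your own x, y and XGBClassifier. The choice of 5 folds is also just an assumption:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical stand-in data; replace with your own features and labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# shuffle=True matters when the rows are stored in a meaningful order.
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Placeholder model; any scikit-learn compatible classifier works here.
model = LogisticRegression(max_iter=1000)

# One accuracy score per fold: each fold is used once as the test set.
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores)
print(scores.mean(), scores.std())
```

A large spread between the fold scores is a hint that some parts of the data look different from the rest.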
Your manual train/test split does not shuffle the data, but the scikit-learn function shuffles by default. The split shapes are the same, but the data in them is different.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
Code:
import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(18).reshape((9, 2)), range(9)
print(X)
print(list(y))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)
print("\nTraining with shuffle:")
print(X_train)
print(y_train)
print("\nTesting with shuffle:")
print(X_test)
print(y_test)
print("\nWithout Shuffle:")
tmp = train_test_split(X, y, test_size=0.33, shuffle=False)
print(tmp[0])
print(tmp[2])
print()
print(tmp[1])
print(tmp[3])
Output:
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]
 [12 13]
 [14 15]
 [16 17]]
[0, 1, 2, 3, 4, 5, 6, 7, 8]

Training with shuffle:
[[ 0  1]
 [16 17]
 [ 4  5]
 [ 8  9]
 [ 6  7]
 [12 13]]
[0, 8, 2, 4, 3, 6]

Testing with shuffle:
[[14 15]
 [ 2  3]
 [10 11]]
[7, 1, 5]

Without Shuffle:
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]]
[0, 1, 2, 3, 4, 5]

[[12 13]
 [14 15]
 [16 17]]
[6, 7, 8]
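To see why the order of the rows can change the accuracy this much, here is a deliberately extreme sketch. The data is invented for illustration (it is not the mushroom data): a single feature with three groups, stored sorted by group, where the last group never appears in an unshuffled training slice. DecisionTreeClassifier is used only because it is small and deterministic here:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical sorted data: groups 0, 1, 2 stored one after another.
group = np.repeat([0, 1, 2], 100)
X = group.reshape(-1, 1)
y = (group == 1).astype(int)  # only group 1 belongs to class 1

def fit_and_score(shuffle):
    # Same sizes in both cases; only the shuffling differs.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=100, shuffle=shuffle,
        random_state=0 if shuffle else None,
    )
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    return accuracy_score(y_te, model.predict(X_te))

ordered_acc = fit_and_score(shuffle=False)   # test set is all of group 2
shuffled_acc = fit_and_score(shuffle=True)   # every group seen in training
print(ordered_acc, shuffled_acc)  # 0.0 1.0
```

In the unshuffled case the model has never seen group 2, so it gets every test row wrong, while the shuffled split looks perfect. Real data is rarely this clean-cut, but a data set stored in some meaningful order can show the same effect to a lesser degree.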