100% classifier accuracy after using train_test_split

I'm working on the mushroom classification data set (found here: https://www.kaggle.com/uciml/mushroom-classification).

I'm trying to split my data into training and testing sets for my models. However, if I use the train_test_split method, my models always achieve 100% accuracy. This is not the case when I split my data manually.

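import xgboost as xgb
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# `data` is the already-encoded mushroom DataFrame, including the 'class' column
x = data.copy()
y = x['class']
del x['class']

# train_test_split shuffles the rows by default before splitting
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

model = xgb.XGBClassifier()
model.fit(x_train, y_train)
predictions = model.predict(x_test)

print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))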

This produces:

[[1299    0]
 [   0 1382]]
1.0

If I split the data manually I get a more reasonable result.

x = data.copy()
y = x['class']
del x['class']

x_train = x[0:5443]
x_test = x[5444:]
y_train = y[0:5443]
y_test = y[5444:]

model = xgb.XGBClassifier()
model.fit(x_train, y_train)
predictions = model.predict(x_test)

print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))

Result:

[[2007    0]
 [ 336  337]]
0.8746268656716418

What could be causing this behaviour?

Edit: As requested, here are the shapes of the resulting slices.

train_test_split:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

Result:

(5443, 64)
(5443,)
(2681, 64)
(2681,)

Manual split:

x_train = x[0:5443]
x_test = x[5444:]
y_train = y[0:5443]
y_test = y[5444:]

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

Result:

(5443, 64)
(5443,)
(2680, 64)
(2680,)
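
A side note on the one-row difference between the two test sets (2681 vs. 2680): the manual slices stop the training set before index 5443 and start the test set at index 5444, so row 5443 ends up in neither. The sketch below reproduces this with a plain list. The total of 8124 rows is inferred from the train_test_split shapes above (5443 + 2681), and `rows`, `test_gap` and `test_full` are names made up for illustration.

```python
# Row count inferred from the train_test_split shapes: 5443 + 2681
n_rows = 8124
rows = list(range(n_rows))

train = rows[0:5443]      # indices 0..5442
test_gap = rows[5444:]    # starts at 5444, so index 5443 is skipped
test_full = rows[5443:]   # starts exactly where the training slice ends

print(len(train), len(test_gap), len(test_full))  # 5443 2680 2681
```

This gap explains the differing shapes only. A single dropped row does not account for the difference in accuracy.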

I've also tried defining my own split function, and that split results in 100% classifier accuracy as well.

Here's the code for the split:

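def split_data(dataFrame, testRatio):
    # Returns x_train, x_test, y_train, y_test
    dataCopy = dataFrame.copy()
    testCount = int(len(dataFrame) * testRatio)
    # sample(frac=1) returns all rows in random order, i.e. it shuffles them
    dataCopy = dataCopy.sample(frac=1)
    y = dataCopy['class']
    del dataCopy['class']
    # The first testCount shuffled rows form the test set, the rest the training set
    return dataCopy[testCount:], dataCopy[0:testCount], y[testCount:], y[0:testCount]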
asked Jan 27 '20 19:01 by Don Andre


2 Answers

You got an easy split from train_test_split. It shuffles the data internally before splitting, whereas your manual split holds out the last rows, which probably contain more data unlike anything in the training set, so it is a tougher validation.

For a more reliable estimate, use K-fold cross-validation, which uses each part of your data in turn as the test set and the remaining parts as the training set.
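
A minimal sketch of what that could look like with scikit-learn. The synthetic data from make_classification and the DecisionTreeClassifier are stand-ins chosen to keep the example self-contained (both are assumptions, not part of the question); in practice you would pass the question's x, y and XGBClassifier instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; substitute the real x and y)
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# shuffle=True matters when the rows are stored in a meaningful order
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = DecisionTreeClassifier(random_state=0)

# One accuracy score per fold; each fold serves as the test set exactly once
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(scores)
print(scores.mean())
```

If the per-fold scores vary a lot, a single train/test split is probably not giving a trustworthy number.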

answered Oct 21 '22 03:10 by Desmond


Your manual train/test split does not shuffle the data, but scikit-learn's train_test_split shuffles by default (shuffle=True). The split shapes are nearly the same, but the rows that end up in each set are different.

https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

Code:

import numpy as np
from sklearn.model_selection import train_test_split
X, y = np.arange(18).reshape((9, 2)), range(9)
print(X)
print(list(y))
X_train, X_test, y_train, y_test = train_test_split(
     X, y, test_size=0.33, random_state=42)

print("\nTraining with shuffle:")
print(X_train)
print(y_train)


print("\nTesting with shuffle:")
print(X_test)
print(y_test)


print("\nWithout Shuffle:")
tmp = train_test_split(X, y, test_size=0.33, shuffle=False)
print(tmp[0])
print(tmp[2])
print()
print(tmp[1])
print(tmp[3])

Output:

[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]
 [12 13]
 [14 15]
 [16 17]]
[0, 1, 2, 3, 4, 5, 6, 7, 8]

Training with shuffle:
[[ 0  1]
 [16 17]
 [ 4  5]
 [ 8  9]
 [ 6  7]
 [12 13]]
[0, 8, 2, 4, 3, 6]

Testing with shuffle:
[[14 15]
 [ 2  3]
 [10 11]]
[7, 1, 5]

Without Shuffle:
[[ 0  1]
 [ 2  3]
 [ 4  5]
 [ 6  7]
 [ 8  9]
 [10 11]]
[0, 1, 2, 3, 4, 5]

[[12 13]
 [14 15]
 [16 17]]
[6, 7, 8]
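
Why the order can matter so much: if the rows are stored in some meaningful order, an unshuffled test set is just the tail of the file and may look nothing like the training rows. The toy labels below are made up for illustration (sorted so all 0s come before all 1s) and are not the mushroom data; whether the real file is ordered like this is an assumption you would need to check.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels, sorted so that every 0 comes before every 1
y = np.array([0] * 60 + [1] * 40)
X = np.arange(len(y)).reshape(-1, 1)

# Ordered split: the test set is simply the last rows
_, _, y_train_ordered, y_test_ordered = train_test_split(
    X, y, test_size=0.33, shuffle=False)

# Shuffled split, stratified so both sets keep the overall class ratio
_, _, y_train_shuffled, y_test_shuffled = train_test_split(
    X, y, test_size=0.33, random_state=0, stratify=y)

# Class counts [n_class_0, n_class_1] in each test set
print("ordered test set: ", np.bincount(y_test_ordered, minlength=2))
print("shuffled test set:", np.bincount(y_test_shuffled, minlength=2))
```

Here the ordered test set contains only class 1, while the shuffled one contains both classes. Running the same class count on your manual y_train and y_test would show whether something similar is happening with your data.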
answered Oct 21 '22 02:10 by B200011011