I have built a model to predict whether a customer is a business or a private customer. After training, I predict the class of 1000 records that were not used for training and save the predictions to a CSV file. I observe two different behaviours:
When I create the sample with train, sample = train_test_split(train, test_size=1000, random_state=seed), the prediction reaches the same accuracy as during training (the same value as the validation accuracy).
But when I split the data manually before training, by copying 1000 records from the original CSV file into a new sample CSV file that I load for prediction after training, I get a much worse result (e.g. 76% instead of 90%). This doesn't make sense to me: the original data (the CSV file used for training) was also shuffled in advance, so I would expect the same result. Here is the relevant code for both cases:
Splitting
def getPreProcessedDatasetsWithSamples(filepath, batch_size):
    path = filepath
    data = __getPreprocessedDataFromPath(path)
    # Split into train/test, then carve validation and the 1000-row sample out of train.
    train, test = train_test_split(data, test_size=0.2, random_state=42)
    train, val = train_test_split(train, test_size=0.2, random_state=42)
    # `seed` is not defined in this snippet; it is expected to exist at module level.
    train, sample = train_test_split(train, test_size=1000, random_state=seed)
    train_ds = __df_to_dataset(train, shuffle=False, batch_size=batch_size)
    val_ds = __df_to_dataset(val, shuffle=False, batch_size=batch_size)
    test_ds = __df_to_dataset(test, shuffle=False, batch_size=batch_size)
    sample_ds = __df_to_dataset(sample, shuffle=False, batch_size=batch_size)
    return (train_ds, val_ds, test_ds, sample, sample_ds)
Prediction with sample, sample_ds
def savePredictionWithSampleToFileKeras(model, outputName, sample, sample_ds):
    predictions = model.predict(sample_ds)
    loss, accuracy = model.evaluate(sample_ds)
    print("Accuracy of sample", accuracy)
    sample['prediction'] = predictions
    sample.to_csv("./saved_samples/" + outputName + ".csv")
Accuracy of sample: 90%
Prediction by loading csv file
def savePredictionToFileKeras(model, sampleFilePath, outputName, batch_size):
    # The sample file is preprocessed on its own here, independently of the training data.
    sample_ds = preprocessing.getPreProcessedSampleDataSets(sampleFilePath, batch_size)
    sample = preprocessing.getPreProcessedSampleDataFrames(sampleFilePath)
    predictions = model.predict(sample_ds)
    loss, accuracy = model.evaluate(sample_ds)
    print("Accuracy of sample", accuracy)
    sample['prediction'] = predictions
    sample.to_csv("./saved_samples/" + outputName + ".csv")
Accuracy of sample: 77%
EDIT
Observation: When I load the whole dataset as sample data, I get the same accuracy as the validation accuracy, as expected (ca. 90%). But when I merely shuffle the line order of the same file, I get 82%. As I understand it, the accuracy should be the same, since the files contain the same data.
Some additional information: I have changed the implementation from the Sequential to the functional API. I'm using embeddings in the preprocessing (I also tried one-hot encoding, without success).
Finally I found the problem: I am using a Tokenizer to preprocess the NAME and STREET columns, converting each word to a value that indicates how often the word occurs. When I use train_test_split, the words of the whole dataset are used for the conversion, but when I load the sample dataset afterwards, only the words that occur in the sample dataset are used. For instance, the word “family” could be the most frequent word overall but only the third most frequent in the sample dataset, so the encoding would be completely wrong.
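To make the mismatch concrete, here is a minimal sketch in plain Python. It does not use the actual Tokenizer from my code; fit_word_index and encode are hypothetical stand-ins that assign each word its frequency rank, and the name lists are made-up illustration data. The point is only that fitting on the sample alone yields different indices for the same words.

```python
from collections import Counter


def fit_word_index(texts):
    """Hypothetical stand-in for a frequency-based tokenizer:
    map each word to its frequency rank (1 = most frequent)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def encode(texts, word_index):
    """Replace every word by its index; unknown words become 0."""
    return [[word_index.get(word, 0) for word in text.lower().split()]
            for text in texts]


# Made-up data: the full dataset and a small sample taken from it.
all_names = ["family smith", "family jones", "family brown",
             "bakery miller gmbh", "bakery jones"]
sample_names = ["bakery miller gmbh", "bakery jones", "family brown"]

full_index = fit_word_index(all_names)       # fitted on all data
sample_index = fit_word_index(sample_names)  # fitted on the sample only

# "family" is the most frequent word overall, but not in the sample,
# so the same row is encoded differently by the two vocabularies.
print(encode(["family brown"], full_index))
print(encode(["family brown"], sample_index))
```

The model's embedding layer was trained on the indices from the first vocabulary, so feeding it indices from the second one amounts to feeding it different words.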
After using the same tokenizer instance for all data, I get the same high accuracy for all the data.
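When the sample is loaded in a separate run, the fitted vocabulary has to be persisted together with the model. The following is a rough sketch of that idea, not my actual code: it assumes the vocabulary can be represented as a plain word-to-index dictionary, and fit_word_index, save_word_index, load_word_index, the file name and the street values are all hypothetical.

```python
import json
import os
import tempfile
from collections import Counter


def fit_word_index(texts):
    """Hypothetical stand-in for fitting the tokenizer on the training data."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def save_word_index(word_index, path):
    """Store the fitted vocabulary next to the trained model."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(word_index, f)


def load_word_index(path):
    """Restore the vocabulary that was fitted at training time."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Training time: fit once on the training data and persist the result.
training_streets = ["main street 1", "main street 22",
                    "station road 3", "main road 7"]
word_index = fit_word_index(training_streets)
vocab_path = os.path.join(tempfile.gettempdir(), "street_word_index.json")
save_word_index(word_index, vocab_path)

# Prediction time: load the stored vocabulary instead of fitting a new one.
restored = load_word_index(vocab_path)
sample_streets = ["main road 99"]
encoded = [[restored.get(word, 0) for word in street.lower().split()]
           for street in sample_streets]
print(encoded)  # words unseen during training are encoded as 0
```

The same principle applies to whichever tokenizer class is actually in use: fit it once on the training data and reuse that fitted object for every later prediction.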