Different accuracy when splitting data with train_test_split than loading csv file afterwards

Question

I have built a model to predict whether a customer is a business or a private customer. After training the model, I predict the class of 1000 records that I didn't use for training. The predictions are saved in a csv file. Now I see two different behaviours:

Splitting sample data in the program

When I create the sample with train, sample = train_test_split(train, test_size=1000, random_state=seed), the prediction reaches the same accuracy as during training (the same value as the validation accuracy).

Splitting sample data in advance and then loading it

But when I split the data manually before training, by taking 1000 records from the original csv file and copying them into a new sample csv file that I load for the prediction after training, I get a much worse result (e.g. 76% instead of 90%). This behaviour doesn't make sense to me: the original data (the csv file used for training) was also shuffled in advance, so I should get the same result. Here is the relevant code for both cases:

1. Splitting sample data in the program

Splitting

from sklearn.model_selection import train_test_split

def getPreProcessedDatasetsWithSamples(filepath, batch_size):
    # Preprocess the complete csv file once, before any splitting
    data = __getPreprocessedDataFromPath(filepath)

    # 80/20 train/test split, then 80/20 train/validation split
    train, test = train_test_split(data, test_size=0.2, random_state=42)
    train, val = train_test_split(train, test_size=0.2, random_state=42)
    # Hold out 1000 rows as the sample (seed is assumed to be defined elsewhere)
    train, sample = train_test_split(train, test_size=1000, random_state=seed)

    train_ds = __df_to_dataset(train, shuffle=False, batch_size=batch_size)
    val_ds = __df_to_dataset(val, shuffle=False, batch_size=batch_size)
    test_ds = __df_to_dataset(test, shuffle=False, batch_size=batch_size)
    sample_ds = __df_to_dataset(sample, shuffle=False, batch_size=batch_size)

    return (train_ds, val_ds, test_ds, sample, sample_ds)

Prediction with sample, sample_ds

def savePredictionWithSampleToFileKeras(model, outputName, sample, sample_ds):
    # Predict and evaluate on the held-out sample dataset
    predictions = model.predict(sample_ds)
    loss, accuracy = model.evaluate(sample_ds)

    print("Accuracy of sample", accuracy)

    # Store the predictions next to the sample rows and write them to csv
    sample['prediction'] = predictions
    sample.to_csv("./saved_samples/" + outputName + ".csv")

Accuracy of sample: 90%

2. Splitting sample data in advance and then loading it

Prediction by loading csv file

def savePredictionToFileKeras(model, sampleFilePath, outputName, batch_size):
    # Load and preprocess the separately stored sample csv file
    sample_ds = preprocessing.getPreProcessedSampleDataSets(sampleFilePath, batch_size)
    sample = preprocessing.getPreProcessedSampleDataFrames(sampleFilePath)

    predictions = model.predict(sample_ds)
    loss, accuracy = model.evaluate(sample_ds)

    print("Accuracy of sample", accuracy)

    # Store the predictions next to the sample rows and write them to csv
    sample['prediction'] = predictions
    sample.to_csv("./saved_samples/" + outputName + ".csv")

Accuracy of sample: 77%

EDIT

Observation: When I load the whole data set as sample data, I get the same value as the validation accuracy, as expected (ca. 90%), but when I just randomize the line order of the same file, I get 82%. As I understand it, the accuracy should be the same, since the files contain the same data.

Some additional information: I have changed the implementation from the Sequential to the functional API. I'm using embeddings in the pre-processing (I also tried one-hot encoding, without success).

Ling · Accepted Answer

Finally I found the problem: I am using a Tokenizer to preprocess the NAME and STREET columns, converting each word to a value that reflects how often the word occurs. When I use train_test_split, the word index is built from the words of all the data, but when I load the sample dataset afterwards, it is built only from the words that occur in the sample dataset. For instance, the word "family" could be the most used word overall but only the third most used in the sample dataset, so the encoding would be completely wrong. After using the same tokenizer instance for all data, I get the same high accuracy for all the data.
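To illustrate the mismatch, here is a minimal sketch in plain Python. It does not use the Keras Tokenizer itself; fit_vocab and encode are hypothetical stand-ins that rank words by frequency, and the NAME values are made up. It only shows that a vocabulary fitted on the sample alone assigns different indices than one fitted on all the data.

```python
import json
from collections import Counter


def fit_vocab(texts):
    """Map each word to its frequency rank (1 = most frequent).

    Simplified stand-in for a frequency-based tokenizer; not the Keras API.
    """
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}


def encode(texts, vocab, oov_index=0):
    """Replace each word by its index in vocab (oov_index if unknown)."""
    return [[vocab.get(w, oov_index) for w in text.lower().split()]
            for text in texts]


# Made-up NAME values: the full data and a sample taken from it.
all_names = [
    "family miller",
    "family smith",
    "family jones",
    "bakery smith",
    "bakery jones gmbh",
]
sample_names = ["bakery jones gmbh", "bakery smith", "family smith"]

shared_vocab = fit_vocab(all_names)     # fitted once, on all the data
sample_vocab = fit_vocab(sample_names)  # refitted on the sample only

# "family" is the most frequent word overall but only third in the sample.
print(shared_vocab["family"], sample_vocab["family"])

consistent = encode(sample_names, shared_vocab)  # what the model was trained on
mismatched = encode(sample_names, sample_vocab)  # same rows, different indices
print(consistent)
print(mismatched)

# Persisting the fitted vocabulary lets a later script reuse it unchanged.
restored_vocab = json.loads(json.dumps(shared_vocab))
```

The same idea applies to the real preprocessing: fit the tokenizer once on the training data, keep that instance (or a saved copy of it), and use it to encode any csv file that is loaded later.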

Tags: python, machine-learning, tensorflow, classification, keras
