How training and test data is split - Keras on Tensorflow

I am currently training a neural network on my data using the Keras fit() function:

history = model.fit(X, encoded_Y, batch_size=50, epochs=500, validation_split=0.2, verbose=1)

I have set validation_split to 0.2. My understanding is that 80% of my data will be used for training and 20% for testing (validation). I am confused about how this split is handled internally. Are the first 80% of the samples taken for training and the last 20% for testing, or are the samples picked at random? And if I want to supply separate training and testing data, how do I do that with fit()?

My second question is how to check whether the model fits the data well. The results show a training accuracy of around 90%, while the validation accuracy is around 55%. Does this indicate overfitting or underfitting?

My last question is what evaluate() returns. The documentation says it returns the loss, but I already get the loss and accuracy for each epoch from fit() (in history). What do the score and accuracy returned by evaluate() represent? If evaluate() reports an accuracy of 90%, can I say the model fits my data well, regardless of the individual accuracy and loss values for each epoch?

Below is my code:

import numpy
import pandas
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from keras.utils import np_utils
from sklearn.model_selection import KFold
from sklearn.metrics import confusion_matrix
import itertools

seed = 7
numpy.random.seed(seed)

dataframe = pandas.read_csv("INPUTFILE.csv", skiprows=range(0, 0))

dataset = dataframe.values
X = dataset[:,0:50].astype(float) # number of cols-1
Y = dataset[:,50]

encoder = LabelEncoder()
encoder.fit(Y)
encoded_Y = encoder.transform(Y)

encoded_Y = np_utils.to_categorical(encoded_Y)
print("encoded_Y=", encoded_Y) 
# baseline model
def create_baseline():
    # create model
    model = Sequential()
    # input_dim must match the number of feature columns in X (50 here)
    model.add(Dense(5, input_dim=50, kernel_initializer='normal', activation='relu'))
    model.add(Dense(5, kernel_initializer='normal', activation='relu'))
    #model.add(Dense(2, kernel_initializer='normal', activation='sigmoid'))

    model.add(Dense(2, kernel_initializer='normal', activation='softmax'))

    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])  # for binary classification
    #model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])  # for multi class
    return model


model = create_baseline()
history = model.fit(X, encoded_Y, batch_size=50, epochs=500, validation_split=0.2, verbose=1)

# list all data in history
print(history.history.keys())
# summarize history for accuracy
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()


pre_cls=model.predict_classes(X)    
cm1 = confusion_matrix(encoder.transform(Y),pre_cls)
print('Confusion Matrix : \n')
print(cm1)


# X also contains the training samples, so this is not a score on unseen data
score, acc = model.evaluate(X, encoded_Y)
print('Test score:', score)
print('Test accuracy:', acc)

asked Jun 24 '18 02:06

eshaa

1 Answer

The Keras documentation says: "The validation data is selected from the last samples in the x and y data provided, before shuffling." This means the split happens first, so the last 20% of your samples become the validation set, and only the remaining training data is shuffled. There is also a boolean parameter called shuffle, which defaults to True; if you don't want your training data to be shuffled, set it to False.
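To make the ordering concrete, here is a minimal sketch of that slicing in plain Python rather than Keras. The helper name tail_split is hypothetical; it only mirrors the documented behaviour (validation samples come from the end of the data, before any shuffling) and is not the library's own code.

```python
def tail_split(samples, validation_split):
    """Cut off the last fraction of the samples as validation data.

    Hypothetical helper for illustration only; nothing is shuffled before the cut.
    """
    split_at = int(len(samples) * (1.0 - validation_split))
    return samples[:split_at], samples[split_at:]


# Ten labels sorted by class, as in a CSV file grouped by label.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
train_labels, val_labels = tail_split(labels, validation_split=0.2)

print(train_labels)  # [0, 0, 0, 0, 0, 0, 0, 0]
print(val_labels)    # [1, 1]
```

If your file is sorted by class, the validation tail can end up containing a single class, so it is usually worth shuffling the rows once yourself before calling fit().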
Getting good results on your training data but poor results on your validation data usually means that your model is overfitting. Overfitting happens when the model learns the specifics of the training data so closely that it cannot achieve good results on new data.
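A quick way to put a number on this is to compare the final training and validation accuracy. The sketch below works on a plain dictionary shaped like history.history; the function name accuracy_gap and the sample numbers are made up for illustration, and the key names acc and val_acc follow the code in the question (newer Keras versions use accuracy and val_accuracy).

```python
def accuracy_gap(history, train_key="acc", val_key="val_acc"):
    """Return (final train accuracy, final validation accuracy, gap between them)."""
    train_acc = history[train_key][-1]
    val_acc = history[val_key][-1]
    return train_acc, val_acc, train_acc - val_acc


# Made-up numbers shaped like history.history, not real training output.
example_history = {
    "acc": [0.60, 0.75, 0.85, 0.90],
    "val_acc": [0.55, 0.58, 0.56, 0.55],
}

train_acc, val_acc, gap = accuracy_gap(example_history)
print("train=%.2f val=%.2f gap=%.2f" % (train_acc, val_acc, gap))
```

A large positive gap, like the 90% versus 55% in the question, points to overfitting; low accuracy on both sets would point to underfitting instead.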
evaluate() is meant to test your model on new data that it has never seen before. It returns the loss and the values of the metrics given to compile() (here, accuracy) for the data you pass in. Usually you divide your data into training and test sets, but you may also want to create a third group of data. If you keep adjusting your model to get better and better results on your test data, that is a form of cheating: you are indirectly telling the model what the evaluation data looks like, and this can cause overfitting.
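One way to get three groups is to shuffle the row indices once and cut them into training, validation, and test parts. The sketch below uses only the standard library; three_way_split and its default fractions are hypothetical choices, not part of Keras or scikit-learn.

```python
import random


def three_way_split(n_samples, val_fraction=0.2, test_fraction=0.2, seed=7):
    """Return shuffled, non-overlapping index lists for train, validation, and test."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    n_test = int(n_samples * test_fraction)
    n_val = int(n_samples * val_fraction)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 60 20 20
```

With NumPy arrays, these indices select the rows directly (X[train_idx], encoded_Y[train_idx], and so on). The validation part can then be passed explicitly with model.fit(X[train_idx], encoded_Y[train_idx], validation_data=(X[val_idx], encoded_Y[val_idx])), and the test part is kept for a single final model.evaluate(X[test_idx], encoded_Y[test_idx]).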

Also, if you want to split your data without using Keras, I recommend scikit-learn's train_test_split() function.

It is easy to use and looks like this:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

answered Oct 17 '22 07:10

sebrojas


Tags: validation, machine-learning, neural-network, tensorflow, keras
